Needle

Abstract

General Context: The rapid growth of sequencing data in recent years has produced a volume that current algorithms and data structures cannot manage. For instance, it is currently not possible to find all experiments in the Sequence Read Archive (SRA) that contain certain genes with a specified expression profile. If the data cannot be searched in a meaningful way, the information is effectively lost despite the huge efforts made to store these experiments.

Tool Description: Needle is a tool that stores sequencing experiments in an index so that large sequencing data sets can be quantified approximately.

Needle is based on the Interleaved Bloom Filter (IBF) and its basic idea is to store multiple IBFs for different expression levels.
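The idea of one IBF per expression level can be illustrated with a small, self-contained Python sketch. It is a toy model and not Needle's implementation: the class `ToyIBF`, the functions `build_index` and `estimate`, the thresholds, the use of plain k-mers, and the `min_fraction` cutoff are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class ToyIBF:
    """Toy interleaved Bloom filter (hypothetical, for illustration only).

    It holds `bins` Bloom filters of `size` slots each, laid out so that the
    bits of all bins for one hash slot sit next to each other."""

    def __init__(self, bins, size=1024, hashes=2):
        self.bins, self.size, self.hashes = bins, size, hashes
        self.bits = bytearray(bins * size)  # one byte per bit, for clarity

    def _slots(self, kmer):
        # Derive `hashes` slot positions from differently salted hashes.
        for seed in range(self.hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def insert(self, kmer, bin_id):
        for slot in self._slots(kmer):
            self.bits[slot * self.bins + bin_id] = 1

    def query(self, kmer):
        """Return one boolean per bin: may this bin contain the k-mer?"""
        hits = [True] * self.bins
        for slot in self._slots(kmer):
            row = self.bits[slot * self.bins:(slot + 1) * self.bins]
            hits = [h and bool(b) for h, b in zip(hits, row)]
        return hits


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(experiments, thresholds, k):
    """Build one ToyIBF per threshold (assumed scheme): each k-mer goes into
    the IBF of the largest threshold not exceeding its count in an experiment."""
    levels = sorted(thresholds)
    index = {t: ToyIBF(len(experiments)) for t in levels}
    for bin_id, reads in enumerate(experiments):
        counts = Counter(km for read in reads for km in kmers(read, k))
        for km, count in counts.items():
            fitting = [t for t in levels if t <= count]
            if fitting:
                index[fitting[-1]].insert(km, bin_id)
    return index


def estimate(index, transcript, k, min_fraction=0.5):
    """Walk the levels from high to low and report, per experiment, the first
    threshold at which at least `min_fraction` of the query k-mers were found
    in this or a higher level (0 if that never happens)."""
    query = kmers(transcript, k)
    bins = next(iter(index.values())).bins
    found = [0] * bins
    result = [0] * bins
    for t in sorted(index, reverse=True):
        for km in query:
            for b, hit in enumerate(index[t].query(km)):
                found[b] += hit
        for b in range(bins):
            if result[b] == 0 and found[b] >= min_fraction * len(query):
                result[b] = t
    return result


# Two toy experiments: the same read sequenced eight times and once.
experiments = [["ACGTACGTAC"] * 8, ["ACGTACGTAC"]]
index = build_index(experiments, thresholds=[1, 4, 8], k=4)
print(estimate(index, "ACGTACGTAC", k=4))
```

In this sketch each bin corresponds to one experiment, so a single lookup per level answers the query for all experiments at once, which is what makes the interleaved layout attractive for large collections.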

Please Cite

  • Mitra Darvish, Enrico Seiler, Svenja Mehringer, René Rahn, Knut Reinert, Yann Ponty, “Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments”, Bioinformatics, vol. 38, iss. 17, pp. 4100–4108, 2022-07-08.
    @article{fu_mi_publications2845,
     abstract = {Motivation
    
    The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.
    
    Results
    
    As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in {\ensuremath{<}}2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.
    
    Availability and implementation
    
    https://github.com/seqan/needle.},
     author = {Mitra Darvish and Enrico Seiler and Svenja Mehringer and Ren{\'e} Rahn and Knut Reinert and Yann Ponty},
     journal = {Bioinformatics},
     month = {July},
     number = {17},
     pages = {4100--4108},
     title = {Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments},
     url = {http://publications.imp.fu-berlin.de/2845/},
     volume = {38},
     year = {2022}
    }

Contact

For questions, comments, or suggestions please contact:

Mitra Darvish mitra.darvish@fu-berlin.de