SLIMM – Species Level Identification of Microbes from Metagenomes
SLIMM is a taxonomic profiling tool that investigates which microorganisms are present in a sequenced sample. SLIMM uses coverage landscape of reference genomes by sequencing reads to remove unlikely genomes from the analysis and subsequently gain more uniquely-mapped reads to assign at lower ranks of a taxonomic tree. This approach enables SLIMM to be sensitive in predicting members of a microbial community while maintaining a low false positive rate. It has been also shown that SLIMM predicts the relative abundances of individual members with lower deviation from the actual relative abundances compared to other methods. SLIMM requires a BAM/SAM alignment file as an input. One can use a read mapper of choice to map raw sequencing reads to obtain the BAM/SAM alignment file required as input for SLIMM.
More details on how to use SLIMM can be found at the SLIMM WIKI
Pre-built executables for Linux and Mac are made available at the RELEASE PAGE .
You may download already INDEXED REFERENCE GENOMES for Bacteria and Archaea groups. (Indexed reference genomes of Bacteria and Archaea groups are provided for Yara and Bowtie2 read mappers.) If you want to use your own custom database follow the instructions provided here.
You can build SLIMM from its source. Instruction on how to build from source can be found at the SLIMM WIKI
If you use SLIMM in your work-flows, don’t forget to cite us.
Dadi TH, Renard BY, Wieler LH, Semmler T, Reinert K. (2017) SLIMM: species level identification of microorganisms from metagenomes. PeerJ 5:e3138 https://doi.org/10.7717/peerj.3138
String Similarity Search/Join
String Similarity Search/Join
We present in this paper scalable algorithms for optimal string similarity search and join. Our methods are variations of those applied in Masai, our recently published tool for mapping high-throughput DNA sequencing data with unpreceded speed and accuracy. The key features of our approach are filtration with approximate seeds and methods for multiple backtracking. Approximate seeds, compared to exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds. Combined together, these two methods significantly speed up string similarity search and join operations.
Siragusa, E., Weese D., & Reinert, K. (2013). Scalable String Similarity Search/Join with Approximate Seeds and Multiple Backtracking. EDBT/ICDT ’13, March 18 – 22 2013, Genoa, Italy
Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast but heuristic, and hence often miss significant matches. We provide here the local pairwise aligner STELLAR that has full sensitivity, i.e. guarantees to report all matches of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. For filtering it applies the SWIFT algorithm, for which we have developed a new, exact verification strategy. STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments in the edit or hamming distance model.
Kehr, B., Weese, D., Reinert, K. (2011). STELLAR: fast and exact local alignments. BMC Bioinformatics, 12(Suppl 9):S15, 2011.
SeqAn :: T-Coffee
SeqAn::T-Coffee is an open source multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.
Rausch, T., Emde, A.-K., Weese, D., Doring, A., Notredame, C., and Reinert, K. (2008). Segment-based multiple sequence alignment. Bioinformatics, 24(16), i187–192.
Yara – Yet another read aligner
Yara is an exact tool for aligning DNA sequencing reads to reference genomes.
- Exhaustive enumeration of sub-optimal end-to-end alignments under the edit distance.
- Excellent speed, memory footprint and accuracy.
- Accurate mapping quality computation.
- Support for reference genomes consisiting of million of contigs.
- Direct output in SAM/BAM format.
Supported data: Yara has been tested on DNA reads (i.e., Whole Genome, Exome, ChIP-seq, MeDIP-seq) produced by the following sequencing platforms:
- Illumina GA II, HiSeq and MiSeq (single-end and paired-end).
- Life Technologies Ion Torrent Proton and PGM.
Quality trimming is necessary for Ion Torrent reads and recommended for Illumina reads.
- RNA-seq reads spanning splicing sites.
- Long noisy reads (e.g., Pacific Biosciences RSII, Oxford Nanopore MinION).
- Download binaries
- View source code and REAMDE on GitHub (It is strongly recommended to compile Yara from sources)
- Yara is the follow-up of the Masai project. Use of Masai is discouraged. Nonetheless, old Masai binaries can still be downloaded here.
- Siragusa, E., (2015). Approximate string matching for high-throughput sequencing. Free University of Berlin, 2015.
- Siragusa, E., Weese D., and Reinert, K. (2013). Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Research, 2013, 1–8.
Motivation: During the last years NGS sequencing has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, with high sensitivity, and are able to deal with long reads and a large absolute number of indels.
Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3 which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of the Myers’ bit-vector algorithm for verification, memory saving measures and support for the SAM output format. This leads to a much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior to them in terms of sensitivity while exhibiting practical and often competetive run times. In addition, RazerS 3 works without a precomputed index.
- import of FASTA/FASTQ read and genome files
- 5 output formats (including SAM)
- reads can be of arbitrary length
- supports Hamming and edit distance read mapping with configurable error rates
- supports paired-end read mapping
- configurable and predictable sensitivity (runtime/sensitivity tradeoff)
- key improvements (compared to RazerS):
- multicore parallelization
- additional pigeonhole filter optimized for low error-rates with controllable sensitivity
- banded Myers’ algorithm for verification
- full sensitivity under the definition given in Rabema
- SAM output
Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows.
- Download the binaries
- View the source code and README on GitHub
- The previous version of RazerS can be found here
- Check out our newer, faster read aligner Yara
- Weese, D., Holtgrewe M., & Reinert, K. (2012). RazerS 3: Faster, fully sensitive read mapping. Bioinformatics, 28(20), 2592–2599.
- Weese, D., Emde, A.-K., Rausch, T., Döring, A., & Reinert, K. (2009). RazerS – Fast read mapping with sensitivity control. Genome Research, 19(9), 1646–1654.
We present a read simulator software for Illumina, 454 and Sanger reads. Its features include position specific error rates and base quality values. For Illumina reads, we give a comprehensive analysis with empirical data for the error and quality model. For the other technologies, we use models from the literature. It has been written with performance in mind and can sample reads from large genomes. The C++ source code is extensible, and freely available under the GPL/LGPL.
Holtgrewe, M. (2010). Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.
Journal String Tree (JST)
Motivation: Next generation sequencing (NGS) has revolutionized biomedical research in the last decade and led to a continues stream of developments in bioinformatics addressing the need for fast and space efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences which stem from closely related species or are indeed individuals of the same species. Hence the analyzed sequences are very similar. For analyses where local changes in the examined sequence induce only local changes in the results it is obviously desirable to examine identical or similar regions not repeatedly.
Results: In this work we provide a datatype which exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments we show that algorithms which otherwise would scan each reference sequentially can be speeded up by a factor of 115.
- Journaled String Tree data structure and traverser.
- Generic Journaled String Tree finder.
- Online-search functors: Naive, Horspool, Shift-And, Shift-Or, Myers’ Bitvector.
- GDF converter to convert vcf files into our Genome Delta Format.
Availability: The data structure and associated tools are publicly available (see LINKS) and are part of SeqAn, the C++ template library for sequence analysis. The current stable version is based on SeqAn 1.4.2 and is going to be ported to SeqAn 2.0.0 in the near future.
- You can find the source code on GitHub for the Journaled String Tree on here
- Take a look at the JST finder interface here.
- JST Data (v0.1) – An example gdf file of 1092 human chr1 sequences from the 1000 Genomes Project, the reference sequence and some pattern examples.
- JST Bench (v0.2) – Linux x86 64 bit binaries of the JST benchmark tool built on Debian Wheezy.
Rahn, R., Weese D., & Reinert, K. (2014). Journaled String Tree – A scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics.
If you are not redirected automatically, go to http://seqan.github.io/lambda.
Lambda is a local aligner optimized for many query sequences and searches in protein space. It is compatible to BLAST, but much faster than BLAST and many other comparable tools.
Downloads are available from the sidebar on the left. Lambda is Free and open source software, so you can use it for any purpose, free of charge. However certain conditions apply when you (re-)distribute or modify Lambda, please respect the license. Also, please cite the publication if you use Lambda anywhere in your academic work. Thank you!
The manual, build instructions and much more are available in the WIKI.
Lambda: the local aligner for massive biological data; Hannes Hauswedell, Jochen Singer, Knut Reinert; Bioinformatics 2014 30 (17): i349-i355; doi: 10.1093/bioinformatics/btu439
Large-scale population and disease association studies have shown the importance as well as the difficulty of detecting structural variants (SVs) in genomic and also transcriptomic sequencing data. Although being very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or even multiple splits in the read-to-reference alignment, with parts of the read mapping to various locations on the reference sequence.
We present GUSTAF, a sound generic multi-split detection method. GUSTAF uses SeqAn’s exact local aligner STELLAR to find partial read alignments. Compatible partial alignments are identified, and a split-read graph storing all compatibility information is constructed for each read. Vertices in the graph represent partial alignments, edges represent possible split positions. Using an exact dynamic programming approach, we refine the alignments around possible split positions to determine precise breakpoint locations at single-nucleotide level. We use a DAG shortest path algorithm to determine the best combination of refined alignments, and report those breakpoints supported by multiple reads.
Usage: STELLAR is not a read mapper, and hence, GUSTAF is not designed to replace any read mapper pipeline with SV detection on top. We recommend doing read mapping with your favourite read mapper and then run STELLAR and GUSTAF, seperately, on the remaining unmappable reads.
Please take a look at the README file for usage instructions.
- Download the binaries
- View the source code and README on GitHub
- Benchmark Data (v1.0) – The data used for obtaining the results of the 2014 paper.
- Trappe, K., Emde, A.-K., Ehrlich, H.-C., Reinert, K. (2014). Gustaf: Detecting and correctly classifiying SVs in the NGS twilight zone. Bioinformatics.
- Trappe, K. (2012). Multi-Split Mapping of NGS Reads for Variant Detection. Master’s thesis, Freie Universitaet Berlin.