SeqAn :: T-Coffee
SeqAn::T-Coffee is an open source multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.
Rausch, T., Emde, A.-K., Weese, D., Döring, A., Notredame, C., and Reinert, K. (2008). Segment-based multiple sequence alignment. Bioinformatics, 24(16), i187–i192.
Yara – Yet another read aligner
Yara is an exact tool for aligning DNA sequencing reads to reference genomes.
- Exhaustive enumeration of sub-optimal end-to-end alignments under the edit distance.
- Excellent speed, memory footprint and accuracy.
- Accurate mapping quality computation.
- Support for reference genomes consisting of millions of contigs.
- Direct output in SAM/BAM format.
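To make the first feature in the list above concrete, the edit distance between a read and a reference segment can be computed with the textbook dynamic program. This is a minimal sketch only: the function name `edit_distance` is hypothetical, and the code does not reflect how Yara computes or enumerates alignments internally.

```python
def edit_distance(read: str, ref: str) -> int:
    """End-to-end (global) edit distance between a read and a reference segment.

    Textbook Levenshtein dynamic program, kept to two rows of the DP matrix.
    Illustrative only; not Yara's implementation.
    """
    # prev[j] = distance between the empty read prefix and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, r in enumerate(read, start=1):
        cur = [i]  # distance between read[:i] and the empty reference prefix
        for j, c in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # read base aligned to a gap
                cur[j - 1] + 1,          # reference base aligned to a gap
                prev[j - 1] + (r != c),  # match (cost 0) or mismatch (cost 1)
            ))
        prev = cur
    return prev[-1]
```

For example, `edit_distance("ACGT", "AGGT")` counts one substitution, and `edit_distance("ACGT", "ACT")` counts one indel.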
Supported data: Yara has been tested on DNA reads (i.e., Whole Genome, Exome, ChIP-seq, MeDIP-seq) produced by the following sequencing platforms:
- Illumina GA II, HiSeq and MiSeq (single-end and paired-end).
- Life Technologies Ion Torrent Proton and PGM.
Quality trimming is necessary for Ion Torrent reads and recommended for Illumina reads.
Unsupported data: Yara is not designed for the following kinds of reads:
- RNA-seq reads spanning splicing sites.
- Long noisy reads (e.g., Pacific Biosciences RSII, Oxford Nanopore MinION).
- Download binaries
- View source code and README on GitHub (compiling Yara from source is strongly recommended)
- Yara is the follow-up of the Masai project. Use of Masai is discouraged. Nonetheless, old Masai binaries can still be downloaded here.
- Siragusa, E. (2015). Approximate string matching for high-throughput sequencing. Free University of Berlin.
- Siragusa, E., Weese, D., and Reinert, K. (2013). Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Research, 2013, 1–8.
Motivation: In recent years, next-generation sequencing (NGS) has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that are fast and highly sensitive, and that can deal with long reads and a large absolute number of indels.
Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3, which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of Myers’ bit-vector algorithm for verification, memory-saving measures and support for the SAM output format. This leads to much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior to theirs in terms of sensitivity while exhibiting practical and often competitive run times. In addition, RazerS 3 works without a precomputed index.
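The idea of filtering by counting q-grams can be sketched with the classical q-gram lemma: a read of length m that matches a reference segment with at most k errors shares at least m + 1 - (k + 1)q q-grams with it. The following is an illustrative sketch under that lemma, not RazerS code; the function names are hypothetical, and RazerS's sensitivity control and index structures are not modelled.

```python
from collections import Counter


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lower bound from the q-gram lemma on shared q-grams for <= k errors."""
    return m + 1 - (k + 1) * q


def shared_qgrams(read: str, window: str, q: int) -> int:
    """Count q-grams of the read that also occur in the window (as a multiset)."""
    available = Counter(window[i:i + q] for i in range(len(window) - q + 1))
    shared = 0
    for i in range(len(read) - q + 1):
        gram = read[i:i + q]
        if available[gram] > 0:
            available[gram] -= 1  # each window q-gram may be used only once
            shared += 1
    return shared


def passes_filter(read: str, window: str, q: int, k: int) -> bool:
    """True if the window cannot be ruled out and must be verified."""
    return shared_qgrams(read, window, q) >= qgram_threshold(len(read), q, k)
```

A window that passes the filter is only a candidate; it still has to be verified by an actual alignment, which is the role the text assigns to the banded Myers' algorithm.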
- import of FASTA/FASTQ read and genome files
- 5 output formats (including SAM)
- reads can be of arbitrary length
- supports Hamming and edit distance read mapping with configurable error rates
- supports paired-end read mapping
- configurable and predictable sensitivity (runtime/sensitivity tradeoff)
- key improvements (compared to RazerS):
  - multicore parallelization
  - additional pigeonhole filter optimized for low error rates with controllable sensitivity
  - banded Myers’ algorithm for verification
  - full sensitivity under the definition given in Rabema
  - SAM output
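The pigeonhole filter mentioned in the list above rests on a simple argument: if a read is cut into k + 1 non-overlapping seeds and it matches with at most k errors, at least one seed must match exactly. A minimal sketch of that principle follows; the function name is hypothetical, the naive `str.find` scan stands in for whatever search RazerS 3 actually uses, and the verification step is left out.

```python
from typing import List


def pigeonhole_candidates(read: str, reference: str, k: int) -> List[int]:
    """Candidate read start positions in the reference for <= k errors.

    The read is split into k + 1 non-overlapping seeds of equal length
    (trailing bases that do not fill a seed are ignored). Every exact seed
    hit implies one candidate start position; with indels the true start
    may be shifted by up to k, which a verification step must allow for.
    """
    n_seeds = k + 1
    seed_len = len(read) // n_seeds
    if seed_len == 0:
        raise ValueError("read too short for k + 1 non-empty seeds")
    candidates = set()
    for s in range(n_seeds):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        pos = reference.find(seed)
        while pos != -1:
            candidates.add(pos - offset)  # implied start of the whole read
            pos = reference.find(seed, pos + 1)
    return sorted(candidates)
```

With `k = 1`, a read with one mismatch in its second half is still found through its error-free first seed.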
Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows.
- Download the binaries
- View the source code and README on GitHub
- The previous version of RazerS can be found here
- Check out our newer, faster read aligner Yara
- Weese, D., Holtgrewe, M., & Reinert, K. (2012). RazerS 3: Faster, fully sensitive read mapping. Bioinformatics, 28(20), 2592–2599.
- Weese, D., Emde, A.-K., Rausch, T., Döring, A., & Reinert, K. (2009). RazerS – Fast read mapping with sensitivity control. Genome Research, 19(9), 1646–1654.
We present a read simulator software for Illumina, 454 and Sanger reads. Its features include position specific error rates and base quality values. For Illumina reads, we give a comprehensive analysis with empirical data for the error and quality model. For the other technologies, we use models from the literature. It has been written with performance in mind and can sample reads from large genomes. The C++ source code is extensible, and freely available under the GPL/LGPL.
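As a rough illustration of position-specific error rates and base quality values, the sketch below samples a read from a genome and substitutes each base with a probability that depends on its position in the read. It is a toy model under stated assumptions: substitutions only, a caller-supplied error profile, and hypothetical function names. It is not Mason's empirical error or quality model.

```python
import math
import random
from typing import List, Tuple


def phred_quality(p_error: float) -> int:
    """Phred-scaled quality for an error probability p_error > 0."""
    return round(-10 * math.log10(p_error))


def simulate_read(genome: str, read_len: int, error_rates: List[float],
                  rng: random.Random) -> Tuple[int, str]:
    """Sample one read; error_rates[i] is the substitution rate at read position i."""
    start = rng.randrange(len(genome) - read_len + 1)
    bases = []
    for i, base in enumerate(genome[start:start + read_len]):
        if rng.random() < error_rates[i]:
            # Substitute with a different nucleotide, chosen uniformly.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    return start, "".join(bases)
```

A profile such as `[0.001 + 0.0005 * i for i in range(read_len)]` (an invented ramp, used here only as an example) would make errors more likely towards the end of the read, and `phred_quality` maps each rate to a quality value.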
Holtgrewe, M. (2010). Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.
Journaled String Tree (JST)
Motivation: Next generation sequencing (NGS) has revolutionized biomedical research in the last decade and led to a continuous stream of developments in bioinformatics addressing the need for fast and space-efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences that stem from closely related species or even from individuals of the same species. Hence the analyzed sequences are very similar. For analyses where local changes in the examined sequence induce only local changes in the results, it is desirable to avoid examining identical or similar regions repeatedly.
Results: In this work we provide a data type that exploits the data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments we show that algorithms that would otherwise scan each reference sequentially can be sped up by a factor of 115.
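The principle that local changes induce only local changes in the results can be illustrated with a much simpler model than the JST itself. In the sketch below, each sequence is assumed to be the reference plus a set of single-base substitutions (no indels); the pattern is searched in the reference once, and for each sequence only the windows overlapping one of its substitutions are re-examined. All names are hypothetical, and this is not the Journaled String Tree data structure.

```python
from typing import Dict, List


def find_all(text: str, pattern: str) -> List[int]:
    """All (possibly overlapping) start positions of pattern in text."""
    hits, pos = [], text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def search_with_snps(reference: str, snps_per_seq: List[Dict[int, str]],
                     pattern: str) -> List[List[int]]:
    """Pattern hits per sequence, where each sequence = reference + SNPs."""
    m = len(pattern)
    ref_hits = set(find_all(reference, pattern))  # shared regions scanned once
    results = []
    for snps in snps_per_seq:
        hits = set(ref_hits)
        # Only windows that overlap a SNP can differ from the reference result.
        affected = set()
        for pos in snps:
            first = max(0, pos - m + 1)
            last = min(pos, len(reference) - m)
            affected.update(range(first, last + 1))
        for start in affected:
            window = "".join(snps.get(i, reference[i])
                             for i in range(start, start + m))
            if window == pattern:
                hits.add(start)
            else:
                hits.discard(start)
        results.append(sorted(hits))
    return results
```

The work per sequence is proportional to the number of its variants times the pattern length, rather than to the full sequence length.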
- Journaled String Tree data structure and traverser.
- Generic Journaled String Tree finder.
- Online-search functors: Naive, Horspool, Shift-And, Shift-Or, Myers’ Bitvector.
- GDF converter to convert VCF files into our Genome Delta Format.
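Of the online-search algorithms named in the list above, Shift-And is compact enough to sketch. The version below runs on a plain Python string and uses Python's arbitrary-precision integers as bit vectors; it does not use the JST or any SeqAn interface, and the function name is hypothetical.

```python
from typing import Dict, List


def shift_and_search(pattern: str, text: str) -> List[int]:
    """Exact pattern matching with the bit-parallel Shift-And algorithm."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # masks[c] has bit i set iff pattern[i] == c.
    masks: Dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit set when the whole pattern has been matched
    state = 0              # bit i set iff pattern[:i + 1] ends at current char
    hits = []
    for pos, c in enumerate(text):
        # Extend every active prefix by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

Each text character costs one shift, one OR and one AND, independent of the pattern length as long as the pattern fits into a machine word in a native implementation.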
Availability: The data structure and associated tools are publicly available (see LINKS) and are part of SeqAn, the C++ template library for sequence analysis. The current stable version is based on SeqAn 1.4.2 and is going to be ported to SeqAn 2.0.0 in the near future.
- You can find the source code for the Journaled String Tree on GitHub here.
- Take a look at the JST finder interface here.
- JST Data (v0.1) – An example GDF file of 1092 human chr1 sequences from the 1000 Genomes Project, the reference sequence and some pattern examples.
- JST Bench (v0.2) – Linux x86 64 bit binaries of the JST benchmark tool built on Debian Wheezy.
Rahn, R., Weese, D., & Reinert, K. (2014). Journaled String Tree – A scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics.
The Lambda project page is located at http://seqan.github.io/lambda.
Lambda is a local aligner optimized for many query sequences and searches in protein space. It is compatible with BLAST, but much faster than BLAST and many other comparable tools.
Downloads are available from the sidebar on the left. Lambda is free and open source software, so you can use it for any purpose, free of charge. However, certain conditions apply when you (re-)distribute or modify Lambda; please respect the license. Also, please cite the publication if you use Lambda in your academic work. Thank you!
The manual, build instructions and much more are available in the wiki.
Hauswedell, H., Singer, J., & Reinert, K. (2014). Lambda: the local aligner for massive biological data. Bioinformatics, 30(17), i349–i355. doi: 10.1093/bioinformatics/btu439
Large-scale population and disease association studies have shown the importance as well as the difficulty of detecting structural variants (SVs) in genomic and transcriptomic sequencing data. Although very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or even multiple splits in the read-to-reference alignment, with parts of the read mapping to various locations on the reference sequence.
We present GUSTAF, a sound generic multi-split detection method. GUSTAF uses SeqAn’s exact local aligner STELLAR to find partial read alignments. Compatible partial alignments are identified, and a split-read graph storing all compatibility information is constructed for each read. Vertices in the graph represent partial alignments, edges represent possible split positions. Using an exact dynamic programming approach, we refine the alignments around possible split positions to determine precise breakpoint locations at single-nucleotide level. We use a DAG shortest path algorithm to determine the best combination of refined alignments, and report those breakpoints supported by multiple reads.
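The graph step described above can be sketched as a shortest-path computation over a DAG of partial alignments. The sketch makes several simplifying assumptions that are not taken from GUSTAF: two partial alignments are compatible only if they do not overlap in the read, an edge costs the number of read bases left unaligned between them plus the errors of the second alignment, and no breakpoint refinement is done. The names `Partial` and `best_split_chain` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Partial(NamedTuple):
    read_begin: int  # inclusive start in the read
    read_end: int    # exclusive end in the read (must be > read_begin)
    ref_begin: int   # start in the reference
    errors: int      # edit errors inside this partial alignment


def best_split_chain(read_len: int,
                     partials: List[Partial]) -> Tuple[int, List[Partial]]:
    """Cheapest chain of compatible partial alignments covering the read.

    Cost = unaligned read bases + alignment errors along the chain.
    Sorting by read_begin yields a topological order, because an edge a -> b
    requires a.read_end <= b.read_begin and alignments are non-empty.
    """
    order = sorted(range(len(partials)), key=lambda i: partials[i].read_begin)
    cost = {}
    pred = {}
    for idx, i in enumerate(order):
        p = partials[i]
        cost[i], pred[i] = p.read_begin + p.errors, None  # edge from source
        for j in order[:idx]:
            q = partials[j]
            if q.read_end <= p.read_begin:  # compatible: no overlap in read
                c = cost[j] + (p.read_begin - q.read_end) + p.errors
                if c < cost[i]:
                    cost[i], pred[i] = c, j
    # Edge to the sink: read bases left unaligned after the last alignment.
    best: Optional[int] = None
    best_cost = read_len  # empty chain: the whole read stays unaligned
    for i in order:
        total = cost[i] + read_len - partials[i].read_end
        if total < best_cost:
            best, best_cost = i, total
    chain = []
    while best is not None:
        chain.append(partials[best])
        best = pred[best]
    chain.reverse()
    return best_cost, chain
```

In this model, each pair of consecutive alignments in the returned chain marks one split position in the read, and their reference coordinates give the two sides of a candidate breakpoint.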
Usage: STELLAR is not a read mapper, and hence GUSTAF is not designed to replace a read mapper pipeline with SV detection on top. We recommend doing read mapping with your favourite read mapper and then running STELLAR and GUSTAF, separately, on the remaining unmappable reads.
Please take a look at the README file for usage instructions.
- Download the binaries
- View the source code and README on GitHub
- Benchmark Data (v1.0) – The data used for obtaining the results of the 2014 paper.
- Trappe, K., Emde, A.-K., Ehrlich, H.-C., & Reinert, K. (2014). Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone. Bioinformatics.
- Trappe, K. (2012). Multi-Split Mapping of NGS Reads for Variant Detection. Master’s thesis, Freie Universität Berlin.
Fiona: A parallel and automatic strategy for read error correction
Fiona is a tool for the automatic correction of sequencing errors in reads produced by high-throughput sequencing experiments. It uses an efficient implementation of suffix arrays to detect read overlaps with different seed lengths in parallel. Fiona was compared on several real datasets to state-of-the-art methods and showed overall superior correction accuracy. It was also among the fastest. Additionally, Fiona has unique characteristics that make it a good choice over existing programs:
- No parameters to set for the user. You just need to know the length of the genome!
- Correction of both substitution and indel errors.
- Optimal correction over a range of seed values.
- Multicore-Parallelization using OpenMP.
- Efficient, memory-saving implementation.
Schulz, M. H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., & Richard, H. (2014). Fiona: a parallel and automatic strategy for read error correction. Bioinformatics, 30(17), i356–i363.
ANISE and BASIL
Motivation: Large insertions of novel sequence are an important type of structural variant. Previous studies used traditional de novo assemblers for assembling non-mapping high-throughput sequencing (HTS) or capillary reads and then tried to anchor them in the reference using paired read information.
Results: We present approaches for detecting insertion breakpoints and targeted assembly of large insertions from HTS paired data: BASIL and ANISE. On near-identity repeats that are hard for assemblers, ANISE employs a repeat resolution step. This results in far better reconstructions than obtained by ABYSS. On simulated data, we found our insert assembler to be competitive with the de novo assembler ABYSS while yielding already anchored inserted sequence as opposed to unanchored contigs as from ABYSS. On real-world data, we detected novel sequence in a human individual and thoroughly validated the assembled sequence.
Holtgrewe, M., Kuchenbecker, L., & Reinert, K. (2015). Methods for the detection and assembly of novel sequence in high-throughput sequencing data. Bioinformatics, 31(12), 1904–1912.