Official

­

String Similarity Search/Join

String Similarity Search/Join

Abstract

We present in this paper scalable algorithms for optimal string similarity search and join. Our methods are variations of those applied in Masai, our recently published tool for mapping high-throughput DNA sequencing data with unpreceded speed and accuracy. The key features of our approach are filtration with approximate seeds and methods for multiple backtracking. Approximate seeds, compared to exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds. Combined together, these two methods significantly speed up string similarity search and join operations.

Links

Please Cite

Siragusa, E., Weese D., & Reinert, K. (2013). Scalable String Similarity Search/Join with Approximate Seeds and Multiple Backtracking. EDBT/ICDT ’13, March 18 – 22 2013, Genoa, Italy

Stellar

STELLAR

Abstract

Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast but heuristic, and hence often miss significant matches. We provide here the local pairwise aligner STELLAR that has full sensitivity, i.e. guarantees to report all matches of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. For filtering it applies the SWIFT algorithm, for which we have developed a new, exact verification strategy. STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments in the edit or hamming distance model.

Links

Please Cite

Kehr, B., Weese, D., Reinert, K. (2011). STELLAR: fast and exact local alignments. BMC Bioinformatics, 12(Suppl 9):S15, 2011.

 

SeqAn T-Coffee

SeqAn :: T-Coffee

Abstract

SeqAn::T-Coffee is an open source multiple sequence alignment program. SeqAn::T-Coffee aligns amino acid, DNA and RNA sequences using a consistency-based progressive alignment algorithm on a graph of sequence segments.

Links

Please Cite

Rausch, T., Emde, A.-K., Weese, D., Doring, A., Notredame, C., and Reinert, K. (2008). Segment-based multiple sequence alignment. Bioinformatics, 24(16), i187–192.

Yara

Yara – Yet another read aligner

Yara is an exact tool for aligning DNA sequencing reads to reference genomes.

Main features:

  • Exhaustive enumeration of sub-optimal end-to-end alignments under the edit distance.
  • Excellent speed, memory footprint and accuracy.
  • Accurate mapping quality computation.
  • Support for reference genomes consisiting of million of contigs.
  • Direct output in SAM/BAM format.

Supported data: Yara has been tested on DNA reads (i.e., Whole Genome, Exome, ChIP-seq, MeDIP-seq) produced by the following sequencing platforms:

  • Illumina GA II, HiSeq and MiSeq (single-end and paired-end).
  • Life Technologies Ion Torrent Proton and PGM.

Quality trimming is necessary for Ion Torrent reads and recommended for Illumina reads.

Unsupported data:

  • RNA-seq reads spanning splicing sites.
  • Long noisy reads (e.g., Pacific Biosciences RSII, Oxford Nanopore MinION).

Links

  • Download binaries
  • View source code and REAMDE on GitHub (It is strongly recommended to compile Yara from sources)
  • Yara is the follow-up of the Masai project. Use of Masai is discouraged. Nonetheless, old Masai binaries can still be downloaded here.

Please Cite

RazerS 3

RazerS 3

Abstract

Motivation: During the last years NGS sequencing has become a key technology for many applications in the biomedical sciences. Throughput continues to increase and new protocols provide longer reads than currently available. In almost all applications, read mapping is a first step. Hence, it is crucial to have algorithms and implementations that perform fast, with high sensitivity, and are able to deal with long reads and a large absolute number of indels.

Results: RazerS is a read mapping program with adjustable sensitivity based on counting q-grams. In this work we propose the successor RazerS 3 which now supports shared-memory parallelism, an additional seed-based filter with adjustable sensitivity, a much faster, banded version of the Myers’ bit-vector algorithm for verification, memory saving measures and support for the SAM output format. This leads to a much improved performance for mapping reads, in particular long reads with many errors. We extensively compare RazerS 3 with other popular read mappers and show that its results are often superior to them in terms of sensitivity while exhibiting practical and often competetive run times. In addition, RazerS 3 works without a precomputed index.

Main Features:

  • import of FASTA/FASTQ read and genome files
  • 5 output formats (including SAM)
  • reads can be of arbitrary length
  • supports Hamming and edit distance read mapping with configurable error rates
  • supports paired-end read mapping
  • configurable and predictable sensitivity (runtime/sensitivity tradeoff)
  • key improvements (compared to RazerS):
    • multicore parallelization
    • additional pigeonhole filter optimized for low error-rates with controllable sensitivity
    • banded Myers’ algorithm for verification
    • full sensitivity under the definition given in Rabema
    • SAM output

Availability and Implementation: Source code and binaries are freely available for download at http://www.seqan.de/projects/razers. RazerS 3 is implemented in C++ and OpenMP under a GPL license using the SeqAn library and supports Linux, Mac OS X, and Windows.

Links

  • Download the binaries
  • View the source code and README on GitHub
  • The previous version of RazerS can be found here
  • Check out our newer, faster read aligner Yara

Please Cite

Mason

Mason

Abstract

We present a read simulator software for Illumina, 454 and Sanger reads. Its features include position specific error rates and base quality values. For Illumina reads, we give a comprehensive analysis with empirical data for the error and quality model. For the other technologies, we use models from the literature. It has been written with performance in mind and can sample reads from large genomes. The C++ source code is extensible, and freely available under the GPL/LGPL.

Links

Please Cite

Holtgrewe, M. (2010). Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.

JST

Journal String Tree (JST)

Abstract

Motivation: Next generation sequencing (NGS) has revolutionized biomedical research in the last decade and led to a continues stream of developments in bioinformatics addressing the need for fast and space efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences which stem from closely related species or are indeed individuals of the same species. Hence the analyzed sequences are very similar. For analyses where local changes in the examined sequence induce only local changes in the results it is obviously desirable to examine identical or similar regions not repeatedly.

Results: In this work we provide a datatype which exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments we show that algorithms which otherwise would scan each reference sequentially can be speeded up by a factor of 115.

Main Features:

  • Journaled String Tree data structure and traverser.
  • Generic Journaled String Tree finder.
  • Online-search functors: Naive, Horspool, Shift-And, Shift-Or, Myers’ Bitvector.
  • GDF converter to convert vcf files into our Genome Delta Format.

Availability: The data structure and associated tools are publicly available (see LINKS) and are part of SeqAn, the C++ template library for sequence analysis. The current stable version is based on SeqAn 1.4.2 and is going to be ported to SeqAn 2.0.0 in the near future.

Links

  • You can find the source code on GitHub for the Journaled String Tree on here
  • Take a look at the JST finder interface here.
  • JST Data (v0.1) – An example gdf file of 1092 human chr1 sequences from the 1000 Genomes Project, the reference sequence and some pattern examples.
  • JST Bench (v0.2) – Linux x86 64 bit binaries of the JST benchmark tool built on Debian Wheezy.

Please Cite

Rahn, R., Weese D., & Reinert, K. (2014). Journaled String Tree – A scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics.

LAMBDA

If you are not redirected automatically, go to http://seqan.github.io/lambda.

<--

LAMBDA

Abstract

Lambda is a local aligner optimized for many query sequences and searches in protein space. It is compatible to BLAST, but much faster than BLAST and many other comparable tools.

Downloads are available from the sidebar on the left. Lambda is Free and open source software, so you can use it for any purpose, free of charge. However certain conditions apply when you (re-)distribute or modify Lambda, please respect the license. Also, please cite the publication if you use Lambda anywhere in your academic work. Thank you!

The manual, build instructions and much more are available in the WIKI.

Links

Please Cite

Lambda: the local aligner for massive biological data; Hannes Hauswedell, Jochen Singer, Knut Reinert; Bioinformatics 2014 30 (17): i349-i355; doi: 10.1093/bioinformatics/btu439

–>

Gustaf

Gustaf

Abstract

Large-scale population and disease association studies have shown the importance as well as the difficulty of detecting structural variants (SVs) in genomic and also transcriptomic sequencing data. Although being very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or even multiple splits in the read-to-reference alignment, with parts of the read mapping to various locations on the reference sequence.
We present GUSTAF, a sound generic multi-split detection method. GUSTAF uses SeqAn’s exact local aligner STELLAR to find partial read alignments. Compatible partial alignments are identified, and a split-read graph storing all compatibility information is constructed for each read. Vertices in the graph represent partial alignments, edges represent possible split positions. Using an exact dynamic programming approach, we refine the alignments around possible split positions to determine precise breakpoint locations at single-nucleotide level. We use a DAG shortest path algorithm to determine the best combination of refined alignments, and report those breakpoints supported by multiple reads.

Usage: STELLAR is not a read mapper, and hence, GUSTAF is not designed to replace any read mapper pipeline with SV detection on top. We recommend doing read mapping with your favourite read mapper and then run STELLAR and GUSTAF, seperately, on the remaining unmappable reads.
Please take a look at the README file for usage instructions.

Links

Please Cite

Fiona

Fiona: A parallel and automatic strategy for read error correction

Abstract

Fiona is a tool for the automatic correction of sequencing errors in reads produced by high throughput sequencing experiments. It uses an efficient implementation of suffix arrays to detect read overlaps with different seed lengths in parallel. Fiona was compared on several real datasets to state-of-the-art methods and  showed overall superior correction accuracy. It was also among the fastest. Additionaly Fiona embarks unique characteristics which makes it a good choice over existing programs:

  • No parameters to set for the user. You just need to know the length of the genome!
  • Correction of both substitution and indel errors.
  • Optimal correction over a range of seed values.
  • Multicore-Parallelization using OpenMP.
  • Efficient, memory-saving implementation.

Links

Please Cite

Schulz M.H., Weese D., Holtgrewe M., Dimitrova V., Niu S., Reinert K., & Richard H. (2014) Fiona: a parallel and automatic strategy for read error correction. Bioinformatics (2014) 30 (17): i356-i363