Yara

Abstract

Yara is an exact tool for aligning DNA sequencing reads to reference genomes.

Main features:

  • Exhaustive enumeration of sub-optimal end-to-end alignments under the edit distance.
  • Excellent speed, memory footprint and accuracy.
  • Accurate mapping quality computation.
  • Support for reference genomes consisting of millions of contigs.
  • Direct output in SAM/BAM format.
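
To illustrate what an end-to-end alignment under the edit distance means, here is a minimal Python sketch. It is a naive dynamic program (Sellers' algorithm), not the indexed, seed-based search that Yara implements; the function name `end_positions_within` and its parameters are hypothetical and exist only for this illustration. The whole read must be aligned, while the alignment may start and end anywhere in the reference.

```python
def end_positions_within(read, ref, max_errors):
    """Report (end, errors) for every 1-based reference end position where
    the entire read aligns to some reference substring with at most
    max_errors edits (mismatches, insertions, deletions).

    Illustrative only: runs in O(len(read) * len(ref)) time.
    """
    # Column for the empty reference prefix: aligning read[:i] costs i edits.
    prev = list(range(len(read) + 1))
    hits = []
    for end, ref_base in enumerate(ref, start=1):
        # Row 0 is always 0: an alignment may start anywhere in the reference.
        cur = [0]
        for i, read_base in enumerate(read, start=1):
            cur.append(min(
                prev[i - 1] + (read_base != ref_base),  # match or mismatch
                prev[i] + 1,                            # extra reference base
                cur[i - 1] + 1,                         # extra read base
            ))
        if cur[-1] <= max_errors:
            hits.append((end, cur[-1]))
        prev = cur
    return hits


if __name__ == "__main__":
    print(end_positions_within("ACGT", "TTACGTTT", 0))  # [(6, 0)]
```

Raising `max_errors` makes the function also report sub-optimal locations, which is the sense in which such alignments can be enumerated exhaustively; a practical mapper needs an index and filtration to do this at genome scale.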

Supported data:

Yara has been tested on DNA reads (i.e., Whole Genome, Exome, ChIP-seq, MeDIP-seq) produced by the following sequencing platforms:

  • Illumina GA II, HiSeq and MiSeq (single-end and paired-end).
  • Life Technologies Ion Torrent Proton and PGM.

Quality trimming is necessary for Ion Torrent reads and recommended for Illumina reads.
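
As a rough illustration of quality trimming, the following Python sketch removes low-quality bases from the 3' end of a read. It is not part of Yara, and dedicated trimming tools use more refined criteria. The function name `trim_3prime`, the threshold of 20 and the Phred+33 quality encoding are assumptions made for this example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    seq  -- read sequence
    qual -- quality string of the same length (Phred+offset encoded, assumed)
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    print(trim_3prime("ACGTAC", "IIII##"))  # ('ACGT', 'IIII')
```

Reads trimmed this way can become very short, so in practice a minimum length filter is usually applied after trimming.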

Unsupported data:

  • RNA-seq reads spanning splicing sites.
  • Long noisy reads (e.g., Pacific Biosciences RSII, Oxford Nanopore MinION).

Previous applications:

Yara is the follow-up to the Masai project. Use of Masai is discouraged; nonetheless, old Masai binaries can still be downloaded here.

Links

Please Cite

  • E. Siragusa, D. Weese, K. Reinert, “Fast and accurate read mapping with approximate seeds and multiple backtracking”, Nucleic Acids Research, vol. 41, iss. 7, 2013-01-28.
    @article{fu_mi_publications1161,
     abstract = {We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2–4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai.},
     author = {E. Siragusa and D. Weese and K. Reinert},
     journal = {Nucleic Acids Research},
     month = {January},
     number = {7},
     pages = {e78},
     publisher = {Oxford University Press},
     title = {Fast and accurate read mapping with approximate seeds and multiple backtracking},
     url = {http://publications.imp.fu-berlin.de/1161/},
     volume = {41},
     year = {2013}
    }
  • Enrico Siragusa, “Approximate string matching for high-throughput sequencing”, p. 127, 2015-07-23.
    @phdthesis{fu_mi_publications2507,
     abstract = {Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies allow to sequence cheaply and rapidly an individual's DNA sample under the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to design efficient and accurate read mapping tools. In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state of the art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools.
First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de-facto standard read mapping tools.},
     author = {Enrico Siragusa},
     month = {July},
     school = {Freie Universit{\"a}t Berlin},
     title = {Approximate string matching for high-throughput sequencing},
     url = {http://publications.imp.fu-berlin.de/2507/},
     year = {2015}
    }

Contact

For questions, comments, or suggestions please contact:

Enrico Siragusa enrico.siragusa@fu-berlin.de