Journaled String Tree (JST)

Abstract

Motivation: Next-generation sequencing (NGS) has revolutionized biomedical research in the last decade and led to a continuous stream of developments in bioinformatics, addressing the need for fast and space-efficient solutions for analyzing NGS data. Researchers often need to analyze a set of genomic sequences that stem from closely related species or from individuals of the same species, so the analyzed sequences are very similar. For analyses in which a local change in the examined sequence induces only a local change in the result, it is desirable to examine identical or similar regions only once rather than repeatedly.

Results: In this work we provide a data type that exploits the data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments we show that algorithms that would otherwise scan each sequence one after another can be sped up by a factor of 115.
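
The idea of analyzing shared regions only once can be illustrated with a deliberately simplified sketch. It is not the JST implementation: it assumes that every sequence differs from a common reference only by single-base substitutions, whereas the actual data structure is more general. The names `Snp`, `matchesAt`, and `countShared` are hypothetical and not part of SeqAn. The reference is scanned once for a pattern, and for each sequence only the windows that overlap one of its substitutions are re-examined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical delta record (an assumption for this sketch): the sequence
// has `base` at position `pos` instead of the reference character.
struct Snp
{
    std::size_t pos;
    char base;
};

// Does `pattern` occur at `start` in the reference with `snps` applied?
// Substitutions are looked up on the fly; the sequence is never materialized.
// Assumes at most one substitution per position.
static bool matchesAt(const std::string& ref, const std::vector<Snp>& snps,
                      const std::string& pattern, std::size_t start)
{
    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        char c = ref[start + i];
        for (const Snp& snp : snps)
            if (snp.pos == start + i)
                c = snp.base;
        if (c != pattern[i])
            return false;
    }
    return true;
}

// Counts the occurrences of `pattern` in every sequence, where sequence k is
// the reference with the substitutions in sequences[k] applied.
std::vector<std::size_t> countShared(const std::string& ref,
                                     const std::vector<std::vector<Snp>>& sequences,
                                     const std::string& pattern)
{
    const std::size_t n = ref.size();
    const std::size_t m = pattern.size();
    assert(m > 0);
    std::vector<std::size_t> counts(sequences.size(), 0);
    if (m > n)
        return counts;

    // Shared work: scan the reference a single time.
    std::vector<char> refHit(n - m + 1, 0);
    std::size_t refCount = 0;
    for (std::size_t s = 0; s + m <= n; ++s)
    {
        if (ref.compare(s, m, pattern) == 0)
        {
            refHit[s] = 1;
            ++refCount;
        }
    }

    // Per-sequence work: only window starts that overlap a substitution.
    for (std::size_t k = 0; k < sequences.size(); ++k)
    {
        std::vector<std::size_t> starts;
        for (const Snp& snp : sequences[k])
        {
            assert(snp.pos < n);
            const std::size_t lo = snp.pos >= m - 1 ? snp.pos - (m - 1) : 0;
            const std::size_t hi = std::min(snp.pos, n - m);
            for (std::size_t s = lo; s <= hi; ++s)
                starts.push_back(s);
        }
        std::sort(starts.begin(), starts.end());
        starts.erase(std::unique(starts.begin(), starts.end()), starts.end());

        std::size_t count = refCount;
        for (std::size_t s : starts)
        {
            if (refHit[s])
                --count;  // the reference hit may be destroyed by a substitution
            if (matchesAt(ref, sequences[k], pattern, s))
                ++count;  // re-evaluate the window in this sequence
        }
        counts[k] = count;
    }
    return counts;
}
```

In this sketch the per-sequence cost depends on the number of substitutions and the pattern length, not on the length of the reference, which is the effect the paragraph above describes.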

Main Features:

  • Journaled String Tree data structure and traverser.
  • Generic Journaled String Tree finder.
  • Online-search functors: Naive, Horspool, Shift-And, Shift-Or, Myers’ Bitvector.
  • GDF converter for converting VCF files into our Genome Delta Format (GDF).
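
As an illustration of one of the online-search algorithms listed above, the following is a minimal standalone sketch of Shift-And exact matching on plain strings. It is not the SeqAn functor interface; the function name `shiftAndSearch` is hypothetical, and the sketch assumes the pattern has at most 64 characters so that the search state fits into one machine word.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the start positions of all exact occurrences of `pattern` in `text`.
// Hypothetical helper for illustration only.
std::vector<std::size_t> shiftAndSearch(const std::string& text,
                                        const std::string& pattern)
{
    const std::size_t m = pattern.size();
    assert(m > 0 && m <= 64);  // the state must fit into one 64-bit word

    // mask[c] has bit i set if and only if pattern[i] == c.
    std::array<std::uint64_t, 256> mask{};
    for (std::size_t i = 0; i < m; ++i)
        mask[static_cast<unsigned char>(pattern[i])] |= std::uint64_t{1} << i;

    const std::uint64_t accept = std::uint64_t{1} << (m - 1);

    // Bit i of `state` is set if pattern[0..i] matches a suffix of the text
    // read so far.
    std::uint64_t state = 0;
    std::vector<std::size_t> hits;
    for (std::size_t j = 0; j < text.size(); ++j)
    {
        // Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1u) & mask[static_cast<unsigned char>(text[j])];
        if (state & accept)
            hits.push_back(j + 1 - m);  // a full match ends at position j
    }
    return hits;
}
```

Each text character is processed with a constant number of word operations, which is why bit-parallel matchers of this kind are well suited to scanning long sequences.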

Availability: The data structure and associated tools are publicly available (see Links) and are part of SeqAn, the C++ template library for sequence analysis. The current stable version is based on SeqAn 1.4.2; a port to SeqAn 2.0.0 is planned for the near future.

Links

Please Cite

  • R. Rahn, D. Weese, K. Reinert, “Journaled string tree--a scalable data structure for analyzing thousands of similar genomes on your laptop”, 2014-07-15.
    @article{fu_mi_publications1448,
     abstract = {Motivation: Next-generation sequencing (NGS) has revolutionized biomedical research in the past decade and led to a continuous stream of developments in bioinformatics, addressing the need for fast and space-efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences that stem from closely related species or are indeed individuals of the same species. Hence, the analyzed sequences are similar. For analyses where local changes in the examined sequence induce only local changes in the results, it is obviously desirable to examine identical or similar regions not repeatedly.
    
    Results: In this work, we provide a datatype that exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments, we show that algorithms that otherwise would scan each reference sequentially can be speeded up by a factor of 115.
    
    Availability: The data structure and associated tools are publicly available at http://www.seqan.de/projects/jst and are part of SeqAn, the C++ template library for sequence analysis.
    
    Contact: rene.rahn@fu-berlin.de},
     author = {R. Rahn and D. Weese and K. Reinert},
     journal = {Bioinformatics},
     month = {July},
     title = {Journaled string tree--a scalable data structure for analyzing thousands of similar genomes on your laptop},
     url = {http://publications.imp.fu-berlin.de/1448/},
     year = {2014}
    }

Contact

For questions, comments, or suggestions please contact:

René Rahn rene.rahn@fu-berlin.de