Gustaf

Abstract

Large-scale population and disease association studies have shown both the importance and the difficulty of detecting structural variants (SVs) in genomic and transcriptomic sequencing data. Although very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. Such events cause one or more splits in the read-to-reference alignment, with parts of the read mapping to different locations on the reference sequence. We present GUSTAF, a sound generic multi-split detection method. GUSTAF uses SeqAn’s exact local aligner STELLAR to find partial read alignments. It then identifies compatible partial alignments and constructs, for each read, a split-read graph storing all compatibility information. Vertices in the graph represent partial alignments; edges represent possible split positions. Using an exact dynamic programming approach, we refine the alignments around possible split positions to determine precise breakpoint locations at single-nucleotide resolution. We use a DAG shortest path algorithm to determine the best combination of refined alignments, and report those breakpoints supported by multiple reads.
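To make the split-read graph idea concrete, here is a minimal Python sketch, not GUSTAF's actual implementation. It treats partial alignments of one read as vertices, connects two alignments when they are roughly adjacent on the read, and picks the cheapest source-to-sink chain by dynamic programming over a topological order. The class `PartialAlignment`, the thresholds `max_gap` and `max_overlap`, and the penalty values are all hypothetical choices for illustration. The sketch merely tolerates small overlaps and does not perform the alignment refinement step described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartialAlignment:
    # Hypothetical record for one local match of a read (half-open intervals).
    read_begin: int
    read_end: int
    ref_name: str
    ref_begin: int
    ref_end: int
    errors: int  # edit distance of this local alignment

def best_split_chain(read_length, alignments, max_gap=5, max_overlap=5,
                     gap_penalty=1, split_penalty=2):
    """Return (cost, chain) for the cheapest chain covering the read, or None."""
    # Sorting by read position gives a topological order, because every edge
    # goes from an alignment to one that starts strictly later on the read.
    order = sorted(alignments, key=lambda a: (a.read_begin, a.read_end))
    inf = float("inf")
    dist = [inf] * len(order)   # cheapest cost of a chain ending at vertex j
    pred = [None] * len(order)  # predecessor on that chain (None = source)
    for j, b in enumerate(order):
        # Edge from the artificial source: the read prefix may stay unaligned
        # only if it is short.
        if b.read_begin <= max_gap:
            dist[j] = b.errors + b.read_begin * gap_penalty
        for i in range(j):
            a = order[i]
            if dist[i] == inf:
                continue
            if not (a.read_begin < b.read_begin and a.read_end < b.read_end):
                continue
            gap = b.read_begin - a.read_end  # negative value means overlap
            if gap > max_gap or -gap > max_overlap:
                continue
            cost = dist[i] + split_penalty + b.errors + max(gap, 0) * gap_penalty
            if cost < dist[j]:
                dist[j], pred[j] = cost, i
    # Edge to the artificial sink: the read suffix may stay unaligned only if short.
    best, best_j = inf, None
    for j, a in enumerate(order):
        tail = read_length - a.read_end
        if dist[j] < inf and tail <= max_gap:
            cost = dist[j] + tail * gap_penalty
            if cost < best:
                best, best_j = cost, j
    if best_j is None:
        return None
    chain = []
    j = best_j
    while j is not None:
        chain.append(order[j])
        j = pred[j]
    chain.reverse()
    return best, chain

def breakpoints(chain):
    """Each pair of consecutive alignments in the chain implies one breakpoint."""
    return [(a.ref_name, a.ref_end, b.ref_name, b.ref_begin)
            for a, b in zip(chain, chain[1:])]
```

In this sketch, a read of length 100 whose first 60 bases match one locus and whose last 40 bases match another yields a two-vertex chain and a single breakpoint between the two reference positions. Counting identical breakpoints across reads would then give the multi-read support mentioned above.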

Usage: STELLAR is not a read mapper, and hence GUSTAF is not designed to replace a read mapping pipeline with SV detection on top. We recommend mapping reads with your favourite read mapper first and then running STELLAR and GUSTAF, separately, on the remaining unmappable reads. Please take a look at the README file for usage instructions.
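As one possible way to collect the remaining unmappable reads, the following Python sketch filters SAM records by the unmapped bit (0x4) of the FLAG field and emits them as FASTA text. It assumes that your read mapper writes unmapped reads to its SAM output. The function name `unmapped_reads_to_fasta` is hypothetical, and mates of a pair share one read name in SAM, so paired-end data may need extra handling. How to pass the resulting file to STELLAR and GUSTAF is described in the README.

```python
def unmapped_reads_to_fasta(sam_lines):
    """Yield FASTA records for reads whose SAM FLAG has the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # 0x4 is the "segment unmapped" bit; "*" means no sequence was stored.
        if flag & 0x4 and seq != "*":
            yield f">{name}\n{seq}\n"
```

With a file on disk, the generator can be consumed as `out.writelines(unmapped_reads_to_fasta(sam))`, where `sam` and `out` are open file handles.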

Links

Please Cite

  • K. Trappe, A.-K. Emde, H.-C. Ehrlich, K. Reinert, “Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone”, Bioinformatics, vol. 30, iss. 24, pp. 3484–3490, 2014-07-14.
    @article{fu_mi_publications1455,
     abstract = {MOTIVATION:
    The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs.
    
    RESULTS:
    We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of $\geq$30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs $>$500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify size and location of dispersed duplications and translocations, which otherwise might be wrongly classified, for example, as large deletions. Availability and implementation: Project information, paper benchmark and source code are available via http://www.seqan.de/projects/gustaf/.
    
    CONTACT: kathrin.trappe@fu-berlin.de.},
     author = {K. Trappe and A.-K. Emde and H.-C. Ehrlich and K. Reinert},
     journal = {Bioinformatics},
     month = {July},
     number = {24},
     pages = {3484--3490},
     publisher = {Oxford University Press},
     title = {Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone},
     url = {http://publications.imp.fu-berlin.de/1455/},
     volume = {30},
     year = {2014}
    }

Contact

For questions, comments, or suggestions please contact:

Kathrin Trappe kathrin.trappe@fu-berlin.de