Vaquita, fast and Accurate Identification of Structural Variation Using Combined Evidence

Abstract

Motivation: Comprehensive identification of structural variations (SVs) is a crucial task for studying genetic diversity and diseases. However, it remains challenging. There is only a marginal consensus between different methods, and our understanding of SVs is substantially limited. In general, integration of multiple pieces of evidence including split-read, read-pair, soft-clip, and read-depth yields the best result regarding accuracy. However, doing this step by step is usually cumbersome and computationally expensive.

Result: We present Vaquita, an accurate and fast tool for the identification of structural variations, which leverages all four types of evidence in a single program. After merging SVs from split-reads and discordant read-pairs, Vaquita realigns the soft-clipped reads to the selected regions using a fast bit-vector algorithm. Furthermore, it also considers the discrepancy of depth distribution around breakpoints using Kullback-Leibler divergence. Finally, Vaquita provides an additional metric for candidate selection based on voting, and also provides robust prioritization based on rank aggregation. We show that Vaquita is robust in terms of sequencing coverage, insertion size of the library, and read length, and is comparable or even better for the identification of deletions, inversions, duplications, and translocations than state-of-the-art tools, using both simulated and real datasets. In addition, Vaquita is more than eight times faster than any other tools in comparison.

Availability: Vaquita is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license.

Links

Please Cite

  • Jongkyu Kim, Knut Reinert, Russell Schwartz, Knut Reinert, “Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence”, iss. 88, p. 14, 2017-08.
    cite this publication
    @incollection{fu_mi_publications2133,
     abstract = {Motivation: 
    Comprehensive identification of structural variations (SVs) is a crucial task for studying genetic diversity and diseases. However, it remains challenging. There is only a marginal consensus between different methods, and our understanding of SVs is substantially limited.In general, integration of multiple pieces of evidence including split-read, read-pair, soft-clip, and read-depth yields the best result regarding accuracy. However, doing this step by step is usually cumbersome and computationally expensive. 
    Result: 
    We present Vaquita, an accurate and fast tool for the identification of structural variations, which leverages all four types of evidence in a single program. After merging SVs from split-reads and discordant read-pairs, Vaquita realigns the soft-clipped reads to the selected regions using a fast bit-vector algorithm. Furthermore, it also considers the discrepancy of depth distribution around breakpoints using Kullback-Leibler divergence. Finally, Vaquita provides an additional metric for candidate selection based on voting, and also provides robust prioritization based on rank aggregation. We show that Vaquita is robust in terms of sequencing coverage, insertion size of the library, and read length, and is comparable or even better for the identification of deletions, inversions, duplications, and translocations than state-of-the-art tools, using both simulated and real datasets. In addition, Vaquita is more than eight times faster than any other tools in comparison. 
    Availability: 
    Vaquita is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and can be downloaded at http://github.com/seqan/vaquita},
     address = {Saarbr{\"u}cken/Wadern},
     author = {Jongkyu Kim and Knut Reinert},
     booktitle = {17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
     editor = {Russell Schwartz and Knut Reinert},
     month = {August},
     number = {88},
     pages = {185(13:1)--198(13:14)},
     publisher = {Dagstuhl LIPIcs},
     series = {LIPICS},
     title = {Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence},
     url = {http://publications.imp.fu-berlin.de/2133/},
     year = {2017}
    }

Contact

For questions, comments, or suggestions please contact:

Jongkyu Kim vaquita@jongkyu.kim
˄