• 17th international workshop on algorithms in bioinformatics (wabi 2017), R. Schwartz and K. Reinert, Eds., Saarbrücken/Wadern: Dagstuhl lipics, 2017, vol. 88.
@book{fu_mi_publications2132,
publisher = {Dagstuhl LIPIcs},
month = {August},
series = {LIPIcs},
year = {2017},
volume = {88},
title = {17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
editor = {Russell Schwartz and Knut Reinert},
abstract = {This proceedings volume contains papers presented at the 17th Workshop on Algorithms in Bioinformatics (WABI 2017), which was held in Boston, MA, USA in conjunction with the 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB) from August 21--23, 2017.
The Workshop on Algorithms in Bioinformatics is an annual conference established in 2001 to cover all aspects of algorithmic work in bioinformatics, computational biology, and systems biology. The workshop is intended as a forum for discrete algorithms and machine-learning methods that address important problems in molecular biology, that are founded on sound models, that are computationally efficient, and that have been implemented and tested in simulations and on real datasets. The meeting's focus is on recent research results, including significant work-in-progress, as well as identifying and exploring directions of future research.
WABI 2017 is grateful for the support of ACM-BCB in allowing us to cohost the meetings, as well as to ACM-BCB's sponsors: the Association for Computing Machinery (ACM) and ACM's SIGBIO.
In 2017, a total of 55 manuscripts were submitted to WABI from which 27 were selected for presentation at the conference. This year, WABI is adopting a new proceedings form, publishing the conference proceedings through the LIPIcs (Leibniz International Proceedings in Informatics) proceedings series. Extended versions of selected papers will be invited for publication in a thematic series in the journal Algorithms for Molecular Biology (AMB), published by BioMed Central.
The 27 papers were selected based on a thorough peer review, involving at least three independent reviewers per submitted paper, followed by discussions among the WABI Program Committee members. The selected papers cover a wide range of topics, including statistical inference, phylogenetic studies, sequence and genome analysis, comparative genomics, and mass spectrometry data analysis.
We thank all the authors of submitted papers and the members of the WABI Program Committee and their reviewers for their efforts that made this conference possible. We are also grateful to the WABI Steering Committee for their help and advice. We also thank all the conference participants and speakers who contribute to a great scientific program. In particular, we are indebted to the keynote speaker of the conference, Tandy Warnow, for her presentation. We also thank Christopher Pockrandt for setting up the WABI webpage and Umit Acar for his help with coordinating the WABI and ACM-BCB pages. Finally, we thank the ACM-BCB Organizing Committee, especially Nurit Haspel and Lenore Cowen, for their hard work in making all of the local arrangements and working closely with us to ensure a successful and exciting WABI and ACM-BCB.},
url = {http://publications.imp.fu-berlin.de/2132/}
}
• E. Audain, J. Uszkoreit, T. Sachsenberg, J. Pfeuffer, X. Liang, H. Hermjakob, A. Sanchez, M. Eisenacher, K. Reinert, D. L. Tabb, O. Kohlbacher, and Y. Perez-Riverol, “In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics,” Journal of proteomics, vol. 150, p. 170–182, 2017.
@article{fu_mi_publications1939,
publisher = {Elsevier},
month = {January},
volume = {150},
title = {In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics},
year = {2017},
author = {Enrique Audain and Julian Uszkoreit and Timo Sachsenberg and Julianus Pfeuffer and Xiao Liang and Henning Hermjakob and Aniel Sanchez and Martin Eisenacher and Knut Reinert and David L. Tabb and Oliver Kohlbacher and Yasset Perez-Riverol},
pages = {170--182},
journal = {Journal of Proteomics},
url = {http://publications.imp.fu-berlin.de/1939/},
abstract = {In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available for the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated using a highly customizable KNIME workflow using four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended.}
}
• T. H. Dadi, B. Y. Renard, L. H. Wieler, T. Semmler, and K. Reinert, “Slimm: species level identification of microorganisms from metagenomes,” Peerj, vol. 5, p. e3138, 2017.
@article{fu_mi_publications2119,
pages = {e3138},
author = {Temesgen Hailemariam Dadi and Bernhard Y. Renard and Lothar H. Wieler and Torsten Semmler and Knut Reinert},
month = {March},
title = {SLIMM: species level identification of microorganisms from metagenomes},
volume = {5},
year = {2017},
journal = {PeerJ},
url = {http://publications.imp.fu-berlin.de/2119/},
abstract = {Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities, respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than when using 16S-rDNA sequences. However, shared regions of DNA among reference genomes and taxonomic units pose a significant challenge in assigning reads correctly to their true origins. The existing microbial community profiling tools commonly deal with this problem by either preparing signature-based unique references or assigning an ambiguous read to its least common ancestor in a taxonomic tree. The former method is limited to making use of the reads which can be mapped to the curated regions, while the latter suffers from the lack of uniquely mapped reads at lower (more specific) taxonomic ranks. Moreover, even if the tools exhibited good performance in calling the organisms present in a sample, there is still room for improvement in determining the correct relative abundance of the organisms. We present a new method, Species Level Identification of Microorganisms from Metagenomes (SLIMM), which addresses the above issues by using coverage information of reference genomes to remove unlikely genomes from the analysis and subsequently gain more uniquely mapped reads to assign at lower ranks of a taxonomic tree. SLIMM is based on a few, seemingly easy steps which, when combined, create a tool that outperforms state-of-the-art tools in run-time and memory usage while being on par or better in computing quantitative and qualitative information at species-level.}
}
• J. Kim and K. Reinert, “Vaquita: fast and accurate identification of structural variation using combined evidence,” in 17th international workshop on algorithms in bioinformatics (wabi 2017), R. Schwartz and K. Reinert, Eds., Saarbrücken/Wadern: Dagstuhl lipics, 2017, p. 185(13:1)–198(13:14).
@incollection{fu_mi_publications2133,
author = {Jongkyu Kim and Knut Reinert},
pages = {185(13:1)--198(13:14)},
publisher = {Dagstuhl LIPIcs},
month = {August},
booktitle = {17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
series = {LIPIcs},
number = {88},
year = {2017},
title = {Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence},
editor = {Russell Schwartz and Knut Reinert},
url = {http://publications.imp.fu-berlin.de/2133/},
abstract = {Motivation:
Comprehensive identification of structural variations (SVs) is a crucial task for studying genetic diversity and diseases. However, it remains challenging. There is only a marginal consensus between different methods, and our understanding of SVs is substantially limited. In general, integration of multiple pieces of evidence including split-read, read-pair, soft-clip, and read-depth yields the best result regarding accuracy. However, doing this step by step is usually cumbersome and computationally expensive.
Result:
We present Vaquita, an accurate and fast tool for the identification of structural variations, which leverages all four types of evidence in a single program. After merging SVs from split-reads and discordant read-pairs, Vaquita realigns the soft-clipped reads to the selected regions using a fast bit-vector algorithm. Furthermore, it also considers the discrepancy of depth distribution around breakpoints using Kullback-Leibler divergence. Finally, Vaquita provides an additional metric for candidate selection based on voting, and also provides robust prioritization based on rank aggregation. We show that Vaquita is robust in terms of sequencing coverage, insertion size of the library, and read length, and is comparable or even better for the identification of deletions, inversions, duplications, and translocations than state-of-the-art tools, using both simulated and real datasets. In addition, Vaquita is more than eight times faster than any other tool in the comparison.
Availability:}
}
• G. Myers, M. Pop, K. Reinert, and T. Warnow, “Dagstuhl reports, vol. 6, no. 8, pp. 91-130: next generation sequencing (dagstuhl seminar 16351),” , iss. DOI: 10.4230/DagRep.6.8.91, 2017.
@manual{fu_mi_publications2134,
number = {DOI: 10.4230/DagRep.6.8.91},
year = {2017},
title = {Dagstuhl Reports, Vol. 6, No. 8, pp. 91-130: Next Generation Sequencing (Dagstuhl Seminar 16351)},
type = {Documentation},
author = {Gene Myers and Mihai Pop and Knut Reinert and Tandy Warnow},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
url = {http://publications.imp.fu-berlin.de/2134/},
abstract = {Next Generation Sequencing (NGS) data have begun to appear in many applications that are clinically relevant, such as resequencing of cancer patients, disease-gene discovery and diagnostics for rare diseases, microbiome analyses, and gene expression profiling. The analysis of sequencing data is demanding because of the enormous data volume and the need for fast turnaround time, accuracy, reproducibility, and data security. This Dagstuhl Seminar aimed at a free and deep exchange of ideas and needs between the communities of algorithmicists and theoreticians and practitioners from the biomedical field. It identified several relevant fields such as data structures and algorithms for large data sets, hardware acceleration, new problems in the upcoming age of genomes, etc. which were discussed in breakout groups.}
}
• J. Pfeuffer, T. Sachsenberg, O. Alka, M. Walzer, A. Fillbrunn, L. Nilse, O. Schilling, K. Reinert, and O. Kohlbacher, “Openms – a platform for reproducible analysis of mass spectrometry data,” Journal of biotechnology, vol. 261, p. 142–148, 2017.
@article{fu_mi_publications2116,
author = {Julianus Pfeuffer and Timo Sachsenberg and Oliver Alka and Mathias Walzer and Alexander Fillbrunn and Lars Nilse and Oliver Schilling and Knut Reinert and Oliver Kohlbacher},
pages = {142--148},
journal = {Journal of Biotechnology},
publisher = {Elsevier},
month = {November},
volume = {261},
title = {OpenMS -- A platform for reproducible analysis of mass spectrometry data},
year = {2017},
abstract = {Background
In recent years, several mass spectrometry-based omics technologies emerged to investigate qualitative and quantitative changes within thousands of biologically active components such as proteins, lipids and metabolites. The research enabled through these methods potentially contributes to the diagnosis and pathophysiology of human diseases as well as to the clarification of structures and interactions between biomolecules. Simultaneously, technological advances in the field of mass spectrometry, leading to an ever increasing amount of data, demand high standards in efficiency, accuracy and reproducibility of potential analysis software.
Results
This article presents the current state and ongoing developments in OpenMS, a versatile open-source framework aimed at enabling reproducible analyses of high-throughput mass spectrometry data. It provides implementations of frequently occurring processing operations on MS data through a clean application programming interface in C++ and Python. A collection of 185 tools and ready-made workflows for typical MS-based experiments enable convenient analyses for non-developers and facilitate reproducible research without losing flexibility.
Conclusions
OpenMS will continue to increase its ease of use for developers as well as users with improved continuous integration/deployment strategies, regular trainings with updated training materials and multiple sources of support. The active developer community ensures the incorporation of new features to support state-of-the-art research.},
url = {http://publications.imp.fu-berlin.de/2116/}
}
• C. Pockrandt, M. Ehrhardt, and K. Reinert, “Epr-dictionaries: a practical and fast data structure for constant time searches in unidirectional and bidirectional fm indices,” in Research in computational molecular biology. recomb 2017, S. Sahinalp, Ed., Springer, cham, 2017, vol. 10229, p. 190–206.
@incollection{fu_mi_publications2118,
pages = {190--206},
author = {Christopher Pockrandt and Marcel Ehrhardt and Knut Reinert},
editor = {S. Sahinalp},
title = {EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices},
volume = {10229},
year = {2017},
series = {Lecture Notes in Computer Science (LNCS)},
month = {April},
booktitle = {Research in Computational Molecular Biology. RECOMB 2017},
publisher = {Springer, Cham},
abstract = {The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If {\ensuremath{\sigma}} is the size of the alphabet then the method of Lam et al. can conduct one step in time O({\ensuremath{\sigma}}) while needing space O({\ensuremath{\sigma}}{$\cdot$}n) using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to O(log{\ensuremath{\sigma}}) while using O(log{\ensuremath{\sigma}}{$\cdot$}n) bits of space for both the FM and 2FM index. This is achieved by the use of binary wavelet trees.
In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in O(1) time per step while using O(log{\ensuremath{\sigma}}{$\cdot$}n)+o(log{\ensuremath{\sigma}}{$\cdot$}{\ensuremath{\sigma}}{$\cdot$}n) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).
We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that we are {$\approx$}2.2--4.2 times faster. This will have a large impact for many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g. for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.},
url = {http://publications.imp.fu-berlin.de/2118/}
}
• K. Reinert, T. H. Dadi, M. Ehrhardt, H. Hauswedell, S. Mehringer, R. Rahn, J. Kim, C. Pockrandt, J. Winkler, E. Siragusa, G. Urgese, and D. Weese, “The seqan c++ template library for efficient sequence analysis: a resource for programmers,” Journal of biotechnology, vol. 261, p. 157–168, 2017.
@article{fu_mi_publications2103,
year = {2017},
title = {The SeqAn C++ template library for efficient sequence analysis: A resource for programmers},
volume = {261},
month = {November},
publisher = {Elsevier},
journal = {Journal of Biotechnology},
pages = {157--168},
author = {Knut Reinert and Temesgen Hailemariam Dadi and Marcel Ehrhardt and Hannes Hauswedell and Svenja Mehringer and Ren{\'e} Rahn and Jongkyu Kim and Christopher Pockrandt and J{\"o}rg Winkler and Enrico Siragusa and Gianvito Urgese and David Weese},
abstract = {Background
The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (D{\"o}ring et al., 2008).
Results
The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools.
Conclusions
We anticipate that SeqAn will continue to be a valuable resource, especially since it started to actively support various hardware acceleration techniques in a systematic manner.},
keywords = {NGS analysis; Software libraries; C++; Data structures},
url = {http://publications.imp.fu-berlin.de/2103/}
}
• J. T. Roehr, C. Dieterich, and K. Reinert, “Flexbar 3.0 – simd and multicore parallelization,” Bioinformatics, vol. 33, iss. 18, p. 2941–2942, 2017.
@article{fu_mi_publications2117,
month = {September},
year = {2017},
number = {18},
title = {Flexbar 3.0 -- SIMD and multicore parallelization},
volume = {33},
pages = {2941--2942},
author = {Johannes T. Roehr and Christoph Dieterich and Knut Reinert},
journal = {Bioinformatics},
url = {http://publications.imp.fu-berlin.de/2117/},
abstract = {Motivation:
High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible.
Results:
We present Flexbar 3.0, the successor of the popular program Flexbar. It now employs twofold parallelism: multi-threading and additionally SIMD vectorization. Both types of parallelism are used to speed up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark.
Availability and implementation:
https://github.com/seqan/flexbar
Contact:
johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de}
}
• B. Vatansever, A. Muñoz, C. L. Klein, and K. Reinert, “Development and optimisation of a generic micro lc-esi-ms method for the qualitative and quantitative determination of 30-mer toxic gliadin peptides in wheat flour for food analysis,” Analytical and bioanalytical chemistry, vol. 409, iss. 4, p. 989–997, 2017.
@article{fu_mi_publications1976,
pages = {989--997},
author = {B. Vatansever and A. Mu{\~n}oz and C. L. Klein and K. Reinert},
journal = {Analytical and Bioanalytical Chemistry},
publisher = {Springer Berlin Heidelberg},
month = {February},
volume = {409},
title = {Development and optimisation of a generic micro LC-ESI-MS method for the qualitative and quantitative determination of 30-mer toxic gliadin peptides in wheat flour for food analysis},
number = {4},
year = {2017},
url = {http://publications.imp.fu-berlin.de/1976/},
abstract = {We sometimes see manufactured bakery products on the market which are labelled as being gluten free. Why is the content of such gluten proteins of importance for the fabrication of bakery industry and for the products? The gluten proteins represent up to 80 \% of wheat proteins, and they are conventionally subdivided into gliadins and glutenins. Gliadins belong to the proline and glutamine-rich prolamin family. Its role in human gluten intolerance, as a consequence of its harmful effects, is well documented in the scientific literature. The only known therapy so far is a gluten-free diet, and hence, it is important to develop robust and reliable analytical methods to quantitatively assess the presence of the identified peptides causing the so-called coeliac disease. This work describes the development of a new, fast and robust micro ion pair-LC-MS analytical method for the qualitative and quantitative determination of 30-mer toxic gliadin peptides in wheat flour. The use of RapiGest SF as a denaturation reagent prior to the enzymatic digestion was shown to shorten the measuring time. During the optimisation of the enzymatic digestion step, the best 30-mer toxic peptide was identified from the maximum recovery after 3 h of digestion time. The lower limit of quantification was determined to be 0.25 ng/{\ensuremath{\mu}}L. The method was shown to be linear for the selected concentration range of 0.25--3.0 ng/{\ensuremath{\mu}}L. The uncertainty related to reproducibility of measurement procedure, excluding the extraction step, was shown to be 5.0 \% (N = 12). Finally, this method was successfully applied to the quantification of 30-mer toxic peptides from commercial wheat flour with an overall uncertainty under reproducibility conditions of 6.4 \% including the extraction of the gliadin fraction. The results were always expressed as the average of the values from all standard concentrations.
Subsequently, the final concentration of the 30-mer toxic peptide in the flour was calculated and expressed in milligrams per gram unit. The determined, calculated concentration of the 30-mer toxic peptide in the flour was found to be 1.29 {$\pm$} 0.37 {\ensuremath{\mu}}g/g in flour (N = 25, sy = 545,075, f = 25 {$-$} 2 (t = 2.069), P = 95 \%, two-sided).}
}