String Similarity Search/Join

Abstract

We present in this paper scalable algorithms for optimal string similarity search and join. Our methods are variations of those applied in Masai, our recently published tool for mapping high-throughput DNA sequencing data with unpreceded speed and accuracy. The key features of our approach are filtration with approximate seeds and methods for multiple backtracking. Approximate seeds, compared to exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds. Combined together, these two methods significantly speed up string similarity search and join operations.

Binaries Download

Our tool is released under BSD license. It is implemented in C++ and OpenMP using the SeqAn library, and supports Linux, Mac OS X, and Windows.

Compilation from Source Code

Follow the Getting Started section and check out the SeqAn trunk. Instead of creating a project file in Debug mode, switch to Release mode (-DCMAKE_BUILD_TYPE=Release) and compile the project. This can be done as follows:

svn co http://svn.seqan.de/seqan/trunk seqan-trunk
mkdir seqan-trunk/build
cd seqan-trunk/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make search join

If the binaries will be run on the same machine they are compiled on, you may consider to optimize for the given system architecture. For gcc or llvm/clang compilers this can be done with:

cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS:STRING="-march=native"
make search join

After successful compilation, copy the binary to a folder in your PATH variable, e.g. /usr/local/bin:

sudo cp bin/{search, join} /usr/local/bin

Contact

For questions, comments, or suggestions feel free to contact Enrico Siragusa.

References

Siragusa, E., Weese D., & Reinert, K. (2013). Scalable String Similarity Search/Join with Approximate Seeds and Multiple Backtracking. EDBT/ICDT ’13, March 18 – 22 2013, Genoa, Italy

Last Update 11. June 2013