Weese, D., Schulz, M. H. (2008). Efficient String Mining under Constraints via the Deferred Frequency Index.
Sequential pattern mining, also known as frequency based string mining, has been the objective of various algorithmic approaches over the last years. A recent breakthrough, the linear time algorithm by Fischer, Heun and Kramer (FHK), has turned out to be theoretically optimal and also quite fast in practice. In 2008 we presented a conceptually much simpler algorithm based on a deferred data structure that is faster and uses less memory at the same time. In this paper we give an in-depth presentation of this algorithm and show how to use it on multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to devise the Entropy Substring Mining Problem which is a multiple database generalization of the Emerging Substring Mining Problem. In addition we evaluate the algorithm rigorously using various string domains, e.g., natural language, DNA, or protein sequences. The experiments demonstrate the improvement of our algorithm compared to recent approaches.
Supported platforms are: Windows, Linux and Mac OS X. Please take a look at the README file for usage instructions.
The DFI is part of SeqAn. To compile our implementation follow the instructions in the README file.
Last Update 13. May 2013