Sequential pattern mining, also known as frequency based string mining, has been the objective of various algorithmic approaches over the last years. A recent breakthrough, the linear time algorithm by Fischer, Heun, and Kramer (FHK), has turned out to be theoretically optimal and also quite fast in practice. In 2008, we presented a conceptually much simpler algorithm based on a deferred data structure that is faster and uses less memory at the same time. In this paper, we give an in-depth presentation of this algorithm and show how to use it on multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to devise the Entropy Substring Mining Problem which is a multiple database generalization of the Emerging Substring Mining Problem. In addition, we evaluate the algorithm rigorously using various string domains, e.g., natural language, DNA, or protein sequences. The experiments demonstrate the improvement of our algorithm compared to recent approaches.


Please Cite

  • D. Weese, M. H. Schulz, P. Perner, “Efficient String Mining under Constraints via the Deferred Frequency Index”, 2008-07.
    cite this publication
     abstract = { We propose a general approach for frequency based string mining,   which has many applications, e.g. in contrast data mining. Our contribution
    is a novel algorithm based on a deferred data structure. Despite
    its simplicity, our approach is up to 4 times faster and uses about
    half the memory compared to the best-known algorithm of Fischer et
    al. Applications in various string domains, e.g. natural language,
    DNA or protein sequences, demonstrate the improvement of our algorithm.},
     author = {D. Weese and M. H. Schulz},
     booktitle = {Proceedings of the 8th Industrial Conference on Data Mining (ICDM'08)},
     editor = {P. Perner},
     month = {July},
     pages = {374--388},
     publisher = {Springer Verlag},
     title = {Efficient String Mining under Constraints via the Deferred Frequency        Index},
     url = {http://publications.imp.fu-berlin.de/412/},
     year = {2008}


For questions, comments, or suggestions please contact:

David Weese david.weese@fu-berlin.de