The current explosion of stored information necessitates a new model of pattern matching, that of compressed matching. In this model one tries to find all occurrences of a pattern in a compressed text in time proportional to the compressed text size, i.e., without decompressing the text. The most effective general purpose compression algorithms are adaptive, in that the text represented by each compression symbol is determined dynamically by the data. As a result, the encoding of a substring depends on its location. Thus the same substring may``look different'' every time it appears in the compressed text. In this paper we consider pattern matching without decompression in the UNIX Z-compression. This is a variant of the Lempel Ziv adaptive compression scheme. If n is the length of the compressed text and m is the length of the pattern, our algorithms find the first pattern occurrence in time O(n+m 2 ) or O(n log m+m). We also introduce a new criterion to measure compressed matching algorithms, that of extra space. We show how to modify our algorithms to achieve a trade-off between the amount of extra space used and the algorithm's time complexity.
Abstract. Jumbled indexing is the problem of indexing a text T for queries that ask whether there is a substring of T matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years; for a partial list see [2,6,13,16,17,20,22,24,26,30,35,36]. There is a naive algorithm that preprocesses all answers in O(n 2 |Σ|) time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has O(n log |Σ|) query time. Despite a tremendous amount of effort there has been little improvement over these running times. In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size ω(1) requires Ω(n 2−ǫ ) preprocessing time or Ω(n 1−δ ) query time for any ǫ, δ > 0. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size r ≥ 3 there exist describable fixed constant ǫr and δr such that jumbled indexing requires Ω(n 2−ǫr ) preprocessing time or Ω(n 1−δr ) query time.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.