We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result, we combine several recent findings, including string attractors —new combinatorial objects encompassing most known compressibility measures for highly repetitive texts—and grammars based on locally consistent parsing . More in detail, letγ be the size of the smallest attractor for a text T of length n . The measureγ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel–Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use space O (γ log ( n /γ)), and our lightest indexes work within the same asymptotic space. Let ε > 0 be a suitably small constant fixed at construction time, m be the pattern length, and occ be the number of its text occurrences. Our index counts pattern occurrences in O ( m +log 2+ε n ) time and locates them in O ( m +( occ +1)log ε n ) time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within O (( m + occ ),polylog, n ) time. Further, by increasing the space to O (γ log ( n /γ)log ε n ), we reduce the locating time to the optimal O ( m + occ ), and within O (γ log ( n /γ)log n ) space we can also count in optimal O ( m ) time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in O ( n ) space and O ( n log n ) expected time. As a by-product of independent interest, we show how to build, in O ( n ) expected time and without knowing the sizeγ of the smallest attractor (which is NP-hard to find), a run-length context-free grammar of size O (γ log ( n /γ)) generating (only) T . As a result, our indexes can be built without knowingγ.
The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lg lg n + lg ǫ z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ǫ > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n 1−δ ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lg lg z), and one where the query time changes from worst-case to expected. These results improve the previously best known solutions. Notably, this is the first data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D-range reporting.
Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
Abstract. We consider the well-studied partial sums problem in succint space where one is to maintain an array of n k-bit integers subject to updates such that partial sums queries can be efficiently answered. We present two succint versions of the Fenwick Tree -which is known for its simplicity and practicality. Our results hold in the encoding model where one is allowed to reuse the space from the input data. Our main result is the first that only requires nk + o(n) bits of space while still supporting sum/update in O(log b n) / O(b log b n) time where 2 ≤ b ≤ log O(1) n. The second result shows how optimal time for sum/update can be achieved while only slightly increasing the space usage to nk + o(nk) bits. Beyond Fenwick Trees, the results are primarily based on bit-packing and sampling -making them very practical -and they also allow for simple optimal parallelization.
Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index f , called the finger, and the query index i. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant where also moving the finger such that the time depends on the distance moved is supported.Let n be the size the grammar, and let N be the size of the string. For the static variant we give a linear space representation that supports placing the finger in O(log N ) time and subsequently accessing in O(log D) time, where D is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in O(log N ) time and accessing and moving the finger in O(log D + log log N ) time. Compared to the best linear space solution to random access, we improve a O(log N ) query bound to O(log D) for the static variant and to O(log D + log log N ) for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.