The proliferation of online text, such as that found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time.

The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences.

We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant 0 < ε ≤ 1, using at most (ε^{-1} + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes.
Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ lg^ε_{|Σ|} n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m / lg_{|Σ|} n + occ) time, which is optimal.
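For orientation, the uncompressed baseline that the abstract compares against can be sketched in a few lines of Python. This is only a plain suffix array with binary search, not the compressed index described above; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting the suffixes directly; fine for a small
    illustration, far too slow for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text using suffix array sa.

    Two binary searches locate the contiguous block of suffixes whose first
    m symbols equal the pattern. Each comparison costs O(m), so the search
    takes O(m lg n) time; reaching O(m + lg n) needs extra longest-common-prefix
    information that this sketch omits.
    """
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # sa[start:lo] holds exactly the occ occurrences; sort them for readability.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

The array stores n integers of about lg n bits each, which is the Ω(n lg n)-bit overhead the abstract refers to.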
We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be successfully applied to database indexing and software duplication.
Consider a sequence S of n symbols drawn from an alphabet A = {1, 2, ..., σ}, stored as a binary string of n log σ bits. A succinct data structure on S supports a given set of primitive operations on S using just f(n) = o(n log σ) extra bits. We present a technique for transforming succinct data structures (which do not change the binary content of S) into compressed data structures using n H_k + f(n) + O(n (log σ + log log_σ n + k) / log_σ n) bits of space, where H_k ≤ log σ is the kth-order empirical entropy of S. When k + log σ = o(log n), we improve the space complexity of the succinct data structure from n log σ + o(n log σ) to n H_k + o(n log σ) bits by keeping S in compressed format, so that any substring of O(log_σ n) symbols in S (i.e., O(log n) bits) can be decoded on the fly in constant time. Thus, the time complexity of the supported operations does not change asymptotically. Namely, if an operation takes t(n) time in the succinct data structure, it requires O(t(n)) time in the resulting compressed data structure. Using this simple approach we improve the space complexity of some of the best known results on succinct data structures. We extend our results to handle another definition of entropy.
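To make the quantity H_k in the bound above concrete, here is a minimal Python sketch that computes the kth-order empirical entropy of a string. It assumes the commonly used definition H_k(S) = (1/n) Σ_w |S_w| H_0(S_w), where S_w collects the symbols that follow each occurrence of the length-k context w; the abstract does not spell this definition out, and the function name `empirical_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy(s, k):
    """Return the kth-order empirical entropy of s, in bits per symbol.

    Assumed definition: group every symbol s[i] (i >= k) by the k symbols
    that precede it, sum the zeroth-order entropy cost of each group, and
    divide by n = len(s). With k = 0 there is a single empty context, which
    gives the ordinary zeroth-order entropy H_0.
    """
    n = len(s)
    if n == 0:
        return 0.0

    # followers[w] counts the symbols that appear right after context w.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[s[i - k:i]][s[i]] += 1

    total_bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        for c in counts.values():
            # A symbol seen c times out of size contributes c * log2(size / c) bits.
            total_bits += c * math.log2(size / c)
    return total_bits / n


print(empirical_entropy("abab", 0))  # 1.0
print(empirical_entropy("abab", 1))  # 0.0
```

In the second call each symbol is fully determined by its predecessor, so H_1 drops to zero while H_0 stays at log σ = 1 bit. This gap is what the n H_k term exploits relative to the n log σ bits of the plain representation.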