We implement and compare the major current techniques for representing general trees in succinct form. This is important because a general tree of n nodes is usually represented in pointer form, requiring O(n log n) bits, whereas the succinct representations we study require just 2n + o(n) bits and carry out many sophisticated operations in constant time. Yet, there is no exhaustive study in the literature comparing the practical magnitudes of the o(n)-space and the O(1)-time terms. The techniques can be classified into three broad trends: those based on BP (balanced parentheses in preorder), those based on DFUDS (depth-first unary degree sequence), and those based on LOUDS (level-ordered unary degree sequence). BP and DFUDS require a balanced parentheses representation that supports the core operations findopen, findclose, and enclose, for which we implement and compare three major algorithmic proposals. All the tree representations also require the core operations rank and select on bitmaps, which are already well studied in the literature. We show how to predict the time and space performance of most variants by combining these core operations, and also study some tree operations for which specialized implementations exist. This is especially relevant for a recent proposal (K. Sadakane and G. Navarro, SODA'10) which, although belonging to class BP, deviates from the main techniques in some cases in order to achieve constant time for the widest range of operations. We experiment over various types of real-life trees and of traversals, and conclude that the latter technique stands out as an excellent practical combination of space occupancy, time performance, and functionality, whereas others, particularly LOUDS, are still interesting in some limited-functionality niches.
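To make the BP core operations concrete, the following is a minimal sketch in Python, not the implementation compared in the paper. It encodes a tree (given, as an assumption of this example, as nested lists of children) into a balanced parentheses string in preorder, and answers findclose, findopen, and enclose by linear scans. The succinct structures discussed above answer the same queries in constant time using o(n) extra bits; the function names `bp_encode` and `subtree_size` are illustrative only.

```python
def bp_encode(tree):
    """Encode a tree (a list of child subtrees) as balanced parentheses in preorder."""
    return "(" + "".join(bp_encode(child) for child in tree) + ")"


def findclose(bp, i):
    """Return the position of the ')' matching the '(' at position i (linear scan)."""
    assert bp[i] == "("
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def findopen(bp, j):
    """Return the position of the '(' matching the ')' at position j (linear scan)."""
    assert bp[j] == ")"
    depth = 0
    for i in range(j, -1, -1):
        depth += 1 if bp[i] == ")" else -1
        if depth == 0:
            return i
    raise ValueError("unbalanced sequence")


def enclose(bp, i):
    """Return the '(' of the tightest pair enclosing the pair opened at i (the parent),
    or -1 if the pair at i is the root."""
    assert bp[i] == "("
    skipped = 0  # closed pairs seen while scanning left that still need their '('
    for k in range(i - 1, -1, -1):
        if bp[k] == ")":
            skipped += 1
        elif skipped == 0:
            return k
        else:
            skipped -= 1
    return -1


def subtree_size(bp, i):
    """Number of nodes in the subtree whose '(' is at position i."""
    return (findclose(bp, i) - i + 1) // 2


# A root with two children; the first child has two leaf children.
bp = bp_encode([[[], []], []])
print(bp, len(bp))  # 2 parentheses (bits) per node
```

In this encoding a node is identified by the position of its opening parenthesis, so parent is enclose and the subtree size follows directly from findclose.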
Abstract. The LZ-index is a compressed full-text self-index able to represent a text $T_{1 \ldots u}$, over an alphabet of size $\sigma$ and with $k$-th order empirical entropy $H_k(T)$, using $4uH_k(T) + o(u \log \sigma)$ bits for any $k = o(\log_\sigma u)$. It can report all the $occ$ occurrences of a pattern $P_{1 \ldots m}$ in $T$ in $O(m^3 \log \sigma + (m + occ) \log u)$ worst-case time. This is the only existing data structure of size $O(uH_k(T))$ able to spend $O(\log u)$ time per occurrence reported. Its main drawback is the factor 4 in its space complexity, which makes it larger than other state-of-the-art alternatives. In this paper we present two different approaches to reduce the space requirement of the LZ-index. In both cases we achieve $(2 + \epsilon)uH_k(T) + o(u \log \sigma)$ bits of space, for any constant $\epsilon > 0$, and we simultaneously improve the search time to $O(m^2 \log m + (m + occ) \log u)$. Both indexes support displaying any subtext of length $\ell$ in optimal $O(\ell / \log_\sigma u)$ time. In addition, we show how the space can be squeezed to $(1 + \epsilon)uH_k(T) + o(u \log \sigma)$ to obtain a structure with $O(m^2)$ average search time for $m \geq 2 \log_\sigma u$.
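The space bounds above are stated in terms of the $k$-th order empirical entropy $H_k(T)$. As an illustration only, here is a small Python sketch of the standard definition: $H_0$ is the zero-order entropy of the symbol frequencies, and $H_k(T)$ averages the $H_0$ of the symbols that follow each length-$k$ context, weighted by how often the context occurs. The function names `h0` and `hk` are assumptions of this example and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zero-order empirical entropy of a sequence, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(seq).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)  # context of length k -> symbols that follow it
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


print(h0("abab"))         # two equally frequent symbols
print(hk("abababab", 1))  # each symbol is fully determined by the previous one
```

A text whose next symbol is predictable from its context has a small $H_k$, which is why bounds of the form $c \cdot uH_k(T)$ bits can be far below the $u \log \sigma$ bits of the plain text.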
Abstract. Given a text $T[1..u]$ over an alphabet of size $\sigma$, the full-text search problem consists in finding the $occ$ occurrences of a given pattern $P[1..m]$ in $T$. In indexed text searching we build an index on $T$ to improve the search time, at the cost of increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The LZ-index of Navarro is a compressed full-text self-index able to represent $T$ using $4uH_k(T) + o(u \log \sigma)$ bits of space, where $H_k(T)$ denotes the $k$-th order empirical entropy of $T$, for any $k = o(\log_\sigma u)$. This space is about four times the compressed text size. It can locate all the $occ$ occurrences of a pattern $P$ in $T$ in $O(m^3 \log \sigma + (m + occ) \log u)$ worst-case time. Although this index has proven very competitive in practice, the $O(m^3 \log \sigma)$ term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices, improving the overall performance of the LZ-index. We achieve indices requiring $(2 + \epsilon)uH_k(T) + o(u \log \sigma)$ bits of space, for any constant $\epsilon > 0$, which makes our indices the smallest existing LZ-indices. We simultaneously improve the search time to $O(m^2 + (m + occ) \log u)$, which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length $\ell$ in optimal $O(\ell / \log_\sigma u)$ time. In addition, we show how the space can be squeezed to $(1 + \epsilon)uH_k(T) + o(u \log \sigma)$ to obtain a structure with $O(m^2)$ average search time for $m \geq 2 \log_\sigma u$.
Alternatively, the search time of LZ-indices can be improved to $O((m + occ) \log u)$ with $(3 + \epsilon)uH_k(T) + o(u \log \sigma)$ bits of space, which is about half of the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.
Abstract. A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires $4uH_k(1 + o(1))$ bits of space, where $u$ is the text length in characters and $H_k$ is its $k$-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring $(4 + \epsilon)uH_k + o(u)$ bits of space, for any constant $0 < \epsilon < 1$, and $O(\sigma u)$ time, where $\sigma$ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.
Given a text $T[1..u]$ over an alphabet of size $\sigma$, the full-text search problem consists in finding the $occ$ occurrences of a given pattern $P[1..m]$ in $T$. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a space-efficient representation of it, while at the same time providing indexed access to the text. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 4 times the size of the compressed text, i.e. $4uH_k(T) + o(u \log \sigma)$ bits of space, where $H_k(T)$ is the $k$-th order empirical entropy of text $T$. This index has proven very competitive in practice for locating pattern occurrences and extracting text snippets. However, the LZ-index is larger than competing schemes, and does not offer space/time tuning options, which limits its applicability in many practical scenarios. In this paper we study several ways to reduce the space of the LZ-index, from a practical point of view and in different application scenarios. The main idea used to reduce the space is to regard the original index as a navigation scheme that allows us to move between index components. Then we perform an abstract optimization on this scheme, defining alternative schemes that support the same navigation, yet reducing the original redundancy. We obtain reduced LZ-indices requiring $3uH_k(T) + o(u \log \sigma)$ and $(2 + \epsilon)uH_k(T) + o(u \log \sigma)$ bits of space, for any $0 < \epsilon < 1$. Our LZ-indices have an average locating time of $O(m^2 + n / \sigma^{m/2})$, which is $O(m^2)$ for $m \geq 2 \log_\sigma u$. We perform extensive experimentation to show that our developments lead to reduced LZ-indices that are competitive with the state of the art in many practical situations, providing interesting space/time trade-offs and making it possible in many cases to replace the original LZ-index with a smaller, yet competitive, representation.
Given the space that our indices require, they are in most cases the best alternative for the key operations of extracting arbitrary text substrings, as well as searching and then displaying the contexts surrounding the pattern occurrences.
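Since the LZ-index is described above as based on the LZ78 compression algorithm, a short Python sketch of textbook LZ78 parsing may help. It only shows how the text is cut into phrases, each one extending an earlier phrase by one character; it is not the index itself nor the authors' code, and the function names `lz78_parse` and `lz78_decode` are assumptions of this example.

```python
def lz78_parse(text):
    """Cut text into LZ78 phrases, returned as (earlier_phrase_id, new_char) pairs.

    Phrase ids start at 1; id 0 denotes the empty phrase.
    """
    trie = {}      # (phrase_id, char) -> id of the phrase extending phrase_id by char
    phrases = []
    node = 0       # current position in the trie of phrases (0 is the root)
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                        # keep extending a known phrase
        else:
            phrases.append((node, ch))        # new phrase = known phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        phrases.append((node, ""))            # text ended inside a known phrase
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the phrase list produced by lz78_parse."""
    strings = [""]
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])


phrases = lz78_parse("abababa")
print(phrases)               # phrases are a, b, ab, aba
print(lz78_decode(phrases))
```

The dictionary `trie` plays the role of the trie of phrases: every phrase is a child of the earlier phrase it extends, which is the kind of structure an LZ78-based index can navigate.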