Text compression techniques like bzip2 do not support searching for or updating substrings at given positions of a compressed text without first decompressing it. We have developed the Indexed Reversible Transformation (IRT), a modified version of the Burrows-Wheeler Transformation (BWT) that, in combination with run-length encoding (RLE) and wavelet trees (WT), allows position-based searching and updating of substrings of compressed texts without prior decompression. As a result, IRT may be useful for a large class of applications that, because of space limitations, prefer to search or modify compressed texts rather than uncompressed ones.
Abstract. Text compression techniques like bzip2 do not support inserting or deleting strings at a given position of a compressed text without first decompressing it. We present a technique called DICIRT that supports fast insertion into and deletion from compressed texts without full decompression. For inserted fragments of up to 8% of the original text size, and for deleted fragments of up to 15% of the original text size, DICIRT is faster than decompressing the text, modifying it, and compressing it again.
Text compression techniques like bzip2 do not support deleting the nth word of a compressed text, or inserting text before its nth word, without first decompressing it. We present a text compression technique that supports fast insertion into and deletion from compressed texts without full decompression. Our approach combines the Indexed Reversible Transformation (IRT) [1], Run-Length Encoding (RLE), and the Wavelet Tree (WT). For a reasonable size of inserted or deleted texts (more details are given in [2]), our approach is faster than decompressing the text, modifying it, and compressing it again.

Let IRT(S) denote the Burrows-Wheeler Transformation (BWT) applied to a text S according to an ordering relation A_$ that fulfills the following conditions. The lexicographical order of the word delimiters '$' is changed in such a way that all '$' of S get the smallest lexicographical order in A_$ and, most importantly, the order of the word delimiters among themselves is determined by their occurrence in S from left to right. That is, the nth word delimiter '$' appearing in S gets a smaller lexicographical order in A_$ than the (n+1)st word delimiter '$'. Furthermore, IRT sorts the characters of S according to their prefix (as opposed to BWT, which sorts them according to their suffix). Thus, in contrast to BWT, the first character of the nth word of S occurs at position n of IRT(S). This provides a self-index to the first character of each word of S, which allows each word to be reconstructed individually without retransforming IRT(S) as a whole [1].

On IRT(S), we apply RLE, which returns a bit-stream B and a string R. B is the run-length bit-vector of IRT(S): it contains a 0-bit for each character in IRT(S) that is equal to the previous character and a 1-bit otherwise. R is IRT(S) after reducing each run of equal characters within IRT(S) to one character.
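The RLE step can be illustrated with a minimal Python sketch. The function names `rle_encode` and `rle_decode` are hypothetical and not taken from the paper, and the input is an arbitrary string standing in for IRT(S). The first character has no predecessor, so this sketch assumes it receives a 1-bit.

```python
def rle_encode(text):
    """Split text into a run-length bit-vector B and a run-head string R.

    B[i] is 1 if text[i] starts a new run (differs from text[i-1]),
    and 0 if it repeats the previous character.
    Assumption: the first character always gets a 1-bit.
    """
    bits = []
    heads = []
    prev = None
    for ch in text:
        if ch == prev:
            bits.append(0)
        else:
            bits.append(1)
            heads.append(ch)
        prev = ch
    return bits, "".join(heads)


def rle_decode(bits, heads):
    """Rebuild the original text from B and R."""
    out = []
    k = -1  # index into heads of the current run
    for b in bits:
        k += b  # a 1-bit advances to the next run head
        out.append(heads[k])
    return "".join(out)


if __name__ == "__main__":
    B, R = rle_encode("aaabccdd")
    print(B, R)
    print(rle_decode(B, R))
```

In the decoder, the running sum of 1-bits up to position i gives the index in R of the character at position i. This is the kind of lookup that a Rank function on B answers directly.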
The WT W of R stores the bits of the Huffman codes of all characters c_i of R, such that for each c_i, (1) the Huffman code of c_i is stored on a path from the root node to the leaf node representing c_i in W, (2) one bit of the Huffman code of c_i is stored on each node of the path to c_i except for the leaf node, and (3) if a 0-bit (1-bit) is stored in a node of W, the path continues with the left (right) sub-tree of W.

To delete the nth word of S, we search, mark, and remove the bits of its letters from B and W. The search starts at position B[n] and uses Rank and Select functions [2] on B and the nodes of W to proceed to the bits representing the remaining characters of the word. The found bits are marked, and all unmarked bits represent the compressed text after deleting the nth word of S. To insert a word before the nth word of S into the compressed representation (B,W) of S, we could, in reverse to deletion, search and mark the insert positions in B and W letter by letter and insert the appropriate bits, which computes the compressed representation (B2,W2) of the result string S2. However, in...
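The Huffman-shaped wavelet tree described above, together with a Rank query on it, can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the names `huffman_codes`, `Node`, and `WaveletTree` are hypothetical, the input is assumed to be non-empty, Huffman ties are broken arbitrarily, and rank is computed by naive counting rather than by a succinct constant-time structure.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Map each symbol of a non-empty text to a Huffman code ('0'/'1' string)."""
    freq = Counter(text)
    if len(freq) == 1:  # a single symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {c: ""}) for c, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


class Node:
    """Inner node: holds one code bit per symbol routed through it."""

    def __init__(self, symbols, codes, depth=0):
        self.bits = [int(codes[c][depth]) for c in symbols]
        self.children = [None, None]  # None marks a leaf below this node
        for b in (0, 1):
            sub = [c for c in symbols if codes[c][depth] == str(b)]
            # Recurse only if the codes on this side have further bits.
            if sub and len(codes[sub[0]]) > depth + 1:
                self.children[b] = Node(sub, codes, depth + 1)


class WaveletTree:
    def __init__(self, text):
        self.codes = huffman_codes(text)
        self.decode = {code: c for c, code in self.codes.items()}
        self.root = Node(list(text), self.codes)

    def access(self, i):
        """Return the symbol at position i by following its code bits."""
        node, code = self.root, ""
        while node is not None:
            b = node.bits[i]
            code += str(b)
            i = node.bits[:i].count(b)  # naive rank of b before position i
            node = node.children[b]
        return self.decode[code]

    def rank(self, c, i):
        """Count occurrences of symbol c in the first i positions."""
        node = self.root
        for ch in self.codes[c]:
            b = int(ch)
            i = node.bits[:i].count(b)
            node = node.children[b]
        return i


if __name__ == "__main__":
    wt = WaveletTree("abracadabra")
    print("".join(wt.access(i) for i in range(11)))
    print(wt.rank("a", 11))
```

Both queries walk a single root-to-leaf path, so their cost is bounded by the Huffman code length of the symbol times the cost of one rank on a node's bit-vector. A practical implementation would replace the list slicing with a bit-vector that supports fast Rank and Select.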