Complete inverted files for efficient text retrieval and analysis

Blumer, Anselm; Blumer, J.; Haussler, David; McConnell, Ross M.; Ehrenfeucht, Andrzej

doi:10.1145/28869.28873

Cited by 202 publications

(163 citation statements)

References 12 publications

Supporting

Mentioning

157

Contrasting

Unclassified


“…In this subsection, we recall the equivalence relations introduced by Blumer et al [10,1], and then state their properties. Throughout this paper, we consider the equivalence classes of the input string w that ends with a distinct symbol $ that does not appear anywhere else in w. For any string x ∈ Substr(w), let,…”

Section: Equivalence Relations On Strings (mentioning)

confidence: 99%

“…However, a given text contains too many substrings to browse or analyze. A reasonable approach is to partition the set of substrings into equivalence classes under the equivalence relation of [1] so that an expert can examine the classes one by one [3]. This equivalence relation groups together substrings that correspond to essentially identical occurrences in the text.…”

Section: Introduction (mentioning)

confidence: 99%
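
To make the idea of partitioning substrings into equivalence classes concrete, here is a minimal Python sketch. It groups substrings by their set of end positions, which is the relation usually associated with DAWG states; this is an assumption chosen for simplicity and is not necessarily the exact relation of [1] used in the quoted papers. The function name `endpos_classes` is hypothetical, and the enumeration is deliberately naive (quadratic in the number of substrings), unlike the linear-time methods the citing papers discuss.

```python
from collections import defaultdict


def endpos_classes(w):
    """Group the distinct non-empty substrings of w by their end-position sets.

    Two substrings land in the same class exactly when every occurrence
    of one ends where an occurrence of the other ends.
    """
    ends = defaultdict(set)
    n = len(w)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # The occurrence w[i:j] ends at (1-based) position j.
            ends[w[i:j]].add(j)

    classes = defaultdict(list)
    for sub, positions in ends.items():
        classes[frozenset(positions)].append(sub)

    # Sort each class from shortest to longest member for readability.
    return {key: sorted(members, key=len) for key, members in classes.items()}


if __name__ == "__main__":
    for positions, members in endpos_classes("abab").items():
        print(sorted(positions), members)
```

For the input "abab", the substrings "b" and "ab" share the end positions {2, 4} and so fall into one class, which illustrates how redundant substrings collapse together.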

“…Thus, we consider these succinct expressions of the equivalence classes, which require only O(n) space. The succinct expressions can easily be computed using the CDAWG data structure proposed by [1], which is an acyclic graph structure whose nodes correspond to the equivalence classes. Although CDAWGs can be constructed in O(n) time and space [4], we present a more efficient algorithm based on suffix arrays.…”

Section: Introduction (mentioning)

confidence: 99%

“…This paper considers enumeration of substring equivalence classes introduced by Blumer et al [1]. They used the equivalence classes to define an index structure called compact directed acyclic word graphs (CDAWGs).…”

mentioning

confidence: 99%


Efficient Computation of Substring Equivalence Classes with Suffix Arrays

et al. 2016


Abstract. This paper considers enumeration of substring equivalence classes introduced by Blumer et al. [1]. They used the equivalence classes to define an index structure called compact directed acyclic word graphs (CDAWGs). In text analysis, considering these equivalence classes is useful since they group together redundant substrings with essentially identical occurrences. In this paper, we present how to enumerate those equivalence classes using suffix arrays. Our algorithm uses rank and lcp arrays for traversing the corresponding suffix trees, but does not need any other additional data structure. The algorithm runs in linear time in the length of the input string. We show experimental results comparing the running times and space consumptions of our algorithm, suffix tree and CDAWG based approaches.
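
The abstract mentions rank and lcp arrays built on top of a suffix array. As a rough illustration of those ingredients only, the sketch below computes an lcp array from a suffix array and its inverse (the rank array) using the well-known Kasai et al. technique. It is not the enumeration algorithm of the paper; the function names are hypothetical, and the suffix array is built by naive sorting rather than in linear time.

```python
def suffix_array(s):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[r] = length of the longest common prefix of the suffixes
    starting at sa[r - 1] and sa[r]; lcp[0] is defined as 0."""
    n = len(s)
    rank = [0] * n  # rank[i] = position of suffix i within sa
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa, lcp_array(text, sa))
```

Reusing `h` across iterations is what keeps the lcp computation linear once the suffix array is available.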


“…Previous data structures for this problem include the suffix tree [1], the compact directed acyclic word graph (compact DAWG) [2], and the suffix array [3]. The first two approaches take O(n) time to build the data structure, and O(m + k) time to find the k positions where the pattern string occurs.…”

Section: Introduction (mentioning)

confidence: 99%
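
The indexing problem described in the statement above can be sketched with the simplest of the three structures it names, the suffix array. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, the index is built by naive sorting, and a query takes O(m log n) comparisons plus the output size rather than the O(m + k) bound quoted for suffix trees and compact DAWGs.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text using sa."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(find_occurrences(text, sa, "ana"))
```

The two binary searches work because suffixes that share the pattern as a prefix occupy one contiguous block of the suffix array.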

Contracted Suffix Trees: A Simple and Dynamic Text Indexing Data Structure

Ehrenfeucht, McConnell, Woo. 2009. Combinatorial Pattern Matching


Abstract. We address the problem of finding the locations of all instances of a string P in a text T, where preprocessing of T is allowed to facilitate the queries. Previous data structures for this problem include the suffix tree, the suffix array, and the compact DAWG. We modify a data structure called a sequence tree, which was proposed by Coffman and Eve for hashing, and adapt it to the new problem. We can then produce a list of k occurrences of any string P in T in O(|P| + k) time. Because of properties shared by suffixes of a text that are not shared by arbitrary hash keys, we can build the structure in O(|T|) time, which is much faster than Coffman and Eve's algorithm. These bounds are as good as those for the suffix tree, suffix array, and the compact DAWG. The advantages are the elementary nature of some of the algorithms for constructing and using the data structure and the asymptotic bounds we can give for updating the data structure when the text is edited.


Reducing the space requirement of suffix trees

Kurtz. 1999. Softw: Pract. Exper.


We show that suffix trees store various kinds of redundant information. We exploit these redundancies to obtain more space efficient representations. The most space efficient of our representations requires 20 bytes per input character in the worst case, and 10.1 bytes per input character on average for a collection of 42 files of different type. This is an advantage of more than 8 bytes per input character over previous work. Our representations can be constructed without extra space, and as fast as previous representations. The asymptotic running times of suffix tree applications are retained.
