2012
DOI: 10.1145/2094072.2094073

Word-based self-indexes for natural language text

Abstract: The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text, which can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text, using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression …
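To make the trade-off in the abstract concrete, here is a minimal Python sketch (not from the paper) of how an inverted index answers a phrase query by positional intersection of posting lists. The `postings` map and its word positions are invented for illustration.

```python
# Hedged illustration of the phrase-search cost noted in the abstract:
# single-word lookups just fetch one posting list, but a phrase needs a
# positional intersection across all its words' lists.
def phrase_occurrences(postings, phrase):
    """postings: word -> sorted list of word positions in the text."""
    result = postings[phrase[0]]
    for offset, word in enumerate(phrase[1:], start=1):
        nxt = set(postings[word])
        # Keep start positions whose word at distance `offset` also matches.
        result = [p for p in result if p + offset in nxt]
    return result

postings = {"new": [0, 7, 12], "york": [8, 13], "city": [14]}
print(phrase_occurrences(postings, ["new", "york", "city"]))  # -> [12]
```

The per-word intersection work grows with phrase length; the word-based self-index introduced in the article aims to answer phrase searches without materializing these lists.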

Cited by 47 publications (51 citation statements). References 62 publications.
“…To accomplish this task, we first discard stop words (less significant words such as prepositions, articles, etc.) [4]. Then, we perform stemming (reduce words to roots) [4].…”
Section: B. Term Extraction and Classification (mentioning)
Confidence: 99%
“…[4]. Then, we perform stemming (reduce words to roots) [4]. At the end, each program will have a vector of terms where each position in this vector corresponds to the frequency of the term on the program textual description.…”
Section: B. Term Extraction and Classification (mentioning)
Confidence: 99%
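As a rough illustration of the quoted pipeline (discard stop words, stem, then build a term-frequency vector), here is a hedged Python sketch. The stop-word list and the naive suffix-stripping stemmer are placeholders, not the resources used in the citing paper.

```python
# Toy term-extraction pipeline: stop-word removal, stemming, and a
# term -> frequency vector for a program's textual description.
from collections import Counter
import re

# Illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "is", "are"}

def naive_stem(word: str) -> str:
    """Crude suffix stripping; a real system would use e.g. Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def term_vector(description: str) -> Counter:
    tokens = re.findall(r"[a-z]+", description.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)  # each remaining term mapped to its frequency

print(term_vector("Sorting programs sort records; the records are sorted."))
# -> Counter({'sort': 3, 'record': 2, 'program': 1})
```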
“…These indexing structures have attractive worst-case efficiency bounds when doing "grep-like" occurrence counting in text. Fariña et al. [2012] show how to extend these indexing structures to term-based alphabets. However, the basic self-indexing framework does not directly address the document listing problem, whereby a listing of the documents containing the search pattern in some frequency ordering is required.…”
Section: Related and Future Work (mentioning)
Confidence: 99%
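For concreteness, the document listing problem named in this excerpt can be stated as the following naive baseline: scan every document, report those containing the pattern, ordered by its frequency. The `docs` list and pattern are invented examples; the point of the cited line of work is to answer such queries from a compressed self-index instead of a scan.

```python
# Naive statement of the document listing problem: which documents
# contain the pattern, ordered by how often it occurs in each.
def document_listing(docs: list[str], pattern: str) -> list[tuple[int, int]]:
    hits = []
    for doc_id, text in enumerate(docs):
        freq = text.count(pattern)  # plain scan; self-indexes avoid this
        if freq:
            hits.append((doc_id, freq))
    hits.sort(key=lambda h: -h[1])  # frequency ordering
    return hits

docs = ["self index self index", "inverted index", "no match here"]
print(document_listing(docs, "index"))  # -> [(0, 2), (1, 1)]
```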
“…We consider a balanced wavelet tree with compressed bitmaps (Balanced-WT-RRR, achieving $nH_k(T) + o(n \log V)$ bits [16], as no pointers are used), a Huffman-shaped wavelet tree with plain bitmaps (HWT-PLAIN, achieving $n(H_0(T)+1)(1+o(1)) + O(V \log n)$ bits) and with compressed bitmaps (HWT-RRR, achieving $nH_k(T) + o(n(H_0(T)+1)) + O(V \log n)$ bits), a Hu-Tucker-shaped wavelet tree with plain bitmaps (HTWT-PLAIN, achieving $n(H_0(T)+2)(1+o(1)) + O(V \log n)$ bits) and with compressed bitmaps (HTWT-RRR, achieving $nH_k(T) + o(n(H_0(T)+1)) + O(V \log n)$ bits), and an "alphabet partitioned" representation [1] (A-partition, achieving $nH_0(T) + o(n(H_0(T)+1))$ bits). As a control value, we introduce in the comparison an existing FM-index for words: the WSSA [5], using zero space for samplings. To achieve different space/time trade-offs, we use samplings {32, 64, 128, 180} for bitmaps.…”
Section: Experimental Evaluation (mentioning)
Confidence: 99%
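Since this excerpt compares several wavelet-tree layouts, a toy balanced wavelet tree over integer word IDs may help: it shows the recursive alphabet partition and the rank query those structures share. Real implementations back each level with plain or RRR-compressed rank/select bitmaps; the Python lists and prefix sums below merely stand in for them.

```python
# Toy balanced wavelet tree over word identifiers, supporting
# rank(symbol, i): occurrences of `symbol` among the first i positions.
class WaveletTree:
    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.left = self.right = None   # leaf: one symbol (or empty)
            return
        mid = (lo + hi) // 2
        # One bit per position: 0 -> symbol routed left, 1 -> routed right.
        self.bits = [1 if s > mid else 0 for s in seq]
        # Prefix sums emulate the O(1) rank a real bitmap would provide.
        self.ranks = [0]
        for b in self.bits:
            self.ranks.append(self.ranks[-1] + b)
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def rank(self, symbol, i):
        if self.left is None:               # leaf: every position matches
            return i
        ones = self.ranks[i]
        if symbol > (self.lo + self.hi) // 2:
            return self.right.rank(symbol, ones)
        return self.left.rank(symbol, i - ones)

seq = [3, 1, 4, 1, 5, 1, 2]                 # e.g. word IDs of a text
wt = WaveletTree(seq)
print(wt.rank(1, 6))                        # -> 3 (ID 1 occurs thrice in seq[:6])
```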
“…Interestingly, self-indexes also offer improvements on natural language indexing [5]. The key idea is to regard the text collection as a sequence of words (and separators between words), so that pattern searches correspond to word and phrase searches over the text collection.…”
Section: Introduction (mentioning)
Confidence: 99%
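The "key idea" in this excerpt fits in a few lines: split the text into alternating words and separators, map them to integer IDs, and run phrase queries over the ID sequence. Here the search is a naive scan over invented example text; a word-based self-index performs the same word-level search in compressed space.

```python
# Word-based view of a text: IDs for words and separators, so a phrase
# query becomes exact matching over the ID sequence.
import re

def tokenize(text):
    # Alternating maximal runs of word and non-word characters.
    return re.findall(r"\w+|\W+", text)

text = "the rose is a rose is a rose"
vocab = {}
ids = [vocab.setdefault(tok, len(vocab)) for tok in tokenize(text)]

def phrase_search(ids, phrase_ids):
    m = len(phrase_ids)
    return [i for i in range(len(ids) - m + 1) if ids[i:i + m] == phrase_ids]

query = [vocab[t] for t in tokenize("is a rose")]
print(phrase_search(ids, query))  # -> [4, 10], token-level phrase matches
```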