2012
DOI: 10.1145/2094072.2094077

High-performance processing of text queries with tunable pruned term and term pair indexes

Abstract: Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This article introduces a joint framework for trading off index size and result quality, and provides…
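To make the trade-off concrete, here is a minimal sketch in Python of a pruned term-pair index. It is illustrative only, not the article's actual algorithm: the names (`build_indexes`, `min_pair_df`) and the simple document-frequency pruning rule are assumptions. The threshold acts as the tuning knob between index size and the fraction of pair queries served from precomputed lists.

```python
from collections import defaultdict
from itertools import combinations

def build_indexes(docs, min_pair_df=2):
    """Build a term-only index plus a pruned term-pair index.

    docs: {doc_id: text}. Pairs co-occurring in fewer than
    `min_pair_df` documents are pruned to save space (assumed
    pruning criterion, for illustration only).
    """
    term_index = defaultdict(set)   # term -> set of doc ids
    pair_index = defaultdict(set)   # (t1, t2) -> set of doc ids

    for doc_id, text in docs.items():
        terms = sorted(set(text.lower().split()))
        for t in terms:
            term_index[t].add(doc_id)
        for pair in combinations(terms, 2):
            pair_index[pair].add(doc_id)

    # Prune rare pairs: their lists cost space but rarely help queries.
    pair_index = {p: ds for p, ds in pair_index.items()
                  if len(ds) >= min_pair_df}
    return term_index, pair_index

def docs_for_pair(t1, t2, term_index, pair_index):
    """Serve a pair query from the pruned index if possible."""
    pair = tuple(sorted((t1, t2)))
    if pair in pair_index:          # fast path: precomputed list
        return pair_index[pair]
    # fallback: intersect single-term lists at query time
    return term_index.get(t1, set()) & term_index.get(t2, set())
```

Raising `min_pair_df` shrinks the pair index but forces more queries onto the slower intersection fallback, which is the size/quality-of-service trade-off the abstract describes.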

Cited by 14 publications (8 citation statements: 0 supporting, 8 mentioning, 0 contrasting). References 41 publications.

Citation statements (ordered by relevance):
“…The nextword index (NW) stores absolute positions for all 473,366,430 bi-grams in GOV2. While there are techniques that only partially store lists [3,13,19], we measure the exhaustive case in which all bi-grams are indexed. The position lists of NW are stored using UEF codes, and require 55 GiB, still less than the three suffix-based indexes.…”
Section: Methods
mentioning confidence: 99%
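The nextword index mentioned in this statement stores, for each adjacent word pair, the absolute positions of its occurrences. A minimal sketch (illustrative, not the evaluated system; compression such as the UEF coding it mentions is omitted):

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Nextword-style bi-gram index: (w1, w2) -> [(doc_id, position), ...].

    Stores the absolute position of the first word of every adjacent
    pair, so phrase lookups need no scan over single-term lists.
    """
    nw = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos in range(len(words) - 1):
            nw[(words[pos], words[pos + 1])].append((doc_id, pos))
    return nw

# A phrase query "a b c" intersects the (a, b) and (b, c) lists,
# requiring positions offset by exactly one within the same document.
```

Indexing every bi-gram exhaustively, as the cited measurement does, is what drives the large (here 55 GiB) position-list footprint.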
“…Instead of performing list intersection at query time, the final set of ⟨i, f_{i,P}⟩ pairs can be stored in the index and accessed when needed by queries. Storage limits mean that precomputing postings lists for all phrases is impossible, and techniques have been explored to choose lists to be computed, including analyzing query logs [3,19] and using collection statistics [13]. Indexing only a subset of the phrases implies that either other ways of creating lists at query time must be provided too, or that retrieval effectiveness must be sacrificed.…”
Section: Phrase Indexing Schemes
mentioning confidence: 99%
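A sketch of the idea in this statement: precompute ⟨i, f_{i,P}⟩ postings (document id and phrase frequency) only for phrases selected in advance, here by query-log frequency. The selection policy and all function names are assumptions for illustration, not the cited systems' interfaces.

```python
from collections import Counter, defaultdict

def select_phrases(query_log, budget):
    """Keep the `budget` most frequently queried phrases
    (one possible selection policy; [13] uses collection
    statistics instead)."""
    return {p for p, _ in Counter(query_log).most_common(budget)}

def precompute_phrase_postings(docs, phrases):
    """phrase -> [(doc_id, phrase frequency f_{i,P}), ...]"""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()
        counts = Counter(
            " ".join(words[i:i + 2]) for i in range(len(words) - 1)
        )
        for phrase in phrases:
            if counts[phrase]:
                postings[phrase].append((doc_id, counts[phrase]))
    return postings
```

Queries for a selected phrase read its list directly; anything outside the budget falls back to query-time list intersection, or effectiveness is sacrificed, exactly the dichotomy the statement describes.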
“…Existing approaches to improving efficiency make use of term pair co-occurrence indexes, and employ early termination in order to balance space costs and query time [3,19]. But unless the index can be very large, use of proximity-based metrics beyond co-occurrence usually requires on-the-fly computation of proximity scores, and possibly high retrieval times in even moderately sized collections.…”
Section: Introduction
mentioning confidence: 99%
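When no pair index applies, proximity scores must be computed on the fly from per-term position lists. The sketch below uses a generic 1/d² pair-distance score, an assumption for illustration rather than the specific metric of [3,19]; it shows why this is costly: every candidate document requires a two-pointer scan per query-term pair.

```python
def min_pair_distance(pos_a, pos_b):
    """Smallest absolute distance between two sorted position lists
    (classic two-pointer merge scan)."""
    i = j = 0
    best = float("inf")
    while i < len(pos_a) and j < len(pos_b):
        best = min(best, abs(pos_a[i] - pos_b[j]))
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best

def proximity_score(positions):
    """positions: {term: sorted positions in this document}.
    Sums 1/d^2 over the closest occurrence of each term pair;
    quadratic in the number of query terms."""
    terms = list(positions)
    score = 0.0
    for x in range(len(terms)):
        for y in range(x + 1, len(terms)):
            d = min_pair_distance(positions[terms[x]], positions[terms[y]])
            if d < float("inf"):
                score += 1.0 / (d * d)
    return score
```

Doing this for every candidate in a large collection is what drives the high retrieval times the statement warns about, and what early termination over pair indexes tries to avoid.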
“…However, calculating proximity for all terms is computationally expensive. To address that issue, several recent studies have looked at the trade-offs possible through term pair co-occurrence indexing, or other similar means [3,5,7,14]. In contrast to viewing query terms separately, Song et al. [15] group query terms into non-overlapping phrases, each referred to as a span.…”
Section: Introduction
mentioning confidence: 99%
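A simplified sketch of span-style grouping in the spirit of Song et al. [15], with the details heavily reduced: query-term occurrences in a document are greedily merged into non-overlapping spans when they lie within a small gap, and each span is then scored as a unit. The `max_gap` parameter and the greedy rule are assumptions, not the paper's exact procedure.

```python
def group_into_spans(match_positions, max_gap=2):
    """match_positions: sorted positions where any query term occurs.
    Returns non-overlapping (start, end) spans; occurrences more than
    `max_gap` apart start a new span."""
    spans, start, prev = [], None, None
    for pos in match_positions:
        if start is None:
            start = prev = pos
        elif pos - prev <= max_gap:
            prev = pos                  # extend the current span
        else:
            spans.append((start, prev)) # close span, open a new one
            start = prev = pos
    if start is not None:
        spans.append((start, prev))
    return spans

# e.g. group_into_spans([1, 2, 5, 9, 10]) -> [(1, 2), (5, 5), (9, 10)]
```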
“…Formalisms such as Coherence Relations (Hobbs, 1990), Discourse Representation Theory (DRT) (Kamp, 1984; Kamp and Reyle, 1993; Bos, 2008), Segmented Discourse Representation Theory (SDRT) (Asher and Lascarides, 2003), and Rhetorical Structure Theory (RST) (Mann and Thompson, 1988; Marcu et al., 1999; Marcu, 2000) are relevant to the segmentation of narrative discourse, but they illuminate other aspects of structure than the ones I am focused on here. As for the discourse-level theories of story grammars, e.g., Rumelhart (1977) and van Dijk (1979), these are certainly relevant to plot and will be discussed in Chapter 4. DRT is concerned primarily with reference; while reference is discussed in this chapter, the details of the semantic representations used in DRT are outside the scope of 1.1.…”
Section: Introduction
mentioning confidence: 99%