Static index pruning for information retrieval systems

Carmel, David; Cohen, Doron; Fagin, Ronald; Farchi, Eitan; Herscovici, Michael; Maarek, Yoelle; Soffer, Aya

doi:10.1145/383952.383958

Cited by 162 publications

(204 citation statements)

References 10 publications

Supporting

Mentioning: 201

Contrasting


“…In one of the earliest works in this field, Carmel et al proposed term-centric approaches with uniform and adaptive versions [5]. Roughly, adaptive top-k algorithm sorts the posting list of each term according to some scoring function (Smart's TF-IDF in [5]) and removes those postings that have scores under a threshold determined for that particular term.…”

Section: Static Pruning Strategies For Inverted Indexes (mentioning)

confidence: 99%

“…Roughly, adaptive top-k algorithm sorts the posting list of each term according to some scoring function (Smart's TF-IDF in [5]) and removes those postings that have scores under a threshold determined for that particular term. The algorithm is reported to provide substantial pruning of the index and exhibit excellent performance at keeping the top-ranked results intact in comparison to the original index.…”

Section: Static Pruning Strategies For Inverted Indexes (mentioning)

confidence: 99%
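The term-centric procedure described in the two statements above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes the per-term threshold is a fraction `epsilon` of the term's k-th highest posting score, and the names `prune_term_centric`, `score`, `k`, and `epsilon` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Posting = Tuple[int, int]  # (doc_id, term_frequency); an assumed layout


def prune_term_centric(
    index: Dict[str, List[Posting]],
    score: Callable[[str, Posting], float],
    k: int = 10,
    epsilon: float = 0.5,
) -> Dict[str, List[Posting]]:
    """Return a pruned copy of `index`, keeping high-scoring postings per term."""
    pruned: Dict[str, List[Posting]] = {}
    for term, postings in index.items():
        scores = [score(term, p) for p in postings]
        if len(scores) < k:
            # Assumption: lists with fewer than k postings are kept whole.
            pruned[term] = list(postings)
            continue
        # Threshold specific to this term: a fraction of its k-th best score.
        kth_best = sorted(scores, reverse=True)[k - 1]
        threshold = epsilon * kth_best
        # Filter in the original posting order so doc-id ordering survives.
        pruned[term] = [p for p, s in zip(postings, scores) if s >= threshold]
    return pruned


# Toy usage: score each posting by its raw term frequency.
toy_index = {"pruning": [(1, 5), (2, 1), (3, 4), (4, 2)], "rare": [(7, 1)]}
toy_pruned = prune_term_centric(
    toy_index, lambda term, p: float(p[1]), k=2, epsilon=0.75
)
```

Because the threshold depends on each term's own score distribution, a frequent term can lose many postings while a rare term loses none.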

“…Thus, instead of a crude mechanism, for each element, the decision for indexing the terms from the element's descendants should be given adaptively, considering the element's textual content and search system's ranking function. To this end, we employ two major static index pruning techniques, namely term-centric pruning (TCP) [5] and document-centric pruning (DCP) [4] for indexing the XML collections. Below, we outline these strategies as used in our study.…”

Section: Pruning the Element-index For Xml Retrieval (mentioning)

confidence: 99%

“…For the purposes of index pruning, we apply two major methods from the IR literature, namely, term-centric [5] and document-centric pruning [4] to prune the full element-index. We evaluate the performance for various retrieval tasks as described in the latest INEX campaigns.…”

Section: Introduction (mentioning)

confidence: 99%


XML Retrieval Using Pruned Element-Index Files

Altıngövde

Atilgan

Ulusoy

2010

Lecture Notes in Computer Science


Abstract. An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index that indexes each element along with the content of its descendants involves high redundancy and reduces query processing efficiency. A direct index, on the other hand, only indexes the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly at the cost of some reduction in system effectiveness. In this paper, we propose using static index pruning techniques for obtaining more compact index files that can still result in retrieval performance comparable to that of a full index. We also compare the retrieval performance of these pruning-based approaches to some other strategies that make use of a direct element-index. Our experiments, conducted along the lines of the INEX evaluation framework, reveal that pruned index files yield retrieval performance comparable to or even better than that of the full index and direct index, for several tasks in the ad hoc track.


“…Next, we apply term-centric pruning at different pruning levels, and once the pruned index files are obtained, we convert them to the document vectors to be given to the clustering algorithm. In a nutshell, the term-centric pruning strategy works as follows [8]. For each term t, the postings in t's posting list are sorted according to their score with respect to a ranking function, which is BM25 in our case.…”

Section: Employing Pruning Strategies For Clustering (mentioning)

confidence: 99%
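The statement above names BM25 as the function used to score each posting before sorting. A minimal sketch of one common BM25 formulation follows. The exact variant and parameter values used by the citing paper are not given here, so the smoothed IDF form, the defaults `k1=1.2` and `b=0.75`, and the function name `bm25_posting_score` are all assumptions.

```python
import math


def bm25_posting_score(
    tf: int,
    doc_len: float,
    avg_doc_len: float,
    num_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one (term, document) posting with a common BM25 formulation."""
    # Smoothed inverse document frequency; always positive in this variant.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```

A function of this shape could be passed as the scoring step of a term-centric pruner once collection statistics (document lengths, document frequencies) are available.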

Exploiting Index Pruning Methods for Clustering XML Collections

Altıngövde

Atilgan

Ulusoy

2010

Focused Retrieval and Evaluation


Abstract. In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that, in certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure of the same quality as that of the original collection, in terms of a set of evaluation metrics.


Metadata harvesting for content‐based distributed information retrieval

Simeoni

Yakici

Neely

et al.

2007

J. Am. Soc. Inf. Sci.


We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralization of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archive Initiative's (OAI) Protocol for metadata harvesting, the approach occupies middle ground between content crawling and distributed retrieval. As in crawling, some data move toward the retrieval process, but it is statistics about the content rather than content itself; this grants more efficient use of network resources and wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision while promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralized retrieval without renouncing cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies to deploy it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol which supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multimodel content-based retrieval of distributed file collections.

Introduction

Our interest is in content-based retrieval of widely dispersed and autonomously managed document sources.¹ This is the central problem of Distributed Information Retrieval (DIR), and over the past 10 years, it has been mainly approached by distributing the retrieval process along with the data: Queries have been "pushed" toward the content, and the results of their local execution have been centrally gathered and presented to the user (cf. Callan, 2000a).

Traditionally, distributed retrieval services have relied on simple client/server architectures in which brokers route queries submitted by local or remote clients toward a number of mutually autonomous and potentially uncooperative retrieval engines. Figure 1 shows how client/server distributed retrieval works. A Search Broker B interfaces Clients C and dispatches their Queries Q to a number of autonomous search engines S_1, S_2, ..., S_n, each of which executes it against an Index FT_i of some Content C_i before returning Results R_i back to B, which merges them and relays them to C. Optionally, B optimizes query distribution by selecting a subset of the engines based on previously gathered descriptions of their content. Based on summary descriptions of the content served by each engine, advanced techniques of source selection and data fusion have been produced to, respectively, minimize network…

¹ In the absence of a well-established terminology, we use the term content-based to characterize retrieval processes defined over indices of essentially unstructured documents. Content-based retrieval lies at one...
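The client/server flow described in the introduction above (a broker B dispatching a query to engines, optionally to a selected subset, then merging their results) can be sketched as follows. This is an illustrative toy, not the authors' system: engines are modeled as plain callables, scores from different engines are assumed to be directly comparable (real data fusion usually has to normalize them), and `broker_search`, `selected`, and `top_n` are hypothetical names.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Result = Tuple[str, float]             # (document id, relevance score)
Engine = Callable[[str], List[Result]]  # stand-in for a remote search engine


def broker_search(
    query: str,
    engines: Dict[str, Engine],
    selected: Optional[Iterable[str]] = None,
    top_n: int = 10,
) -> List[Result]:
    """Dispatch `query` to engines and merge their results by score."""
    # Source selection: query only a subset if one is given, else all engines.
    names = list(selected) if selected is not None else list(engines)
    merged: List[Result] = []
    for name in names:
        merged.extend(engines[name](query))  # each engine runs the query locally
    # Naive data fusion: take the globally highest-scoring results.
    return heapq.nlargest(top_n, merged, key=lambda result: result[1])


# Toy engines returning fixed result lists.
toy_engines: Dict[str, Engine] = {
    "s1": lambda q: [("a", 0.9), ("b", 0.2)],
    "s2": lambda q: [("c", 0.5)],
}
all_results = broker_search("index pruning", toy_engines, top_n=2)
subset_results = broker_search("index pruning", toy_engines, selected=["s2"])
```

The harvesting approach proposed in the abstract moves the indexing step, rather than this query-time fan-out, to the content providers.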
