Query-driven document partitioning and collection selection

Puppin, Diego; Silvestri, Fabrizio; Laforenza, Domenico

doi:10.1145/1146847.1146881

Cited by 41 publications

(46 citation statements)

References 41 publications

“…Puppin et al [15] used query logs to organize document collection into multiple shards. The query log covered a period of time when exhaustive search was used for each query.…”

Section: Document Allocation (mentioning)

confidence: 99%

Document allocation policies for selective searching of distributed indexes

Kulkarni

Callan

2010

Proceedings of the 19th ACM International Conference on Information and Knowledge Management

Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources, the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search. A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.

“…Various approximations of relevance have been studied in P2PIR: assuming documents containing all query keywords to be relevant [2], using "approximate descriptions of relevant material" [1] or comparing results of distributed algorithms to results of a centralised system [6,4,9,8].…”

Section: Related Work (mentioning)

confidence: 99%

“…those with score > 0) relevant [6], resulting in what is sometimes called relative recall (RR), or just the N most highly ranked documents [4,9,8]. In the latter case, precision at k documents is used as an evaluation measure; we will call it P_N@k in the rest of this work, denoting its dependence on N.…”

Section: Related Work (mentioning)

confidence: 99%
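
The measure described in the statement above can be illustrated with a minimal sketch. It assumes that the "relevant" set is simply the top N documents returned by a centralised system, and that precision is computed over the first k results of the distributed system. The function name `precision_n_at_k` and the sample document IDs are hypothetical and are not taken from the cited papers.

```python
def precision_n_at_k(distributed, centralized, n, k):
    """Fraction of the first k distributed results that also appear in the
    centralised system's top-n list (assumed here to be the relevant set)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(centralized[:n])  # top-n centralised results act as ground truth
    hits = sum(1 for doc in distributed[:k] if doc in relevant)
    return hits / k


# Hypothetical rankings, for illustration only.
centralized = ["d1", "d2", "d3", "d4", "d5"]
distributed = ["d2", "d9", "d1", "d7"]

score = precision_n_at_k(distributed, centralized, n=3, k=4)
print(score)
```

Because the score only checks membership in the top-N set, it ignores where a document sits within that set in the centralised ranking.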

“…In [8], N is chosen equal to k; in [4,9], values of 50 and 100 are used without further justification. Besides the problem of choosing N, this set-based approach also neglects the ranking of the centralised system within the first N documents.…”

Section: Average Ranked Relative Recall (mentioning)

confidence: 99%

“…via author information [2], built-in categories [1] or domains of web pages [4], or it is established in less natural ways via clustering [6] or even randomly [5]. Since generally these collections lack queries and relevance judgments, queries are either constructed from the documents [4,2,1] or taken from query logs matching the collection [6,8].…”

Section: Introduction (mentioning)

confidence: 99%
