Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2006
DOI: 10.1145/1148170.1148227

Capturing collection size for distributed non-cooperative retrieval

Abstract: Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. …

Cited by 48 publications
(69 citation statements)
References 24 publications
“…Data Analytics over Hidden Databases: There has been prior work on crawling, sampling, and aggregate estimation over the hidden web, specifically over text [3,4] and structured [16] hidden databases and search engines [15,18,2]. Specifically, sampling-based methods were used for generating content summaries [6,14,11], processing top-k queries [5], etc.…”
Section: Related Work (mentioning)
confidence: 99%
“…This process is repeated and the estimation is based on the size of the number of documents collected in the two samples. Shokouhi et al [18] explores the problem of resource selection in non-cooperative environments and propose a query-based sampling QBS method based on the capture-recapture method [19] referred as multiple capture-recapture. They showed with a large set of heterogeneous collections that their method outperforms previous approaches using search engines in non-cooperative environments.…”
Section: Vertical Size Estimation (mentioning)
confidence: 99%
“…The sample was constructed by choosing randomly 5 queries from the set of 2K queries (with replacement) and collecting the top 10 results for each query. Similar parameter values have been used in previous studies [18]. It is important to mention that the set of 90 queries employed to gather user assessments were not employed in the samples generated for the size estimation process to avoid bias in the evaluation of vertical selection methods.…”
Section: Vertical Size Estimation (mentioning)
confidence: 99%
“…Capture-recapture methods (Shokouhi et al 2006) rely on using the overlap between datasets in order to estimate the overall properties of the sampled set. However, in our case, we are sampling truly interacting and false-positive protein pairs from the possible set of distinct protein pairs: i.e.…”
Section: Experimental Protein Interaction Data (mentioning)
confidence: 99%
“…the number of items recaptured) is used to estimate the complete population's size. Multiple capture-recapture (Shokouhi et al 2006) is an extension of this approach to account for any number of samples. This has been used to estimate the size of different populations by observing the overlap between different samples (Xu et al 2007).…”
Section: Single Urn Models For Protein Interaction Data (mentioning)
confidence: 99%
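
The capture-recapture idea that several of the statements above describe can be illustrated for the simplest two-sample case. The sketch below uses the classical Lincoln-Petersen estimator, which estimates population size from the overlap between two samples. It is not the multiple capture-recapture method of Shokouhi et al.; the function name and the sample document IDs are hypothetical, chosen only for illustration.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two samples of item identifiers.

    Classical two-sample estimator: N ~ (n1 * n2) / m, where n1 and n2
    are the numbers of distinct items in each sample and m is the number
    of items that appear in both (the "recaptured" items).
    """
    first = set(first_sample)
    second = set(second_sample)
    recaptured = len(first & second)
    if recaptured == 0:
        # With no overlap the estimate is undefined (division by zero).
        raise ValueError("no overlap between samples; size cannot be estimated")
    return len(first) * len(second) / recaptured


# Hypothetical document IDs returned by two independent sampling rounds.
sample_a = range(0, 100)    # 100 distinct documents
sample_b = range(50, 150)   # 100 distinct documents, 50 shared with sample_a
estimate = lincoln_petersen(sample_a, sample_b)
print(estimate)
```

The smaller the overlap relative to the sample sizes, the larger the estimated collection. The estimate is only as good as the assumption that both samples are drawn uniformly at random, which query-based samples from a search interface generally do not satisfy.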