Abstract: We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approa…
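The abstract leaves the estimator itself to the body of the paper; the sketch below shows one natural construction for this setting, stated with hypothetical helpers. Here pool is the query pool, run_query(q) stands for the documents the interface returns for query q, and multiplicity(d) for the number of pool queries that document d matches; all of these names are illustrative assumptions, not the paper's API. For a uniformly sampled query q, the quantity |pool| * sum over matching documents of 1/multiplicity(d) is unbiased for the set size, because each document contributes exactly 1 in expectation across the pool.

    import random

    def estimate_set_size(pool, run_query, multiplicity, num_samples=1000):
        # For a uniformly drawn query q, |pool| * sum_{d in run_query(q)} 1/multiplicity(d)
        # has expectation |D|: summed over all q in the pool, each document d is
        # counted multiplicity(d) times with weight 1/multiplicity(d), i.e. exactly once.
        total = 0.0
        for _ in range(num_samples):
            q = random.choice(pool)              # uniform sample from the query pool
            for d in run_query(q):               # documents matching q
                total += 1.0 / multiplicity(d)   # inverse-multiplicity weight
        return len(pool) * total / num_samples

Averaging over num_samples queries keeps the estimator unbiased while shrinking its variance, which is the low-variance property the abstract emphasizes.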
“…For example, in [25] the authors develop an algorithm to estimate the size of any set of documents defined by certain conditions based on previously executed queries. Whereas [26] describes an algorithm to estimate the corpus size for a meta-search engine in order to better direct queries to search engines.…”
Abstract-Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental issue is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence, the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.
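The abstract does not name its statistical tools, but reasoning about the completeness of a crowdsourced answer set is commonly done with species-richness estimators from ecology, which predict how many distinct items remain unseen from the pattern of duplicate answers. Below is a minimal sketch of the classic Chao1 estimator, offered as an illustration of the idea rather than as the paper's actual method; the example answer list is invented.

    from collections import Counter

    def chao1_estimate(answers):
        # Chao1 species-richness estimate of the total number of distinct
        # items, from a multiset of crowd answers:
        #   c  = distinct items observed
        #   f1 = items seen exactly once, f2 = items seen exactly twice
        counts = Counter(answers)
        c = len(counts)
        f1 = sum(1 for v in counts.values() if v == 1)
        f2 = sum(1 for v in counts.values() if v == 2)
        if f2 > 0:
            return c + f1 * f1 / (2.0 * f2)
        return c + f1 * (f1 - 1) / 2.0   # fallback form when no doubletons exist

    # 8 distinct states in 12 answers, 5 of them seen only once: the high
    # singleton count signals the crowd has not yet enumerated the full set.
    print(chao1_estimate(["NY", "CA", "TX", "NY", "WA", "FL",
                          "CA", "OR", "NV", "TX", "NY", "AZ"]))  # ~14.25

A query whose estimate stays well above the observed distinct count is far from complete, which is exactly the progress signal that non-uniform crowd arrival patterns otherwise obscure.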
“…The average degree of the graph is 2. The sample degrees taken by RN, RE, and RW sampling methods are (1, 1, 1, 1, 2, 8), (1, 8, 1, 8, 2, 4), and (4, 3, 8, 1, 8, 1), respectively. The estimations for RN, RE, and RW samples are:…”
Section: RN, RE, and RW Sampling
“…Regardless of the causes, a common challenge is to reveal the properties of such datasets when we do not own the entire data. In the past, extensive research was carried out to explore the profile of search engines [17] and other data collections [4,6,33]. Most of them focused on obtaining uniform random node (RN) samples, such as uniform random web pages from the Web [11] and search engines [2], and uniform random bloggers from online social networks [9].…”
The norm of practice in estimating graph properties is to use uniform random node (RN) samples whenever possible. Many graphs are large and scale-free, inducing large degree variance and estimator variance. This paper shows that random edge (RE) sampling and the corresponding harmonic mean estimator for average degree can reduce the estimation variance significantly. First, we demonstrate that the degree variance, and consequently the variance of the RN estimator, can grow almost linearly with data size for typical scale-free graphs. Then we prove that the RE estimator has a variance bounded from above. Therefore, the variance ratio between RN and RE sampling can be very large for big data. The analytical result is supported by both simulation studies and 18 real networks. We observe that the variance reduction ratio can be more than a hundred for some real networks, such as Twitter. Furthermore, we show that random walk (RW) sampling is always worse than RE sampling, and it can reduce the variance of the RN method only when its performance is close to that of RE sampling.
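The three estimators can be made concrete in a few lines. An RN sample observes nodes uniformly, so the plain sample mean estimates the average degree; RE and RW samples observe nodes with probability proportional to degree (for RW, under the usual assumption that the walk has reached its degree-proportional stationary distribution), so the harmonic mean of the observed degrees is the appropriate correction. The sketch below applies these standard formulas to the sample degrees quoted earlier; it is an illustration, not code from the paper.

    def rn_estimate(degrees):
        # RN: nodes sampled uniformly, so the sample mean is unbiased for
        # the average degree (but inherits the full degree variance).
        return sum(degrees) / len(degrees)

    def harmonic_estimate(degrees):
        # RE/RW: nodes observed proportionally to degree, so
        # E[1/deg] = n/(2m) = 1/(average degree); invert the mean of 1/deg.
        return len(degrees) / sum(1.0 / d for d in degrees)

    print(rn_estimate([1, 1, 1, 1, 2, 8]))         # RN sample: 14/6 ~ 2.33
    print(harmonic_estimate([1, 8, 1, 8, 2, 4]))   # RE sample: exactly 2.0
    print(harmonic_estimate([4, 3, 8, 1, 8, 1]))   # RW sample: ~2.12

On this toy graph with true average degree 2, the harmonic-mean estimator on the RE sample recovers 2 exactly, while the RN mean is pulled upward by the single degree-8 node, the high-degree sensitivity that makes the RN estimator's variance grow with data size on scale-free graphs.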
“…By applying our techniques continuously, an advertiser can track the popularity of her keywords over time. Search engine evaluation and ImpressionRank sampling.…”
Section: Google Toolbar
“…The limited access to search engines' indices via their public interfaces makes the problem of evaluating the quality of these indices very challenging. Previous work [5,3,6,4] has focused on generating random uniform sample pages from the index. The samples have then been used to estimate index quality metrics, like index size and index freshness.…”
Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting and useful information, but they are typically not directly accessible. In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine's query log, estimation of the volume of commercially oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content. Our algorithms employ Monte Carlo methods in order to obtain unbiased samples from the suggestion database. Empirical analysis using a publicly available query log demonstrates that our algorithms are efficient and accurate. Results of experiments on two major suggestion services are also provided.
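As a sketch of how a Monte Carlo correction can turn a biased crawl of a suggest interface into uniform samples (an illustration of the flavor of such algorithms, not the paper's exact procedure): descend the prefix tree by random characters until the interface returns fewer results than its cap, at which point the whole subtree is visible; record the probability q(s) of the path taken; then use rejection to equalize. The function suggest(prefix), the cap K, and the lower bound q_min are all hypothetical assumptions here.

    import random
    import string

    K = 10                        # assumed cap on results per suggest() call
    ALPHABET = string.ascii_lowercase

    def sample_with_prob(suggest):
        # One random descent of the prefix trie exposed by suggest().
        # Returns a suggestion s together with the probability q(s)
        # of this procedure drawing it.
        prefix, q = "", 1.0
        while True:
            results = suggest(prefix)
            if not results:
                return None, None              # dead branch; caller retries
            if len(results) < K:               # below the cap: subtree fully exposed
                return random.choice(results), q / len(results)
            prefix += random.choice(ALPHABET)  # descend one random character
            q /= len(ALPHABET)

    def uniform_sample(suggest, q_min, max_tries=100000):
        # Monte Carlo rejection: accept a draw with probability q_min / q(s),
        # where q_min lower-bounds q over the database. Accepted draws are then
        # uniform, since P(accept s) is proportional to q(s) * q_min/q(s) = q_min.
        for _ in range(max_tries):
            s, q = sample_with_prob(suggest)
            if s is not None and random.random() < q_min / q:
                return s
        raise RuntimeError("no draw accepted; q_min may be set too low")

Popularity-proportional sampling follows the same pattern with a different acceptance weight. The practical cost is the probe budget: rare, deeply nested suggestions make q_min small and rejections frequent, which is why the efficiency results reported in the abstract matter.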