Abstract: We consider the problem of estimating the size of a collection of documents using only a standard query interface. Our main idea is to construct an unbiased and low-variance estimator that can closely approximate the size of any set of documents defined by certain conditions, including that each document in the set must match at least one query from a uniformly sampleable query pool of known size, fixed in advance. Using this basic estimator, we propose two approaches to estimating corpus size. The first approa…
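The abstract leaves the estimator itself to the body of the paper; the sketch below shows one natural construction for this setting, stated with hypothetical helpers. Here pool is the query pool, run_query(q) stands for the documents the interface returns for query q, and multiplicity(d) for the number of pool queries that document d matches; all of these names are illustrative assumptions, not the paper's API. For a uniformly sampled query q, the quantity |pool| * sum over matching documents of 1/multiplicity(d) is unbiased for the set size, because each document contributes exactly 1 in expectation across the pool.

    import random

    def estimate_set_size(pool, run_query, multiplicity, num_samples=1000):
        # For a uniformly drawn query q, |pool| * sum_{d in run_query(q)} 1/multiplicity(d)
        # has expectation |D|: summed over all q in the pool, each document d is
        # counted multiplicity(d) times with weight 1/multiplicity(d), i.e. exactly once.
        total = 0.0
        for _ in range(num_samples):
            q = random.choice(pool)              # uniform sample from the query pool
            for d in run_query(q):               # documents matching q
                total += 1.0 / multiplicity(d)   # inverse-multiplicity weight
        return len(pool) * total / num_samples

Averaging over num_samples queries keeps the estimator unbiased while shrinking its variance, which is the low-variance property the abstract emphasizes.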
“…For example, in [25] the authors develop an algorithm to estimate the size of any set of documents defined by certain conditions based on previously executed queries. Whereas [26] describes an algorithm to estimate the corpus size for a meta-search engine in order to better direct queries to search engines.…”
Abstract-Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental issue is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence, the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.
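The abstract does not name its statistical tools, but reasoning about the completeness of a crowdsourced answer set is commonly done with species-richness estimators from ecology, which predict how many distinct items remain unseen from the pattern of duplicate answers. Below is a minimal sketch of the classic Chao1 estimator, offered as an illustration of the idea rather than as the paper's actual method; the example answer list is invented.

    from collections import Counter

    def chao1_estimate(answers):
        # Chao1 species-richness estimate of the total number of distinct
        # items, from a multiset of crowd answers:
        #   c  = distinct items observed
        #   f1 = items seen exactly once, f2 = items seen exactly twice
        counts = Counter(answers)
        c = len(counts)
        f1 = sum(1 for v in counts.values() if v == 1)
        f2 = sum(1 for v in counts.values() if v == 2)
        if f2 > 0:
            return c + f1 * f1 / (2.0 * f2)
        return c + f1 * (f1 - 1) / 2.0   # fallback form when no doubletons exist

    # 8 distinct states in 12 answers, 5 of them seen only once: the high
    # singleton count signals the crowd has not yet enumerated the full set.
    print(chao1_estimate(["NY", "CA", "TX", "NY", "WA", "FL",
                          "CA", "OR", "NV", "TX", "NY", "AZ"]))  # ~14.25

A query whose estimate stays well above the observed distinct count is far from complete, which is exactly the progress signal that non-uniform crowd arrival patterns otherwise obscure.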
“…The average degree of the graph is 2. The sample degrees taken by RN, RE, and RW sampling methods are (1, 1, 1, 1, 2, 8), (1, 8, 1, 8, 2, 4), and (4, 3, 8, 1, 8, 1), respectively. The estimations for RN, RE, and RW samples are:…”
Section: RN, RE, and RW Sampling
“…Regardless of the causes, a common challenge is to reveal the properties of such datasets when we do not own the entire data. In the past, extensive research was carried out to explore the profile of search engines [17] and other data collections [4,6,33]. Most of them focused on obtaining uniform random node (RN) samples, such as uniform random web pages from the Web [11] and search engines [2], and uniform random bloggers from online social networks [9].…”
The norm of practice in estimating graph properties is to use uniform random node (RN) samples whenever possible. Many graphs are large and scale-free, inducing large degree variance and estimator variance. This paper shows that random edge (RE) sampling and the corresponding harmonic mean estimator for average degree can reduce the estimation variance significantly. First, we demonstrate that the degree variance, and consequently the variance of the RN estimator, can grow almost linearly with data size for typical scale-free graphs. Then we prove that the RE estimator has a variance bounded from above. Therefore, the variance ratio between RN and RE sampling can be very large for big data. The analytical result is supported by both simulation studies and 18 real networks. We observe that the variance reduction ratio can be more than a hundred for some real networks, such as Twitter. Furthermore, we show that random walk (RW) sampling is always worse than RE sampling, and it can reduce the variance of the RN method only when its performance is close to that of RE sampling.
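The three estimators can be made concrete in a few lines. An RN sample observes nodes uniformly, so the plain sample mean estimates the average degree; RE and RW samples observe nodes with probability proportional to degree (for RW, under the usual assumption that the walk has reached its degree-proportional stationary distribution), so the harmonic mean of the observed degrees is the appropriate correction. The sketch below applies these standard formulas to the sample degrees quoted earlier; it is an illustration, not code from the paper.

    def rn_estimate(degrees):
        # RN: nodes sampled uniformly, so the sample mean is unbiased for
        # the average degree (but inherits the full degree variance).
        return sum(degrees) / len(degrees)

    def harmonic_estimate(degrees):
        # RE/RW: nodes observed proportionally to degree, so
        # E[1/deg] = n/(2m) = 1/(average degree); invert the mean of 1/deg.
        return len(degrees) / sum(1.0 / d for d in degrees)

    print(rn_estimate([1, 1, 1, 1, 2, 8]))         # RN sample: 14/6 ~ 2.33
    print(harmonic_estimate([1, 8, 1, 8, 2, 4]))   # RE sample: exactly 2.0
    print(harmonic_estimate([4, 3, 8, 1, 8, 1]))   # RW sample: ~2.12

On this toy graph with true average degree 2, the harmonic-mean estimator on the RE sample recovers 2 exactly, while the RN mean is pulled upward by the single degree-8 node, the high-degree sensitivity that makes the RN estimator's variance grow with data size on scale-free graphs.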
“…By applying our techniques continuously, an advertiser can track the popularity of her keywords over time. Search engine evaluation and ImpressionRank sampling.…”
Section: Google Toolbar
“…The limited access to search engines' indices via their public interfaces makes the problem of evaluating the quality of these indices very challenging. Previous work [5,3,6,4] has focused on generating random uniform sample pages from the index. The samples have then been used to estimate index quality metrics, like index size and index freshness.…”
Many search engines and other web applications suggest auto-completions as the user types in a query. The suggestions are generated from hidden underlying databases, such as query logs, directories, and lexicons. These databases consist of interesting and useful information, but they are typically not directly accessible. In this paper we describe two algorithms for sampling suggestions using only the public suggestion interface. One of the algorithms samples suggestions uniformly at random and the other samples suggestions proportionally to their popularity. These algorithms can be used to mine the hidden suggestion databases. Example applications include comparison of popularity of given keywords within a search engine's query log, estimation of the volume of commercially oriented queries in a query log, and evaluation of the extent to which a search engine exposes its users to negative content. Our algorithms employ Monte Carlo methods in order to obtain unbiased samples from the suggestion database. Empirical analysis using a publicly available query log demonstrates that our algorithms are efficient and accurate. Results of experiments on two major suggestion services are also provided.
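As a sketch of how a Monte Carlo correction can turn a biased crawl of a suggest interface into uniform samples (an illustration of the flavor of such algorithms, not the paper's exact procedure): descend the prefix tree by random characters until the interface returns fewer results than its cap, at which point the whole subtree is visible; record the probability q(s) of the path taken; then use rejection to equalize. The function suggest(prefix), the cap K, and the lower bound q_min are all hypothetical assumptions here.

    import random
    import string

    K = 10                        # assumed cap on results per suggest() call
    ALPHABET = string.ascii_lowercase

    def sample_with_prob(suggest):
        # One random descent of the prefix trie exposed by suggest().
        # Returns a suggestion s together with the probability q(s)
        # of this procedure drawing it.
        prefix, q = "", 1.0
        while True:
            results = suggest(prefix)
            if not results:
                return None, None              # dead branch; caller retries
            if len(results) < K:               # below the cap: subtree fully exposed
                return random.choice(results), q / len(results)
            prefix += random.choice(ALPHABET)  # descend one random character
            q /= len(ALPHABET)

    def uniform_sample(suggest, q_min, max_tries=100000):
        # Monte Carlo rejection: accept a draw with probability q_min / q(s),
        # where q_min lower-bounds q over the database. Accepted draws are then
        # uniform, since P(accept s) is proportional to q(s) * q_min/q(s) = q_min.
        for _ in range(max_tries):
            s, q = sample_with_prob(suggest)
            if s is not None and random.random() < q_min / q:
                return s
        raise RuntimeError("no draw accepted; q_min may be set too low")

Popularity-proportional sampling follows the same pattern with a different acceptance weight. The practical cost is the probe budget: rare, deeply nested suggestions make q_min small and rejections frequent, which is why the efficiency results reported in the abstract matter.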