2018
DOI: 10.1145/3167970
Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Abstract: It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results? In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overla…

Cited by 10 publications (8 citation statements)
References 36 publications
“…The techniques for leveraging aggregate knowledge, such as iterative proportional fitting, could seamlessly be integrated into our approach. Chung et al. [9] estimate the impact of missing tuples on aggregate queries when several data sources are integrated, by observing recurring tuples. While ReStore similarly helps in the case of different data sources of varying quality, again only the single-table case is discussed there.…”
Section: Related Work
confidence: 99%
“…We therefore started to develop techniques that estimate not only the amount of missing data, based on techniques from [100,101,102], but also the impact those items might have on query results [19,20]. We assume a simple data integration scenario in which (semi-)independent data sources are integrated into a single database.…”
Section: Uncertainty as Unknown Unknowns
confidence: 99%
“…The overlap between the different data sets allows us to estimate the number of missing items using species estimation techniques [100]. Further, it is possible to make estimates about the values the missing items might have using our novel bucket estimator [19]. This way, Vizdom is able to indicate to the user how much impact missing data might have on the visualization.…”
Section: Uncertainty as Unknown Unknowns
confidence: 99%
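As a rough illustration of the overlap-based idea in the statement above, the following Python sketch computes a Chao1-style species estimate of the number of unseen tuples from how often tuples recur across sources, then naively extrapolates their contribution to a SUM aggregate. This is a sketch of the general species-estimation idea only, not the estimator from [19] or the techniques in [100]; the example data, variable names, and the mean-of-singletons heuristic are invented for illustration.

from collections import Counter

def chao1_missing(sightings):
    """Chao1-style estimate of the number of unseen tuples.

    sightings: one entry per (source, tuple) observation, so a tuple that
    appears in k sources occurs k times in the list.
    """
    freq = Counter(sightings)                       # how often each tuple was seen
    f1 = sum(1 for c in freq.values() if c == 1)    # singletons (seen in one source)
    f2 = sum(1 for c in freq.values() if c == 2)    # doubletons (seen in two sources)
    if f2 == 0:                                     # bias-corrected variant
        return f1 * (f1 - 1) / 2.0
    return f1 * f1 / (2.0 * f2)

def naive_sum_correction(values_by_tuple, sightings):
    """Very rough SUM correction: scale the mean value of the rarely seen
    (singleton) tuples by the estimated number of unseen tuples."""
    freq = Counter(sightings)
    singleton_vals = [values_by_tuple[t] for t, c in freq.items() if c == 1]
    if not singleton_vals:
        return 0.0
    mean_rare = sum(singleton_vals) / len(singleton_vals)
    return chao1_missing(sightings) * mean_rare

# Invented example: three sources report (business_id, revenue) pairs with overlap.
source_a = {"b1": 10.0, "b2": 7.5, "b3": 3.0}
source_b = {"b2": 7.5, "b3": 3.0, "b4": 5.0}
source_c = {"b3": 3.0, "b5": 9.0}

integrated = {**source_a, **source_b, **source_c}
sightings = [t for src in (source_a, source_b, source_c) for t in src]

print("observed SUM:", sum(integrated.values()))
print("estimated unseen tuples:", chao1_missing(sightings))
print("estimated missing SUM contribution:", naive_sum_correction(integrated, sightings))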
“…Unfortunately, existing species estimation techniques [30,10] for estimating the completeness of a set do not consider that workers can make mistakes. At the same time, in any real data cleaning scenario workers can make both false negative errors (a worker fails to identify a true error) and false positive errors (a worker misclassifies a clean item as dirty).…”
Section: Our Goal and Approach
confidence: 99%
“…To the best of our knowledge, this is the first work to consider species estimation for data quality quantification. Species estimation has been studied in prior work for distinct count estimation and crowdsourcing [18,29,10]. However, the previous work only considered species estimation on clean data, without the false positive and false negative errors that are inherent to the data quality estimation setting.…”
Section: Related Work
confidence: 99%