Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems 2018
DOI: 10.1145/3196959.3196978

Distinct Sampling on Streaming Data with Near-Duplicates

Abstract: In this paper we study how to perform distinct sampling in the streaming model where data contain near-duplicates. The goal of distinct sampling is to return a distinct element uniformly at random from the universe of elements, given that all the near-duplicates are treated as the same element. We also extend the result to the sliding window cases in which we are only interested in the most recent items. We present algorithms with provable theoretical guarantees for datasets in the Euclidean space, and also ver…
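To make the problem statement concrete, here is a minimal sketch (not the paper's algorithm) of min-hash style distinct sampling where near-duplicates are collapsed into one group. The grid-snapping step, the threshold alpha, and the names DistinctSampler and cell_id are illustrative assumptions: points within distance alpha are treated as near-duplicates, and snapping to a grid of side alpha is only a crude proxy for that grouping.

```python
import hashlib

def cell_id(point, alpha):
    """Snap a point to a grid cell of side alpha.

    Crude proxy for grouping near-duplicates: points within distance
    alpha of each other often (but not always) share a cell.
    """
    return tuple(int(coord // alpha) for coord in point)

def uniform_hash(key):
    """Deterministic hash of a cell id to a value in [0, 1)."""
    digest = hashlib.sha256(repr(key).encode()).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

class DistinctSampler:
    """Min-hash style distinct sampler over a stream of points.

    Keeps a representative point of the group whose hash is smallest,
    so each distinct group is returned with roughly equal probability.
    """
    def __init__(self, alpha):
        self.alpha = alpha
        self.best_hash = None
        self.sample = None

    def update(self, point):
        value = uniform_hash(cell_id(point, self.alpha))
        if self.best_hash is None or value < self.best_hash:
            self.best_hash = value
            self.sample = point

    def result(self):
        return self.sample

# Example: a stream of 2-D points where pairs closer than alpha = 0.1
# are considered near-duplicates; the first two points form one group.
sampler = DistinctSampler(alpha=0.1)
for p in [(0.01, 0.02), (0.015, 0.021), (5.0, 5.0), (9.3, 1.1)]:
    sampler.update(p)
print(sampler.result())
```

The sliding-window variant mentioned in the abstract would need extra machinery (for example, retaining timestamps so expired items can be discarded); this sketch covers only the basic streaming case.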

Cited by 5 publications (2 citation statements)
References: 41 publications
“…Generally, solutions for recognizing ADLs are underpinned with rule-based or knowledge-driven supported by conventional Machine Learning (ML) algorithms [2,3]. In such environments, the embedded or wireless sensors generate high volumes of streaming data [4], which in a real world setting can contain huge amounts of missing values or duplicate values [5]. Such noisy and imprecise data may lead to one of the major causes of an erroneous classification or imprecise recognition.…”
Section: Introduction
confidence: 99%
“…In such environments, the real-world streaming dataset is almost of the same content as near-duplicates [1]. This leads to the noisy and imprecise state, causing an erroneous classification and recognition.…”
Section: Introduction
confidence: 99%