2017 IEEE International Congress on Big Data (BigData Congress)
DOI: 10.1109/bigdatacongress.2017.24
Bleach: A Distributed Stream Data Cleaning System

Abstract: In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and the ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it sup…
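The abstract's mention of an incremental equivalence class algorithm can be made concrete with a rough sketch. The Python below is purely illustrative and is not Bleach's implementation; the rule format, class layout and majority-vote repair policy are assumptions. Cells whose values must agree under an FD-style rule X → Y are grouped into one equivalence class per X value, a violation is flagged as soon as a class holds two distinct Y values, and the class is repaired to its most frequent value.

```python
# Illustrative sketch only (not Bleach's implementation): incremental
# equivalence classes for a single FD-style rule X -> Y. Cells of Y that
# share the same X value must agree; each class is repaired to its most
# frequent observed value.
from collections import Counter, defaultdict

class EquivalenceClasses:
    def __init__(self):
        self.values = defaultdict(Counter)   # X value -> Counter of Y values seen
        self.members = defaultdict(list)     # X value -> ids of records in the class

    def ingest(self, record_id, x_value, y_value):
        """Add one (X, Y) cell pair from an incoming record; return True on violation."""
        self.values[x_value][y_value] += 1
        self.members[x_value].append(record_id)
        return len(self.values[x_value]) > 1   # two distinct Y values -> violation

    def repair_value(self, x_value):
        """Majority vote inside the class picks the repair target."""
        counts = self.values[x_value]
        return counts.most_common(1)[0][0] if counts else None

# Usage with the (assumed) rule zipcode -> city on a small dirty stream.
ec = EquivalenceClasses()
stream = [("75013", "Paris"), ("75013", "Pariss"), ("75013", "Paris")]
for rid, (zipcode, city) in enumerate(stream):
    if ec.ingest(rid, zipcode, city):
        print(f"violation at record {rid}; repair class to {ec.repair_value(zipcode)!r}")
```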

Cited by 9 publications (8 citation statements) | References 16 publications
“…However, their work assumes a fixed ontology and does not provide a real application scenario. Furthermore, the Bleach system [17] focuses on the detection and cleaning of inconsistent data. The rule set is adaptable, as in our case.…”
Section: Discussion and Related Work
confidence: 99%
“…This set can be processed in parallel, since the products are independent of each other (R9). Windows are needed on a stream, since the amount of data would otherwise increase infinitely [17].…”
Section: No Additional Messages Additional Messages Can Be An…
confidence: 99%
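The windowing point in the statement above is easy to illustrate: without windows, the state needed to clean an unbounded stream grows without bound. The sketch below is a minimal, assumed tumbling-window scheme, not a detail taken from [17]; it keeps cleaning state only for the current window and drops it when the window closes, assuming events arrive roughly in time order.

```python
# Minimal sketch of a tumbling window that bounds cleaning state on an
# unbounded stream; the window size and in-order arrival are assumptions,
# not details taken from [17].
WINDOW_SIZE = 60.0   # seconds per window (illustrative choice)

class WindowedState:
    def __init__(self):
        self.window_id = None
        self.state = {}          # per-window state only; never grows past one window

    def process(self, event_time, key, value):
        wid = int(event_time // WINDOW_SIZE)
        if wid != self.window_id:            # window closed: discard its state
            self.window_id, self.state = wid, {}
        self.state[key] = value              # bounded by the keys seen in one window

# Usage: feed timestamped events; state from past windows is dropped.
ws = WindowedState()
for t, k, v in [(1.0, "sensor-1", 20.5), (5.0, "sensor-2", 21.0), (65.0, "sensor-1", 20.7)]:
    ws.process(t, k, v)
print(ws.window_id, ws.state)   # only the latest window's state remains
```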
“…The solution proposed for detecting and repairing dirty data in Gohel et al (2017) resolves errors such as inconsistency, inaccuracy, and redundancy by treating multiple types of quality rules holistically. In Tian et al (2017), a rule-based data-cleaning technique is proposed, where a set of rules defines how data should be cleaned. Moreover, the research of Zhang et al (2017) presents an innovative method for correcting values in time series that are considered abnormal, through anomaly detection, where the authors use the method of iterative minimum repairing (IMR).…”
Section: Policycloud Ingest Analytics
confidence: 99%
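Iterative minimum repairing, as mentioned above, repairs anomalous time-series values through a sequence of small changes. The snippet below is only a simplified illustration of that general idea, not the published IMR algorithm of Zhang et al (2017): it repeatedly picks the point deviating most from a local neighbour-average estimate and nudges it just inside a tolerance band.

```python
# Simplified illustration of iterative, minimal time-series repairs; this
# is NOT the published IMR algorithm, only the general idea of repeatedly
# fixing the most anomalous point with the smallest change that brings it
# near a local estimate.
def iterative_min_repair(series, tol=2.0, max_iters=100):
    s = list(series)
    if len(s) < 3:
        return s
    for _ in range(max_iters):
        # Local estimate for interior points: average of the two neighbours.
        deviations = [(abs(s[i] - (s[i - 1] + s[i + 1]) / 2), i)
                      for i in range(1, len(s) - 1)]
        worst_dev, i = max(deviations)
        if worst_dev <= tol:
            break
        estimate = (s[i - 1] + s[i + 1]) / 2
        # Minimal repair: move the point just onto the tolerance boundary.
        s[i] = estimate + tol if s[i] > estimate else estimate - tol
    return s

print(iterative_min_repair([1.0, 1.1, 9.0, 1.2, 1.3]))   # the spike at index 2 is pulled down
```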
“…Other algorithms in the literature exploit the violation-graph approach, which is based on user-defined rules, to perform data cleaning, such as the equivalence class algorithm [22] or the holistic data cleaning algorithm [23]. In [24], a stream data cleaning system for categorical and numerical data is proposed, which relies on compact data structures to maintain the necessary state to repair data. Dirty data are repaired using the concept of a distributed violation graph, an extension of the violation-graph approach aimed at improving scalability.…”
Section: Related Work
confidence: 99%
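The violation-graph approach referenced in this statement can be sketched as follows; the details (cell encoding, FD-only rules, component extraction) are assumptions for illustration rather than the algorithms of [22]-[24]. Cells implicated in a rule violation become nodes, each detected violation adds an edge, and connected components of the resulting graph form the groups of cells that must be repaired together.

```python
# Illustrative sketch of the violation-graph idea (details are assumptions,
# not taken from [22]-[24]): cells involved in a violation of the FD
# lhs -> rhs become nodes, each violation adds an edge, and connected
# components are the units repaired together.
from collections import defaultdict

def violation_graph(records, lhs, rhs):
    by_lhs, edges = defaultdict(list), []
    for rid, rec in enumerate(records):
        for other in by_lhs[rec[lhs]]:
            if records[other][rhs] != rec[rhs]:           # same lhs, different rhs -> violation
                edges.append(((other, rhs), (rid, rhs)))  # a cell is (record id, attribute)
        by_lhs[rec[lhs]].append(rid)
    return edges

def connected_components(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                comp.add(n)
                stack.extend(adj[n] - seen)
        components.append(comp)
    return components

records = [{"zip": "75013", "city": "Paris"},
           {"zip": "75013", "city": "Pariss"},
           {"zip": "69001", "city": "Lyon"}]
print(connected_components(violation_graph(records, "zip", "city")))
```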