2022
DOI: 10.14569/ijacsa.2022.0130933

An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Abstract: While big data benefits are numerous, most of the collected data is of poor quality and, therefore, cannot be effectively used as it is. One of the leading big data quality challenges in pre-processing is data duplication. Indeed, the gathered big data are usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple dedup…
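
To make the deduplication task concrete, the sketch below shows a minimal, illustrative pairwise duplicate-detection pass in Python using blocking and fuzzy string matching. The field names, the blocking key, and the 0.85 similarity threshold are assumptions chosen for the example; this is not the paper's online continuous-learning framework, only a simplified baseline of the general technique.

```python
# Illustrative sketch only: block records by a coarse key, then compare
# candidate pairs within each block with a fuzzy string similarity.
# The "zip"/"name" fields and the 0.85 threshold are assumptions for this example.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def deduplicate(records, block_key="zip", name_key="name", threshold=0.85):
    """Group records into blocks, then drop near-identical records within each block."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(rec[block_key], []).append(idx)

    duplicates = set()
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            if similarity(records[i][name_key], records[j][name_key]) >= threshold:
                duplicates.add(j)  # keep the first occurrence, drop the later one
    return [rec for idx, rec in enumerate(records) if idx not in duplicates]

if __name__ == "__main__":
    data = [
        {"name": "John Smith",  "zip": "10001"},
        {"name": "Jon Smith",   "zip": "10001"},   # near-duplicate of the first record
        {"name": "Alice Brown", "zip": "94105"},
    ]
    print(deduplicate(data))  # the near-duplicate "Jon Smith" entry is removed
```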

Cited by 8 publications (4 citation statements) · References 26 publications

“…Xinyao et al. 38 suggested a method that provides excellent user-defined access control and secure deduplication, protecting data confidentiality and successfully fending off threats. Elouataoui et al. 39 discussed safe deduplication techniques for cloud storage that boost storage effectiveness while protecting data confidentiality and integrity. Cho et al. 40 note that storage optimization solutions, which fall into three categories (content-based, redaction, and replication), are required due to the increase in blockchain transactions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Data Uniqueness refers to the fact that an actual entity should not be recorded more than once in a dataset [30]. A uniqueness anomaly is a redundancy of data entries referring to the same real-world entity.…”
Section: Uniqueness (mentioning)
confidence: 99%
“…Uniqueness is a quality dimension that addresses duplicated data referring to the same real-world entity within a dataset [31]. Duplicated data can arise for various reasons, such as data entry errors, merging of datasets, or system glitches.…”
Section: Uniqueness and Consistency (mentioning)
confidence: 99%
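
As a rough illustration of how the Uniqueness dimension described in these citing works can be quantified, the snippet below scores a dataset as the share of rows that are distinct with respect to a set of entity-identifying columns. The column names and the formula are assumptions made for the example; the cited papers do not prescribe this exact metric.

```python
# Hedged sketch: score Uniqueness as the fraction of non-duplicated rows.
# "customer_id" and "email" are made-up entity-identifying columns.
import pandas as pd

def uniqueness_score(df: pd.DataFrame, entity_cols) -> float:
    """Fraction of rows that are unique with respect to the entity-identifying columns."""
    total = len(df)
    if total == 0:
        return 1.0
    distinct = df.drop_duplicates(subset=entity_cols).shape[0]
    return distinct / total

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
})
print(uniqueness_score(df, ["customer_id", "email"]))  # 0.75: one duplicated entry
```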