2019
DOI: 10.1186/s40537-019-0205-4

Exploring and cleaning big data with random sample data blocks

Abstract: Sampling-based approaches have been adopted to alleviate the burden of big data volume, not only when approximate results are as useful as exact ones [1][2][3][4][5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6][7][8][9]. It is common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10][11][12][13][14][15][16][17][1…

Cited by 21 publications (11 citation statements)
References: 48 publications
“…The selected mechanisms regarding sample-based data cleansing are discussed as follows. Salloum et al (2019) proposed a sampling-based method for exploring and cleansing huge datasets using small computing clusters. The random sample partition (RSP) is a data model represents a data set as a collection of disjoint distributed partitions of data, named RSP blocks.…”
Section: Sample-based Mechanisms (mentioning)
confidence: 99%
“…To resolve such problem, we take the random sampling algorithm 15 into the task metadata indexing process. The TaskMetas are grouped according to the actual transaction data size, and then uniformly distributed to different partitions of Spark RDD.…”
Section: System Design and Optimization (mentioning)
confidence: 99%
“…With the advent of big data, many text data has been created, and research using text mining is being actively conducted [12]- [18]. In text-mining-related studies, researches on topics such as morphological analysis of texts, methodological research related to preprocessing, topic modeling, emotional dictionary construction, and emotional analysis have been reported.…”
Section: A Related Work (mentioning)
confidence: 99%