2023
DOI: 10.1016/j.infsof.2023.107268

A survey on dataset quality in machine learning

Cited by 54 publications (11 citation statements)
References 22 publications
“…However, manually labelled pixel-level image samples can provide a large amount of detailed information and local features, which is more conducive to improving the training efficiency and segmentation accuracy of the crack segmentation network model [56, 57]. Gong et al [58] proposed the concept of dataset life cycle and summarised it into five stages: dataset collection, dataset labelling, dataset storage, dataset testing and dataset destruction. Dataset labelling is a key process for transforming raw data into machine-identifiable information, and the quality of labelling is an important factor affecting the performance of machine learning models.…”
Section: Discussion (mentioning)
confidence: 99%
“…This methodology aligns with a quantitative research approach, which involves the systematic collection, analysis, and interpretation of numerical data to address research objectives as shown in Figure 1. By applying quantitative techniques such as data cleaning, preprocessing, model selection, and evaluation metrics, the research aims to provide practical insights and predictive models for property price prediction [18]. Additionally, the incorporation of statistical analyses, such as ANOVA, further enhances the rigor and reliability of the findings.…”
Section: Methods (mentioning)
confidence: 99%
“…Text datasets are essential in Natural Language Processing (NLP) (Fanni et al, 2023) tasks such as sentiment analysis (Kapočiūtė-Dzikienė & Salimbajevs, 2022; Shaik et al, 2023; Mercha & Benbrahim, 2023), text classification (Štrimaitis et al, 2022; Palanivinayagam et al, 2023) and semantic analysis (Maulud et al, 2021). These main types of datasets are discussed by Gong et al (2023). Provides a summary of these datasets, indicating the main types of application tasks, the amount of data, and the data content of the datasets.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, the main aim of this paper is to provide a comprehensive review of research on dataset quality. Because the effectiveness of ML models greatly hinges on the quality of the dataset used to train and evaluate the models (Gong et al, 2023).…”
Section: Introduction (mentioning)
confidence: 99%