2019 International Joint Conference on Neural Networks (IJCNN) 2019
DOI: 10.1109/ijcnn.2019.8851920

Identifying Mislabeled Instances in Classification Datasets

Abstract: A key requirement for supervised machine learning is labeled training data, which is created by annotating unlabeled data with the appropriate class. Because this process often cannot be done by machines, labeling needs to be performed by human domain experts. This process tends to be expensive in both time and money, and is prone to errors. Additionally, reviewing an entire labeled dataset manually is often prohibitively costly, so many real-world datasets contain mislabeled instances. To address this…

Cited by 21 publications (12 citation statements); references 16 publications.
“…On the horizontal axis of each cell is the epoch of training. For each dataset–network combination, the red vertical line marks the effective beginning of TPT (i.e., the epoch when the training accuracy reaches 99.6% for ImageNet and 99.9% for the remaining datasets); we do not use 100% because it has been reported (24–26) that several of these datasets contain inconsistencies and mislabels, which sometimes prevent absolute memorization. Additionally, orange lines denote measurements on the network classifier, while blue lines denote measurements on the activation class means.…”
Section: Results
confidence: 99%
“…A more recent study by Müller and Markert (2019) introduced a pipeline that can identify mislabeled data in numerical, image, and natural language datasets. The efficacy of their pipeline was evaluated by introducing noisy data, or data that was intentionally changed to be different from its original label, in an amount of 1%, 2%, or 3%, into 29 well-known real-world and synthetic classification datasets.…”
Section: Algorithmic Curation Of Other Datasets
confidence: 99%
“…Machine learning-based majority voting and consensus filtering methods have been applied extensively in prior research to classification datasets focused on topics such as finance, medical diagnosis, and news media (Brodley & Friedl, 1999; Ekambaram et al., 2017; Guan et al., 2011; Müller & Markert, 2019; Samami et al., 2020). However, to the best of our knowledge, these methods have not yet been applied to cyberbullying datasets.…”
Section: Algorithmic Curation Of Other Datasets
confidence: 99%
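The majority-voting filter cited in the statement above (Brodley & Friedl, 1999, and successors) can be sketched in a few lines. The following is an illustrative Python sketch, not code from any of the cited papers: an ensemble of classifiers produces out-of-fold predictions for each training sample, and samples whose given label is rejected by a majority of the ensemble are flagged as suspects. The choice of classifiers, the toy data, and the noise rate are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

def majority_vote_filter(X, y, classifiers, cv=5):
    """Flag samples whose given label disagrees with the majority of
    out-of-fold predictions from an ensemble of classifiers."""
    # one row of out-of-fold predictions per classifier
    preds = np.stack([cross_val_predict(clf, X, y, cv=cv) for clf in classifiers])
    votes_against = (preds != y).sum(axis=0)      # classifiers rejecting the label
    return votes_against > len(classifiers) / 2   # strict majority -> suspect

# toy data with a few labels flipped to simulate annotation noise (~3%)
X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y), size=9, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

classifiers = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]
suspects = majority_vote_filter(X, y_noisy, classifiers)
```

Consensus filtering differs only in the threshold: a sample is flagged when *all* classifiers reject its label, which trades recall for precision.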
“…Bhardwaj et al. (2010) use statistical methods to find annotators whose annotations differ considerably from those of the remaining annotators, and use manual inspection to decide the verdict for samples annotated by these annotators. Müller and Markert (2019) classify training samples with the lowest gold label probabilities under a robust classifier as potentially mislabelled, followed by manual review for the final decision. Zhang and Sugiyama (2021) detect samples with erroneous labels using an instance-dependent noise model along with instance-based embedding to capture instance-specific label corruption.…”
Section: Related Work
confidence: 99%
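The low gold-label-probability heuristic attributed to Müller and Markert (2019) in the statement above can be sketched as follows. This Python sketch uses scikit-learn; the random forest standing in for the "robust classifier", the iris data, and the review budget `k` are assumptions for illustration, not the authors' exact setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# out-of-fold predicted class probabilities, so each sample is scored
# by a model that never saw it during training
proba = cross_val_predict(
    RandomForestClassifier(random_state=0), X, y,
    cv=5, method="predict_proba",
)

# probability the classifier assigns to each sample's given (gold) label
gold_prob = proba[np.arange(len(y)), y]

# hand the k lowest-confidence samples to a human for manual review
k = 10
suspects = np.argsort(gold_prob)[:k]
```

Using out-of-fold probabilities rather than training-set probabilities matters here: a model scored on its own training data tends to memorize mislabeled points and assign them high gold-label probability.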