19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), 2007
DOI: 10.1109/ictai.2007.71

Mining Data with Rare Events: A Case Study

Abstract: The performance of classification models can be negatively impacted if the data on which they are trained contains very rare events. While recent research has investigated the issue of class imbalance, few if any studies address issues related to the handling of extreme imbalance (rare events), where the minority class can account for as little as 0.1% of the training data. This work investigates the effect of dataset size and class distribution on classification performance when examples from the minority cla…

Cited by 48 publications (20 citation statements)
References 11 publications
“…Seiffert and colleagues 21 showed that data-sampling approaches can increase classification performance when rare classes comprise 0.1-1.6% of a dataset. Other suitable approaches include oversampling, undersampling, cost-sensitive learning, ensemble methods, and constructing k neural networks.…”
Section: Identifying Rare Classes (mentioning)
confidence: 99%
“…These classifiers were selected to provide good coverage of various ML model families. Performance-wise, the three classifiers are regarded favorably, and they incorporate both ensemble and non-ensemble algorithms, providing a reasonable breadth of fraud detection results for assessing the impact of rarity in Big Data [33,34]. In this section, we describe each model and note configuration and hyperparameter changes that differ from the default settings.…”
Section: Classifiers (mentioning)
confidence: 99%
“…The actual size in bytes to be considered "sufficient" is dependant on the cluster's underlying hardware. Class imbalance has been shown to present unique challenges and negative effects to model performance [45][46][47][48].…”
Section: Medicare Part B Dataset (mentioning)
confidence: 99%