Mining Data with Rare Events: A Case Study

Veras, R.C.; Meira, Sílvio R. L.; Oliveira, Adriano L. I.; Melo, Bruno J. M.

doi:10.1109/ictai.2007.71

Cited by 48 publications

(20 citation statements)

References 11 publications

“…Seiffert and colleagues [21] showed that data-sampling approaches can increase classification performance when rare classes comprise 0.1-1.6% of a dataset. Other suitable approaches include oversampling, undersampling, cost-sensitive learning, ensemble methods, and constructing k neural networks.…”

Section: Identifying Rare Classes (mentioning)

confidence: 99%

Using statistical text classification to identify health information technology incidents

Chai, Anthony, Coiera, et al. 2013

J Am Med Inform Assoc

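The citation statement above names oversampling and undersampling among the approaches for rare classes. As a rough illustration only, the simplest random variants of both can be sketched as follows. The function names, the 1 = positive / 0 = negative label convention, and the choice to balance the classes exactly are assumptions made for this sketch, not details taken from the cited work; both functions also assume each class has at least one instance.

```python
import random


def _split_by_class(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(rows, labels, seed=0):
    """Drop majority rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Duplicate minority rows at random until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = _split_by_class(labels)
    # Sampling with replacement, so the same minority row may repeat.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    keep = sorted(minority + majority + extra)
    return [rows[i] for i in keep], [labels[i] for i in keep]
```

Undersampling discards majority data, while oversampling keeps everything but repeats minority rows; which trade-off is acceptable depends on the dataset.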

“…These classifiers were selected to provide good coverage of various ML model families. Performance-wise, the three classifiers are regarded favorably, and they incorporate both ensemble and non-ensemble algorithms, providing a reasonable breadth of fraud detection results for assessing the impact of rarity in Big Data [33,34]. In this section, we describe each model and note configuration and hyperparameter changes that differ from the default settings.…”

Section: Classifiers (mentioning)

confidence: 99%

Investigating class rarity in big data

et al. 2020

Self Cite

Introduction: When called upon to define big data, researchers and practitioners in the field of data science frequently refer to the six V's: volume, variety, velocity, variability, value, and veracity [1]. Volume, most certainly the best-known property of big data, is associated with the profusion of data produced by an organization. Variety covers the handling of structured, unstructured, and semi-structured data. Velocity takes into account how quickly data is manufactured, issued, and dealt with. Variability refers to the fluctuations in data. Value is often regarded as a critical attribute because it is required for effective decision-making. Veracity is associated with the fidelity of data.

Abstract: In Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners, with all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, with regard to both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.
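The abstract describes generating rarity subsets by randomly removing positive instances and scoring models with the Geometric Mean. A minimal sketch of both ideas follows. The function names and the positive counts in the usage note are hypothetical, and the Geometric Mean is assumed here to follow the common definition, the square root of true positive rate times true negative rate; the abstract itself does not spell out the formula.

```python
import math
import random


def make_rarity_subsets(labels, positive_counts, seed=0):
    """Return one sorted list of kept indices per target positive count.

    Every negative (0) instance is kept; positive (1) instances are
    chosen at random so that only `count` of them remain. Each count
    must not exceed the number of positives available.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for count in positive_counts:
        kept_pos = rng.sample(pos, count)
        subsets.append(sorted(neg + kept_pos))
    return subsets


def geometric_mean(y_true, y_pred):
    """sqrt(TPR * TNR) for binary labels (assumed definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)
```

For example, `make_rarity_subsets(labels, [8, 4, 2])` would yield three index lists with 8, 4, and 2 positives respectively, each retaining all negatives. Under the assumed definition, a model that predicts only the majority class scores 0 on the Geometric Mean, whatever its accuracy.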

“…The actual size in bytes to be considered "sufficient" is dependent on the cluster's underlying hardware. Class imbalance has been shown to present unique challenges and negative effects to model performance [45][46][47][48].…”

Section: Medicare Part B Dataset (mentioning)

confidence: 99%

A parallel and distributed stochastic gradient descent implementation using commodity clusters

Kennedy, Khoshgoftaar, Villanustre, et al. 2019

J Big Data

Self Cite

Introduction: Training neural networks effectively and efficiently is an important component of Deep Learning. Large neural networks can consist of dozens, hundreds or even thousands of layers, each with thousands of artificial neurons. Depending on the network's architecture, each of these neurons is connected to a large number of other neurons, where each connection has a trainable weight parameter that determines how the network responds to input signals. In the context of this paper, the effective training of these large complex networks is accomplished through the use of the computationally expensive process of backpropagation. Additionally, neural networks benefit from training on Big Data, as typically more data produces more performant models [1]. For example, using the ImageNet database, AlexNet was trained on roughly 1.2 million images, and at the time achieved state-of-the-art results [2]. Problems of this magnitude are common and thus researching parallel network optimization on distributed and parallel systems is highly important.

Abstract: Deep Learning is an increasingly important subdomain of artificial intelligence, which benefits from training on Big Data. The size and complexity of the model combined with the size of the training dataset makes the training process very computationally and temporally expensive. Accelerating the training process of Deep Learning using cluster computers faces many challenges ranging from distributed optimizers to the large communication overhead specific to systems with off-the-shelf networking components. In this paper, we present a novel distributed and parallel implementation of stochastic gradient descent (SGD) on a distributed cluster of commodity computers. We use high-performance computing cluster (HPCC) systems as the underlying cluster environment for the implementation. We describe how the HPCC systems platform provides the environment for distributed and parallel Deep Learning, how it provides a facility to work with third-party open source libraries such as TensorFlow, and detail our use of third-party libraries and HPCC functionality for implementation. We provide experimental results that validate our work and show that our implementation can scale with respect to both dataset size and the number of compute nodes in the cluster.
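The abstract does not say how gradients are combined across nodes, so the following is only a generic illustration of data-parallel SGD, not the HPCC implementation. It assumes synchronous gradient averaging, simulates the nodes inside a single process, and fits a one-variable linear model in place of a neural network. All names and hyperparameters are hypothetical.

```python
import random


def minibatch_gradient(w, b, batch):
    """Gradient of mean squared error for y ~ w*x + b on one mini-batch."""
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def data_parallel_sgd(shards, lr=0.2, steps=2000, batch_size=5, seed=0):
    """Simulated synchronous data-parallel SGD.

    Each shard stands in for one node's local data. At every step each
    simulated node draws a mini-batch from its own shard, the per-node
    gradients are averaged, and a single shared update is applied.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grads = [
            minibatch_gradient(w, b, rng.sample(shard, batch_size))
            for shard in shards
        ]
        w -= lr * sum(g[0] for g in grads) / len(grads)
        b -= lr * sum(g[1] for g in grads) / len(grads)
    return w, b


# Noise-free toy data on the line y = 2x + 1, split across four "nodes".
data = [(i / 40.0, 2.0 * (i / 40.0) + 1.0) for i in range(40)]
shards = [data[k::4] for k in range(4)]
w, b = data_parallel_sgd(shards)
```

In a real cluster the averaging step is where communication cost arises, which is the overhead the abstract highlights for commodity networking hardware.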
