Active Learning for Skewed Data Sets

Kazerouni, Abbas; Zhao, Qi; Xie, Jing; Tata, Sandeep; Najork, Marc

doi:10.48550/arxiv.2005.11442

Cited by 4 publications

(10 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In this direction, although uncertainty sampling [8,23,9] is an effective method for mining generic informative examples [8,22,23], they are known to be ineffective in mining minority-class examples [1,2,3,6]. This has been attributed to the fact that uncertainty sampling, being biased on previously seen examples, ignores regions that are underrepresented in the initially labeled dataset [14]. Several approaches [7,15,26,11] propose to account for the skewness by using prior information about class imbalance to boost query scores corresponding to tail classes.…”

Section: Related Workmentioning

confidence: 99%

“…As a workaround, recent approaches [5,14] propose augmenting uncertainty sampling with an exploration/ geometry/ redundancy criteria in the input space. The key insight is to allow exploration to new uncertain areas in the input space.…”

Section: Related Workmentioning

confidence: 99%

“…The key insight is to allow exploration to new uncertain areas in the input space. They either use bandit algorithms [14] or linear programming formulation [5] as an optimization objective to choose among different criteria. However, the exploration/ geometry/ redundancy criteria in the input space perform poorly in high-dimensional spaces.…”

Section: Related Workmentioning

confidence: 99%

“…Curves towards the top-right are better. [15], and 5) HAL (exploration based) [14]. p i k is softmax probability for example i and b k is the proportion of samples corresponding to the k th class among C total classes.…”

Section: Datasets and Setupmentioning

confidence: 99%

“…minority-class). Similarly, this problem naturally occurs in many domains like fraud detection, spam detection, autograding systems, and is so pervasive that even carefully crafted research datasets like ImageNet, Google Landmarks, and MS-CELEBS-1M are all class-imbalanced [1,14,10,24].…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Mining Minority-class Examples With Uncertainty Estimates

Singh¹,

Chu²,

Wang³

et al. 2021

Preprint

View full text Add to dashboard Cite

In the real world, the frequency of occurrence of objects is naturally skewed forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the training dataset. However, mining tail-class examples is a very challenging task. For instance, most of the otherwise successful uncertainty-based mining approaches struggle due to distortion of class probabilities resulting from skewness in data. In this work, we propose an effective, yet simple, approach to overcome these challenges. Our framework enhances the subdued tail-class activations and, thereafter, uses a one-class data-centric approach to effectively identify tail-class examples. We carry out an exhaustive evaluation of our framework on three datasets spanning over two computer vision tasks. Substantial improvements in the minority-class mining and fine-tuned model's task performance strongly corroborate the value of our method.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Datasets and Setupmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Mining Minority-class Examples With Uncertainty Estimates

Singh¹,

Chu²,

Wang³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

A rare event classification in the advanced manufacturing system: focused on imbalanced datasets

Lee¹

View full text Add to dashboard Cite

In many industrial applications, classification tasks are often associated with imbalanced class labels in training datasets. Imbalanced datasets can severely affect the accuracy of class predictions, and thus they need to be handled by appropriate data processing before analyzing the data since most machine learning techniques assume that the input data is balanced. When this imbalance problem comes with highdimensional space, feature extraction can be applied. In Chapter 2, we present two versions of feature extraction techniques called CL-LNN and RD-LNN in a time series dataset based on the nearest neighbor combined with machine learning algorithms to detect a failure of the paper manufacturing machinery earlier than its occurrence from the multi-stream system monitoring data. The nearest neighbor is applied to each separate feature instead of the whole 61 features to address the curse of dimensionality. Also, another technique for the skewness between class labels can be solved by either oversampling minorities or downsampling majorities in class. In the chapter 3, we are seeking to find a better way of downsampling by selecting the most informative samples in the given imbalanced dataset through the active learning strategy to mitigate the effect of imbalanced class labels. The data selection for downsampling is performed by the criterion used in optimal experimental designs, from which the generalization error of the trained model is minimized in a sequential manner under the penalized logistic regression as a classification model. We also suggest that the performance is significantly improved, especially with the highly imbalanced dataset, e.g., the imbalanced ratio is greater than ten if tuning hyper-parameter and costweight method are applied to the active downsampling technique. The research is further extended to cover nonlinearity using nonparametric logistic regression, and performance-based active learning (PBAL) is proposed to enhance the performance compared to the existing ones such as D-optimality and A-optimality.

show abstract

Rough-Fuzzy Based Synthetic Data Generation Exploring Boundary Region of Rough Sets to Handle Class Imbalance Problem

Naushin

Das

Nayak

et al. 2023

Axioms

View full text Add to dashboard Cite

Class imbalance is a prevalent problem that not only reduces the performance of the machine learning techniques but also causes the lacking of the inherent complex characteristics of data. Though the researchers have proposed various ways to deal with the problem, they have yet to consider how to select a proper treatment, especially when uncertainty levels are high. Applying rough-fuzzy theory to the imbalanced data learning problem could be a promising research direction that generates the synthetic data and removes the outliers. The proposed work identifies the positive, boundary, and negative regions of the target set using the rough set theory and removes the objects in the negative region as outliers. It also explores the positive and boundary regions of the rough set by applying the fuzzy theory to generate the samples of the minority class and remove the samples of the majority class. Thus the proposed rough-fuzzy approach performs both oversampling and undersampling to handle the imbalanced class problem. The experimental results demonstrate that the novel technique allows qualitative and quantitative data handling.

show abstract

Active Learning for Skewed Data Sets

Cited by 4 publications

References 29 publications

Mining Minority-class Examples With Uncertainty Estimates

Mining Minority-class Examples With Uncertainty Estimates

A rare event classification in the advanced manufacturing system: focused on imbalanced datasets

Rough-Fuzzy Based Synthetic Data Generation Exploring Boundary Region of Rough Sets to Handle Class Imbalance Problem

Contact Info

Product

Resources

About