Class-imbalanced datasets are common across many domains, including health, security and banking. A typical supervised learning algorithm tends to be biased towards the majority class when dealing with imbalanced datasets. The learning task becomes more challenging when instances from different classes also overlap. In this paper, we propose an undersampling framework for handling class imbalance in binary datasets by removing potentially overlapping data points. Our methods are designed to identify and eliminate majority class instances from the overlapping region. Accurate identification and elimination of these instances maximises the visibility of the minority class instances while minimising excessive elimination of data, which reduces information loss. Four methods based on neighbourhood searching, each with a different criterion for identifying potentially overlapping instances, are proposed in this paper. Extensive experiments using simulated and real-world datasets were carried out. Results show performance comparable with state-of-the-art methods across common metrics, with exceptional and statistically significant improvements in sensitivity.
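The idea of removing majority class instances from the overlapping region by neighbourhood searching can be illustrated with a minimal sketch. The abstract does not specify the four criteria, so the rule used here (drop a majority instance if any of its k nearest neighbours belongs to the minority class) is an assumption chosen for illustration only; the function name `undersample_overlap` and the parameters `k` and `majority_label` are hypothetical.

```python
import numpy as np

def undersample_overlap(X, y, k=5, majority_label=0):
    """Illustrative neighbourhood-based undersampling (assumed criterion).

    A majority instance is treated as lying in the overlapping region,
    and is removed, if at least one of its k nearest neighbours
    (Euclidean distance) belongs to the minority class.
    Returns the reduced X, the reduced y, and a boolean removal mask.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances between all instances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour

    # Indices of the k nearest neighbours of every instance.
    nn_idx = np.argsort(dist, axis=1)[:, :k]

    # Flag instances that have a minority instance in their neighbourhood.
    has_minority_neighbour = (y[nn_idx] != majority_label).any(axis=1)

    # Only majority instances are ever removed; minority instances are kept.
    remove = (y == majority_label) & has_minority_neighbour
    keep = ~remove
    return X[keep], y[keep], remove


# Toy data: a majority cluster near the origin, a minority cluster near
# (5, 5), and one majority point sitting inside the minority cluster.
X_demo = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8],
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0],
    [5.5, 5.5],
])
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
X_res, y_res, removed = undersample_overlap(X_demo, y_demo, k=3)
```

In this toy example only the majority point placed inside the minority cluster is flagged, so no minority instance is lost and the rest of the majority class is retained. The brute-force distance matrix is quadratic in the number of instances; a tree-based neighbour search would be the usual choice for larger datasets.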
Technical variability during DNA capture probe printing remains an important obstacle to obtaining high-quality data from microarray experiments. While methods that use fluorescent labels for visualizing printed arrays prior to hybridization have been presented, the ability to measure spot density using label-free techniques would provide valuable information on spot quality without altering standard microarray protocols. In this study, we present the use of a photonic crystal biosensor surface and a high-resolution label-free imaging detection instrument to generate prehybridization images of spotted oligonucleotide microarrays. Spot intensity, size, level of saturation, and local background intensity were measured from these images. This information was used for the automated identification of missed spots (due to mechanical failure or sample depletion) as well as the assignment of a score that reflected the quality of each printed feature. Missed spots were identified with >95% sensitivity. Furthermore, filtering based on spot quality scores increased the pairwise correlation of posthybridization spot intensity between replicate arrays, demonstrating that label-free spot quality scores captured the variability in the microarray data. This imaging modality can be applied to the quality control of printed cDNA, oligonucleotide, and protein microarrays.
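The filtering step described above, in which low-quality spots are excluded before comparing replicate arrays, can be sketched as follows. This is only an illustration: the abstract does not say how the quality score is computed or which threshold was used, so the scores are taken as given inputs, the Pearson coefficient is an assumed choice of correlation measure, and the names `replicate_correlation` and `min_score` as well as all numbers are hypothetical.

```python
import numpy as np

def replicate_correlation(intensity_a, intensity_b, scores=None, min_score=None):
    """Pearson correlation of spot intensities between two replicate arrays.

    If per-spot quality scores and a threshold are supplied, only spots
    whose score is at least `min_score` enter the calculation.
    """
    a = np.asarray(intensity_a, dtype=float)
    b = np.asarray(intensity_b, dtype=float)

    # Start by keeping every spot, then apply the optional quality filter.
    mask = np.ones(a.shape, dtype=bool)
    if scores is not None and min_score is not None:
        mask = np.asarray(scores, dtype=float) >= min_score

    return float(np.corrcoef(a[mask], b[mask])[0, 1])


# Made-up intensities for six spots on two replicate arrays; the last two
# spots are given low quality scores and disagree between replicates.
array_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
array_b = [1.1, 2.0, 2.9, 4.2, 0.5, 9.0]
quality = [0.90, 0.80, 0.95, 0.90, 0.20, 0.10]

r_all = replicate_correlation(array_a, array_b)
r_filtered = replicate_correlation(array_a, array_b, quality, min_score=0.5)
```

With these illustrative values, the correlation computed over the retained spots is higher than the correlation over all spots, which is the qualitative effect the abstract reports for real arrays.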
Classification of imbalanced data remains an important field in machine learning. Several methods have been proposed to address the class imbalance problem, including data resampling, adaptive learning and cost-adjusting algorithms. Data resampling methods are widely used due to their simplicity and flexibility. Most existing resampling techniques aim at rebalancing the class distribution. However, class imbalance is not the only factor that affects the performance of the learning algorithm. Class overlap has been shown to have a higher impact on the classification of imbalanced datasets than the dominance of the negative class. In this paper, we propose a new undersampling method that eliminates negative instances from the overlapping region and hence improves the visibility of the minority instances. Testing and evaluating the proposed method on 36 public imbalanced datasets showed statistically significant improvements in classification performance.