In the era of big data, analyzing and extracting knowledge from large-scale data sets is a challenging task, and the application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods represent the original training data set with a reduced number of instances, with the aim of speeding up classification and reducing both the storage requirements and the noise sensitivity of the nearest neighbor rule. However, standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework that distributes these algorithms across a cluster of computing nodes, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speed-up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool for enhancing the performance of the nearest neighbor classifier on big data. (Isaac Triguero, triguero@decsai.ugr.es; Daniel Peralta, dperalta@decsai.ugr.es; Jaume Bacardit, jaume.bacardit@newcastle.ac.uk; Salvador García, sglopez@ujaen.es; Francisco Herrera, herrera@decsai.ugr.es. Preprint submitted to Neurocomputing, March 3, 2014.)
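The map/reduce split described in this abstract can be sketched in miniature. The Python sketch below is purely illustrative and not the paper's implementation: `condense` is a toy stand-in for a prototype reduction method (in the spirit of Hart's condensed nearest neighbor), and the reducer uses the simplest conceivable fusion strategy, a plain union of the partial reduced sets, whereas the paper evaluates several such strategies.

```python
import random
from functools import reduce

def condense(block, n_seed=1):
    """Toy prototype reduction (condensed-NN idea): keep only instances
    that the prototypes retained so far would misclassify."""
    prototypes = block[:n_seed]
    for x, y in block[n_seed:]:
        # 1-NN prediction against the current prototype set
        nearest = min(prototypes,
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        if nearest[1] != y:          # misclassified -> keep as a prototype
            prototypes.append((x, y))
    return prototypes

def map_phase(dataset, n_blocks):
    """Split the training set into disjoint blocks and reduce each one."""
    random.shuffle(dataset)
    blocks = [dataset[i::n_blocks] for i in range(n_blocks)]
    return [condense(block) for block in blocks]

def reduce_phase(partial_sets):
    """Join strategy: simple union of the partial reduced sets."""
    return reduce(lambda acc, s: acc + s, partial_sets, [])
```

On well-separated classes the union of partial reduced sets is typically far smaller than the original training set, which is the storage and speed gain the abstract refers to.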
Imaging flow cytometry (IFC) produces up to 12 spectrally distinct, information-rich images of single cells at a throughput of 5,000 cells per second. Yet cell populations are often still studied using manual gating, a technique with several drawbacks, so it would be advantageous to replace it with an automated process. Ideally, this automated process would be based on stain-free measurements, as the currently used staining techniques are expensive and potentially confounding. These stain-free measurements originate from the brightfield and darkfield image channels, which capture transmitted and scattered light, respectively. Realizing this automated, stain-free approach requires advanced machine learning (ML) methods. Previous work has successfully tested this approach on cell cycle phase classification with both a classical ML approach based on manually engineered features and a deep learning (DL) approach. In this work, we compare both approaches extensively on the problem of white blood cell classification. Four human whole blood samples were assayed on an ImageStream-X MK II imaging flow cytometer. Two samples were stained for the identification of eight white blood cell types, while the two others were stained for the identification of resting and active eosinophils. For both data sets, four ML classifiers were evaluated on stain-free imagery with stratified 5-fold cross-validation. On the white blood cell data set, the best results were 0.778 and 0.703 balanced accuracy for classical ML and DL, respectively; on the eosinophil data set, they were 0.871 and 0.856. We conclude that classifying cell types based only on stain-free images is possible with all four classifiers. Notably, we also find that the DL approaches tested in this work do not outperform the approaches based on manually engineered features. © 2019 International Society for Advancement of Cytometry
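The evaluation protocol named in this abstract, stratified 5-fold cross-validation scored with balanced accuracy, can be illustrated without any IFC data. The helper names below are our own; balanced accuracy (mean per-class recall) is the reported metric, and it matters here because white blood cell types are heavily imbalanced, so plain accuracy would be dominated by the majority class.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs, preserving per-class proportions
    in every fold (the 'stratified' part of stratified k-fold CV)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):     # deal each class round-robin
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall: each class counts equally regardless of size."""
    per_class = defaultdict(lambda: [0, 0])   # class -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        if t == p:
            per_class[t][0] += 1
    return sum(c / n for c, n in per_class.values()) / len(per_class)
```

For example, a classifier that predicts the majority class for everything scores 0.5 balanced accuracy on a two-class problem however skewed the class sizes are, which is why the 0.778 and 0.871 figures above are meaningful.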
Fingerprint recognition is a reliable means of verifying or identifying people in biometrics. Fingerprints are regarded as valuable traits because of properties long observed by experts: their distinctiveness, their permanence on humans, and their performance in real applications. Among the main stages of fingerprint recognition, the automated matching phase has received much attention from the early years of the field to the present. This paper reviews and categorizes the vast number of fingerprint matching methods proposed in the specialized literature. In particular, we focus on local minutiae-based matching algorithms, which provide good performance with an excellent trade-off between efficacy and efficiency. We identify the main properties and differences of existing methods. We then include an experimental evaluation of the most representative local minutiae-based matching models in both verification and identification tasks. The results are discussed in detail, supporting the description of future directions.
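To make "local minutiae-based matching" concrete, here is a deliberately simplified Python sketch, not any specific algorithm from the survey: each minutia (x, y, θ) is described by the distances and relative angles to its nearest neighbors, which makes the descriptor invariant to global rotation and translation of the print, and two minutiae sets are scored by greedily matching compatible local structures. Real algorithms add a consolidation step that checks global consistency; all names and tolerances below are our own illustrative choices.

```python
import math

def local_descriptor(m, minutiae, n_neighbors=2):
    """Describe a minutia by distances and relative angles to its nearest
    neighbors -- invariant to global rotation and translation."""
    x, y, theta = m
    others = sorted((mm for mm in minutiae if mm is not m),
                    key=lambda mm: math.hypot(mm[0] - x, mm[1] - y))
    desc = []
    for nx, ny, _ in others[:n_neighbors]:
        d = math.hypot(nx - x, ny - y)
        # angle of the neighbor relative to this minutia's own direction
        a = (math.atan2(ny - y, nx - x) - theta) % (2 * math.pi)
        desc.append((d, a))
    return desc

def match_score(set_a, set_b, d_tol=5.0, a_tol=0.3):
    """Fraction of minutiae in set_a whose local structure has a compatible
    counterpart in set_b (greedy local matching, no consolidation step)."""
    matched = 0
    for ma in set_a:
        da = local_descriptor(ma, set_a)
        for mb in set_b:
            db = local_descriptor(mb, set_b)
            ok = all(abs(d1 - d2) < d_tol and
                     min(abs(a1 - a2), 2 * math.pi - abs(a1 - a2)) < a_tol
                     for (d1, a1), (d2, a2) in zip(da, db))
            if ok and da and db:
                matched += 1
                break
    return matched / len(set_a)
```

Because the descriptor stores only relative geometry, translating or rotating the whole print leaves the score unchanged, which is the key practical advantage of local over global minutiae matching.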
Nowadays, many disciplines must deal with big datasets that additionally involve a high number of features. Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that may deteriorate classification performance. However, traditional methods lack the scalability to cope with datasets of millions of instances and to extract successful results in a limited time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances and learns from them in the map phase; the reduce phase then merges the partial results into a final vector of feature weights, which allows flexible application of the feature selection procedure by using a threshold to determine the selected subset of features. The feature selection method is evaluated with three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets of up to 67 million instances and up to 2,000 attributes have been managed, showing that this is a suitable framework for evolutionary feature selection that improves both classification accuracy and runtime when dealing with big data problems.
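The map/reduce decomposition described in this abstract can be sketched with toy stand-ins: a tiny (1+1) evolution strategy plays the role of the evolutionary feature selector on each block, and the reduce phase averages the per-block binary masks into a weight vector that is then thresholded, mirroring the weight-vector-plus-threshold design the abstract describes. All function names and the nearest-centroid fitness are our own illustrative assumptions, not the paper's algorithm.

```python
import random

def fitness(mask, block):
    """Toy fitness: nearest-centroid training accuracy on selected features."""
    if not any(mask):
        return 0.0
    sums, counts = {}, {}
    for x, y in block:                    # per-class centroids
        fx = [v for v, keep in zip(x, mask) if keep]
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(fx)), fx)]
        counts[y] = counts.get(y, 0) + 1
    cents = {y: [v / counts[y] for v in s] for y, s in sums.items()}
    correct = 0
    for x, y in block:
        fx = [v for v, keep in zip(x, mask) if keep]
        pred = min(cents, key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(cents[c], fx)))
        correct += pred == y
    return correct / len(block)

def evolve_mask(block, n_feats, gens=50, rng=None):
    """(1+1)-ES over binary feature masks: flip bits, keep the child if it
    is at least as fit -- a stand-in for the paper's evolutionary search."""
    rng = rng or random.Random(0)
    best = [rng.random() < 0.5 for _ in range(n_feats)]
    best_fit = fitness(best, block)
    for _ in range(gens):
        child = [b ^ (rng.random() < 1.0 / n_feats) for b in best]
        f = fitness(child, block)
        if f >= best_fit:
            best, best_fit = child, f
    return best

def feature_selection_mr(dataset, n_blocks, threshold=0.5):
    """Map: evolve one mask per block. Reduce: average masks into weights,
    then keep the features whose weight clears the threshold."""
    n_feats = len(dataset[0][0])
    blocks = [dataset[i::n_blocks] for i in range(n_blocks)]
    masks = [evolve_mask(b, n_feats) for b in blocks]                    # map
    weights = [sum(m[j] for m in masks) / n_blocks
               for j in range(n_feats)]                                  # reduce
    return [j for j, w in enumerate(weights) if w >= threshold], weights
```

Because the reduce phase emits weights rather than a single hard mask, the same run supports any threshold, which is the flexibility the abstract highlights.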