2012
DOI: 10.1007/s13748-011-0004-4

Scaling up data mining algorithms: review and taxonomy

Abstract: The overwhelming amount of data that are now available in any field of research poses new problems for data mining and knowledge discovery methods. Due to this huge amount of data, most of the current data mining algorithms are inapplicable to many real-world problems. Data mining algorithms become ineffective when the problem size becomes very large. In many cases, the demands of the algorithm in terms of the running time are very large, and mining methods cannot be applied when the problem grows. This aspect…

Cited by 30 publications (10 citation statements)
References 112 publications
“…The main problem of the method described above is the scalability [24]. When we deal with a large dataset, the cost of the RCGA is high.…”
Section: Constructing Supervised Projections Using a RCGA
Citation type: mentioning
Confidence: 99%
“…Big data, so far, does not have a formal definition, although it is generally accepted that the concept refers to datasets that are too large to be processed using conventional data processing tools and techniques. Contemporary information systems produce data in huge quantities that are difficult to measure [1]. It means that we already find ourselves in the "big data era," and the question of how to solve large-scale machine learning problems is open and requires a lot of research effort.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…It is usually assumed in the literature that linear-time algorithms are acceptable for scaling up to large datasets [8]. We also adopt this assumption.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…8 show the results for the SVHN and XM2VTS datasets when using spherical hashing to speed up the proposed methods. Here we name this version LSH, for Limited Spherical Hashing, since we do not use the original procedure. Instead, we use a limited version that relies on the dissimilarities rather than on the binary codification, since the latter increases the classification error, while the use of dissimilarities still provides…”
Figure captions: Accuracy and execution time results when using LSH to speed up the proposed prototype selection methods on the SVHN dataset; accuracy and execution time results when using SH to speed up the proposed prototype selection methods on the XM2VTS dataset.
Citation type: mentioning
Confidence: 99%
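The excerpt above contrasts standard spherical hashing (binary codes: inside/outside each hypersphere) with a "limited" variant that keeps the raw point-to-pivot dissimilarities. The following is a minimal sketch of that distinction, assuming Euclidean dissimilarities, randomly sampled sphere centers, and a fixed radius; all names and shapes are illustrative, not the cited authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # toy dataset: 100 points, 16 features (hypothetical)

# Pick 8 pivots (sphere centers) from the data and fix a common radius.
pivots = X[rng.choice(len(X), size=8, replace=False)]
radius = 1.0

# Dissimilarities: Euclidean distance from every point to every pivot.
# Shape (100, 8): one distance per (point, pivot) pair.
d = np.linalg.norm(X[:, None, :] - pivots[None, :, :], axis=2)

# Standard spherical hashing binarizes the distances (inside sphere -> 1).
binary_codes = (d <= radius).astype(np.uint8)

# The "limited" variant keeps the real-valued dissimilarities as the
# reduced representation instead of collapsing them to bits.
dissim_repr = d
```

Either representation reduces the data from 16 dimensions to 8 values per point; binarization is cheaper to store and compare, while the raw dissimilarities retain more information, which is the trade-off the quoted passage attributes to classification error.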