2020
DOI: 10.1016/j.procs.2020.08.043
A Resampling Method for Imbalanced Datasets Considering Noise and Overlap

Cited by 43 publications (24 citation statements)
References 16 publications
“…For finding the hyperparameters described in Table 2, we followed this process: (1) the ML algorithm was selected: Random Forest, Decision Tree, MLP, CNN, Logistic Regression, or SVM; (2) the hyperparameter range for the selected ML algorithm was established, with different values for every hyperparameter so that several combinations could be explored; (3) techniques for imbalanced datasets were established, of two types: data-level (random undersampling, random oversampling, and SMOTE-Tomek [50]) and algorithm-level (balanced weights and cost-sensitive); and (4) for each imbalanced-data technique, a hyperparameter search was implemented using the selected ML algorithm, its hyperparameter range, stratified k-fold cross-validation, and balanced accuracy for evaluation. The ML algorithm was trained with the first hyperparameter combination using stratified k-fold cross-validation over the training set.…”
Section: Methods
confidence: 99%
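The search procedure quoted above can be sketched with scikit-learn; the parameter ranges below are assumptions for illustration, not those of the cited paper, and the algorithm-level technique is represented by `class_weight="balanced"`:

```python
# Minimal sketch of the described search: stratified k-fold CV scored by
# balanced accuracy, with an algorithm-level imbalance technique
# (balanced class weights). Parameter ranges here are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced dataset (roughly 9:1 class ratio)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # assumed ranges
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # evaluation metric named in the text
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The data-level alternatives the passage lists (random under/oversampling, SMOTE-Tomek) would replace `class_weight` with a resampling step applied to each training fold, e.g. via the `imbalanced-learn` package's `SMOTETomek`.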
“…The MLP comprises three dense layers using the SMOTE-Tomek technique [50] with the following hyperparameters: five neurons in each of the first two layers and one neuron at the output layer, since the problem is binary. ReLU was the activation function for the first two layers, and a sigmoid function was used at the output layer.…”
Section: Methods
confidence: 99%
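A minimal sketch of that architecture, assuming scikit-learn's `MLPClassifier` (which applies a logistic/sigmoid output for binary targets automatically); the SMOTE-Tomek resampling step mentioned in the quote is omitted here:

```python
# Sketch of the cited MLP: two hidden layers of five ReLU units and a
# single sigmoid output unit for the binary task. Resampling omitted.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=1)
mlp = MLPClassifier(
    hidden_layer_sizes=(5, 5),  # five neurons in each of the two hidden layers
    activation="relu",          # ReLU on the hidden layers
    max_iter=1000,
    random_state=1,
)  # for binary y, scikit-learn uses a logistic (sigmoid) output unit
mlp.fit(X, y)
print(mlp.predict(X[:3]))
```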
“…(Author et al., A real-life machine learning experience for predicting school dropout at different stages) …techniques introduced above), there are 783 instances corresponding to students who drop out of college and 635 to students who complete their studies. To distribute the classes equally with a 1:1 ratio, we use a combination of the SMOTE and Tomek Links methods [21], [22]. These take observations from the dataset and look for each observation's nearest neighbours.…”
Section: Resampling
confidence: 99%
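The neighbour search underlying the Tomek Links half of that combination can be illustrated as follows; this is a toy sketch of the idea, not the cited authors' implementation. A Tomek link is a pair of opposite-class samples that are each other's nearest neighbour, and the majority-class member is typically removed:

```python
# Toy illustration of Tomek-link detection: find mutual nearest-neighbour
# pairs whose class labels differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [1.0], [1.05], [2.0]])
y = np.array([0, 0, 0, 1, 1])

# n_neighbors=2 because each point's first neighbour is itself
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)
nearest = idx[:, 1]  # index of each point's nearest other point

# A pair (i, j) is a Tomek link if the relation is mutual and classes differ
tomek_pairs = [(i, j) for i, j in enumerate(nearest)
               if nearest[j] == i and y[i] != y[j] and i < j]
print(tomek_pairs)  # → [(2, 3)]: the borderline majority/minority pair
```

SMOTE uses the same nearest-neighbour machinery in the opposite direction, interpolating new minority samples between a minority point and its minority-class neighbours; combining the two yields the SMOTE + Tomek Links procedure the passage describes.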