Training and assessing classification rules with imbalanced data

Menardi, Giovanna; Torelli, Nicola

doi:10.1007/s10618-012-0295-5

Cited by 568 publications

(300 citation statements)

References 48 publications

Supporting

Mentioning

294

Contrasting

Unclassified


“…To design a robust predictive model, a balanced dataset is used to avoid possible bias caused by a majority class. To show the effectiveness of utilizing the balanced dataset, a comparative study was performed by measuring false negatives (FN) and false positives (FP) [50]. We measured Type I and II errors (i.e false positive and false negative, respectively) when using the balanced and imbalanced datasets.…”

Section: Results (mentioning)

confidence: 99%
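The comparison described in the statement above (counting false positives and false negatives on balanced versus imbalanced data) can be illustrated with a minimal sketch. The function names `confusion_counts` and `error_rates`, and the toy labels, are hypothetical and not taken from the cited study; the point is only to show why a classifier biased toward the majority class can look accurate while missing every minority case.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # Type I error: predicted positive, actually negative
        elif truth == positive:
            fn += 1  # Type II error: predicted negative, actually positive
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def error_rates(counts):
    """Derive accuracy, false positive rate and false negative rate."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy imbalanced data (assumed for illustration): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A classifier that always predicts the majority class.
y_pred = [0] * 10

counts = confusion_counts(y_true, y_pred)
rates = error_rates(counts)
print(counts)
print(rates)
```

In this toy case the accuracy is 0.8 even though every positive example is missed, which is why the false negative count is reported alongside accuracy.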


Designing an Internet Traffic Predictive Model by Applying a Signal Processing Method

Choi

Jeong

2014

J Netw Syst Manage


Detection of abnormal internet traffic has become a significant area of research in network security. Due to its importance, many predictive models are designed by utilizing machine learning algorithms. The models are well designed to show high performances in detecting abnormal internet traffic behaviors. However, they may not guarantee reliable detection performances for new incoming abnormal internet traffic because they are designed using raw features from imbalanced internet traffic data. Since internet traffic is non-stationary time-series data, it is difficult to identify abnormal internet traffic with the raw features. In this study, we propose a new approach to detecting abnormal internet traffic. Our approach begins with extracting hidden, but important, features by utilizing discrete wavelet transformation. Then, statistical analysis is performed to filter out irrelevant and less important features. Only statistically significant features are used to design a reliable predictive model with logistic regression. A comparative analysis is conducted to determine the importance of our approach by measuring accuracy, sensitivity, and the Area Under the receiver operating characteristic Curve. From the analysis, we found that our model detects abnormal internet traffic successfully with high accuracy.


“…The predictive model with the DWT features provided outperformed results in accuracy, sensitivity, specificity, and AUC. Among the four sliding window sizes (25,50,100, and 150 data points), the 150 data points sliding window showed a better performance than others.…”

Section: Classification Performance Comparison (mentioning)

confidence: 94%

Designing an Internet Traffic Predictive Model by Applying a Signal Processing Method

Choi

Jeong

2014

J Netw Syst Manage


“…To avoid this problem, we apply to the training dataset the sampling approach proposed in ROSE [51] that down-samples the majority class and synthesizes new examples in the minority class.…”

Section: Methods (mentioning)

confidence: 99%
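The resampling idea mentioned in the statement above can be sketched roughly as a smoothed bootstrap: pick a class with equal probability, pick one observed example of that class, and draw a new point from a kernel centred on it. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `rose_like_sample` is hypothetical, the kernel is assumed Gaussian with a diagonal bandwidth, and the bandwidth uses a normal-reference rule of thumb, `(4 / ((d + 2) * n)) ** (1 / (d + 4))` times the per-class feature standard deviation.

```python
import random
import statistics


def rose_like_sample(X, y, n_new, seed=0):
    """Generate n_new synthetic examples with a roughly balanced class mix.

    Assumptions (not from the source): Gaussian kernel, diagonal bandwidth,
    normal-reference rule of thumb computed separately for each class.
    """
    rng = random.Random(seed)

    # Group the observed rows by class label.
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    labels = sorted(by_class)
    d = len(X[0])

    # One bandwidth per class and per feature.
    bandwidths = {}
    for label, rows in by_class.items():
        n = len(rows)
        factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidths[label] = [
            factor * (statistics.stdev(col) if n > 1 else 0.0)
            for col in zip(*rows)
        ]

    X_new, y_new = [], []
    for _ in range(n_new):
        label = rng.choice(labels)            # each class equally likely
        center = rng.choice(by_class[label])  # bootstrap one observed row
        h = bandwidths[label]
        # Perturb the chosen row with kernel noise (the "smoothing" step).
        X_new.append([rng.gauss(c, s) for c, s in zip(center, h)])
        y_new.append(label)
    return X_new, y_new


# Toy imbalanced data (assumed for illustration): 8 majority, 2 minority rows.
X = [[0.1, 1.0], [0.3, 0.8], [0.2, 1.1], [0.0, 0.9],
     [0.4, 1.2], [0.25, 1.05], [0.15, 0.95], [0.35, 0.85],
     [2.0, 3.0], [2.2, 3.1]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = rose_like_sample(X, y, n_new=1000, seed=42)
print(len(X_bal), sum(y_bal))
```

Because the class is drawn first with probability one half, the synthetic set is approximately balanced regardless of the original class ratio; the majority class is effectively thinned and the minority class is enriched with new, perturbed points.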

Identification of long non-coding transcripts with feature selection: a comparative study

et al. 2017


Background: The unveiling of long non-coding RNAs as important gene regulators in many biological contexts has increased the demand for efficient and robust computational methods to identify novel long non-coding RNAs from transcripts assembled with high throughput RNA-seq data. Several classes of sequence-based features have been proposed to distinguish between coding and non-coding transcripts. Among them, open reading frame, conservation scores, nucleotide arrangements, and RNA secondary structure have been used with success in literature to recognize intergenic long non-coding RNAs, a particular subclass of non-coding RNAs. Results: In this paper we perform a systematic assessment of a wide collection of features extracted from sequence data. We use most of the features proposed in the literature, and we include, as a novel set of features, the occurrence of repeats contained in transposable elements. The aim is to detect signatures (groups of features) able to distinguish long non-coding transcripts from other classes, both protein-coding and non-coding. We evaluate different feature selection algorithms, test for signature stability, and evaluate the prediction ability of a signature with a machine learning algorithm. The study reveals different signatures in human, mouse, and zebrafish, highlighting that some features are shared among species, while others tend to be species-specific. Compared to coding potential tools and similar supervised approaches, including novel signatures, such as those identified here, in a machine learning algorithm improves the prediction performance, in terms of area under precision and recall curve, by 1 to 24%, depending on the species and on the signature. Conclusions: Understanding which features are best suited for the prediction of long non-coding RNAs allows for the development of more effective automatic annotation pipelines especially relevant for poorly annotated genomes, such as zebrafish. We provide a web tool that recognizes novel long non-coding RNAs with the obtained signatures from fasta and gtf formats. The tool is available at the following url: http://www.bioinformatics-sannio.org/software/. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1594-z) contains supplementary material, which is available to authorized users.


“…They also applied SVM-based classifiers, when the imbalance is extreme, novelty detectors are more accurate than balanced and unbalanced binary classifiers. Giovanna Menardi [12] et al, have discussed the effects of class imbalance on model training and model assessing. A unified and systematic framework for dealing with both the problems is proposed, based on a smoothed bootstrap re-sampling technique.…”

Section: Current Approaches In Decision Trees (mentioning)

confidence: 99%

An improved approach on class imbalance data using within-class minority oversampling technique

2016

IJLTET

