2013
DOI: 10.1021/ci4000536

Oversampling to Overcome Overfitting: Exploring the Relationship between Data Set Composition, Molecular Descriptors, and Predictive Modeling Methods

Abstract: The traditional biological assay is very time-consuming, and thus the ability to quickly screen large numbers of compounds against a specific biological target is appealing. To speed up the biological evaluation of compounds, high-throughput screening is widely used in the fields of biomedical, biological information, and drug discovery. The research presented in this study focuses on the use of support vector machines, a machine learning method, various classes of molecular descriptors, and different sampling…

Cited by 44 publications
(51 citation statements)
References 44 publications
“…Strategies proposed for dealing with imbalanced dataset range mainly from affecting specific costs to training set 58,59, re-sampling the training set, either by over-sampling the minority class 60,61, and/or under-sampling the majority class 62,63. Many variants of these techniques exist and have been reviewed by López et al.…”
Section: Introduction (mentioning)
confidence: 99%
“…Although imbalanced data has been used in many studies dealing with soil classification 34,39,66 no such method has, to our knowledge, been applied for legacy soil data from a tropical semi-arid environment. In addition, we compared this method with the random oversampling (ROS) approach 59,60 . Having considered the pruning approach, we hypothesized that instance selection on the majority soil group, along with model-based feature selection, would improve the performance of the RF models and result in a stronger response of the minority soil groups.…”
Section: Introduction (mentioning)
confidence: 99%
“…Under-sampling is suitable for such applications where the number of majority samples is immense and decreasing the training samples will reduce the model training time. However, a drawback with under-sampling that discards samples leads to the loss of information for the majority class [17].…”
Section: Introduction (mentioning)
confidence: 99%
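
The trade-off described in this statement can be illustrated with a minimal sketch of random under-sampling. It is not taken from any of the cited papers: the function name `random_undersample`, the toy data, and the 95:5 class ratio are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class size.

    Hypothetical helper for illustration; not the procedure of any cited study.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y


# Assumed toy data: 95 majority-class (0) and 5 minority-class (1) samples.
X = list(range(100))
y = [0] * 95 + [1] * 5

X_bal, y_bal = random_undersample(X, y)
discarded = len(X) - len(X_bal)  # majority samples whose information is lost
print(Counter(y_bal), "discarded:", discarded)
```

The balanced set is much smaller, which shortens training, but every discarded majority sample is information the model never sees.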
“…Guha et al [8] constructed Random Forest (RF) ensemble models to classify the cell proliferation datasets in PubChem, producing classification rate on the prediction sets in a range between 70% to 85% depending on the nature of datasets and descriptors employed. Chang et al [17] applied the over-sampling technique to explore the relationship between dataset composition, molecular descriptor and predictive modeling method, concluding that SVM models constructed from over-sampled dataset exhibited better predictive ability for the training and external test sets compared to previous results in the literature. Though several proposed methods have successfully countered the imbalanced datasets in PubChem, however, many of the previous works were time consuming in calculation and little work explored the problem of enhancement in the computational efficiency in addition to the statistical performance, which in turn should be largely addressed in the era of big data.…”
Section: Introduction (mentioning)
confidence: 99%
“…In a study of Chang et al., 92 the simple oversampling technique was used to develop SVM models that classify compounds according to predicted cytotoxicity against the Jurkat cell line. It was demonstrated that oversampling of the minority class (toxic compounds) leads to SVM models with better predictive ability for both the training and external test sets, compared to results reported in previous studies.…”
Section: Dealing With Data Imbalance Issues In PubChem Data (mentioning)
confidence: 99%
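
As a rough illustration of what oversampling the minority class can look like, the sketch below duplicates randomly chosen minority samples until the classes are balanced. This is a generic random-oversampling example, not a reproduction of the procedure used by Chang et al.; the function name `random_oversample`, the compound identifiers, and the 17:3 class ratio are assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class.

    Hypothetical helper for illustration; not the method of the cited study.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw duplicates with replacement to fill the gap to the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Assumed toy training set: 17 non-toxic (0) and 3 toxic (1) compounds.
train_X = [f"cmpd_{i}" for i in range(20)]
train_y = [0] * 17 + [1] * 3

bal_X, bal_y = random_oversample(train_X, train_y)
print(Counter(bal_y))
```

In a setup like this, only the training set would normally be resampled, so that an external test set keeps its original class distribution and the evaluation is not inflated by duplicated samples.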