2019
DOI: 10.1016/j.ins.2018.11.002

Subclass-based semi-random data partitioning for improving sample representativeness

Abstract: In machine learning tasks, it is essential for a data set to be partitioned into a training set and a test set in a specific ratio. In this context, the training set is used for learning a model for making predictions on new instances, whereas the test set is used for evaluating the prediction accuracy of a model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions f…
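To make the partitioning idea concrete, here is a minimal sketch of a subclass-based semi-random split, assuming it behaves like stratified sampling at the subclass level: instances are grouped by subclass, and the selection inside each group is random so that both sets cover every subclass in the chosen ratio. The helper `subclass_of` and the 0.7 ratio are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def semi_random_split(instances, subclass_of, train_ratio=0.7, seed=0):
    """Split instances so every subclass is represented in both sets.

    subclass_of: hypothetical helper mapping an instance to its subclass label.
    Selection inside each subclass is random, hence "semi-random".
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for x in instances:
        groups[subclass_of(x)].append(x)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)                          # random part: order within a subclass
        cut = int(round(train_ratio * len(members)))  # deterministic part: per-subclass ratio
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Illustrative usage: subclasses encoded as (class, cluster) pairs
data = [("a", 1, i) for i in range(10)] + [("a", 2, i) for i in range(6)] + [("b", 1, i) for i in range(8)]
train, test = semi_random_split(data, subclass_of=lambda x: (x[0], x[1]))
```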

Cited by 8 publications (5 citation statements) | References 36 publications

Citation statements
“…Therefore, the Pearson Correlation method was selected as our feature selection method. According to the features selected through the Pearson correlation analysis, the selected machine-learning algorithms were tested based on a semi-random [32,33] partitioning framework. The performance evaluation of these models according to their accuracy showed that Logistic Regression and SVM with equal accuracy of 95.8% demonstrated the highest performance.…”
Section: Results Modeling (mentioning)
confidence: 99%
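The pipeline described in this statement can be sketched as follows, assuming a plain correlation threshold and an ordinary random split in place of the cited semi-random partitioning framework; the synthetic data, the 0.2 cut-off, and the 70/30 split are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # synthetic stand-in data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Pearson correlation of each feature with the target, then threshold
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
X_sel = X[:, np.abs(corr) >= 0.2]                    # assumed cut-off

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
for name, model in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                    ("SVM", SVC())]:
    print(name, "accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))
```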
“…In addition, the results of the ROC curve investigation as shown in Figure 2 indicate that both SVM and Logistic Regression have almost the same AUC at 0.98. […] The performance evaluation of these models according to their accuracy showed that Logistic Regression and SVM with equal accuracy of 95.8% demonstrated the highest performance.…”
Section: Results Modeling (mentioning)
confidence: 99%
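For the AUC comparison mentioned here, a self-contained sketch (on synthetic data, not the study's) could look like the following; the classifiers are scikit-learn defaults rather than the study's tuned models.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.7, size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Score both classifiers with ROC AUC on the held-out split
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
svm = SVC().fit(X_tr, y_tr)
print("LR  AUC:", roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]))
print("SVM AUC:", roc_auc_score(y_te, svm.decision_function(X_te)))
```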
“…The percentage of data used in each dataset is variable. Some studies suggest that the best model performance is achieved when the training and testing databases account for 70% and 30% of the original dataset, respectively (Liu et al 2019); other authors concluded that the percentage of data corresponding to the training database can be as high as 80% (Gholamy et al 2018) or between 50% and 70% (Xu and Goodacre 2018). In this paper, we considered a ratio of 3 to 1 for the time periods corresponding to the training and test catalogues, respectively, at both national and regional levels.…”
Section: Application To Italy (mentioning)
confidence: 99%
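The 3-to-1 time-period split described in this statement can be sketched as a simple cutoff on the catalogue's time span; the DataFrame layout and the column name `origin_time` are assumptions for illustration, not details from the cited work.

```python
import pandas as pd

def split_catalogue(catalogue: pd.DataFrame, time_col: str = "origin_time",
                    train_fraction: float = 0.75):
    """Return (training, test) catalogues split 3:1 by time period."""
    cat = catalogue.sort_values(time_col)
    t0, t1 = cat[time_col].iloc[0], cat[time_col].iloc[-1]
    cutoff = t0 + train_fraction * (t1 - t0)         # first 75% of the time span
    return cat[cat[time_col] <= cutoff], cat[cat[time_col] > cutoff]
```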
“…For this nationwide multicenter cohort study, various levels of care centers with different medical roles were included. Random data partitioning by mixing all the data can omit the characteristics of each center and impair sample representativeness [25]. Therefore, we performed center-based semi-random data partitioning, only considering the balance of the number of clinical events between development and validation sets.…”
Section: Dataset Partitioning For Multicenter Validation (mentioning)
confidence: 99%
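Center-based semi-random partitioning as described here could be approximated by assigning whole centers to one set or the other: the centers are shuffled randomly and then allocated greedily so that the clinical-event counts approach a target ratio. This is an illustrative reconstruction under those assumptions, not the study's actual procedure.

```python
import random

def split_by_center(events_per_center, dev_fraction=0.7, seed=0):
    """events_per_center: dict mapping a center id to its number of clinical events."""
    rng = random.Random(seed)
    centers = list(events_per_center)
    rng.shuffle(centers)                             # random part: order of centers

    target = dev_fraction * sum(events_per_center.values())
    dev, val, dev_events = [], [], 0
    for c in centers:                                # deterministic part: balance event counts
        if dev_events < target:
            dev.append(c)
            dev_events += events_per_center[c]
        else:
            val.append(c)
    return dev, val
```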