Objective: To determine how the estimated performance of a machine learning model varies according to how a dataset is split into training and test sets, using brain tumor radiomics data under different conditions.

Materials and Methods: Two binary tasks of different difficulty ('simple' task, glioblastoma [GBM, n=109] vs. brain metastasis [n=58]; 'difficult' task, low-grade [n=163] vs. high-grade [n=95] meningiomas) were performed using radiomics features from magnetic resonance imaging (MRI). For each of 1,000 different training-test set splits with a 7:3 ratio, a least absolute shrinkage and selection operator (LASSO) model was trained with 5-fold cross-validation (CV) in the training set and evaluated in the test set. Model stability and performance were assessed according to the number of input features (from 1 to 50), the sample size (full vs. undersampled), and the level of task difficulty. In addition to 5-fold CV without repetition, three other CV methods were compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions.

Results: The highest mean cross-validated area under the receiver operating characteristic curve (AUC) and the highest stability (lowest AUC difference between training and testing) were achieved with 6 and 13 features for the GBM and meningioma tasks, respectively. For the simple task, simple task with undersampling, difficult task, and difficult task with undersampling, average mean AUCs were 0.947, 0.923, 0.795, and 0.764, and average AUC differences between training and testing were 0.029, 0.054, 0.053, and 0.108, respectively.
Among the four CV methods, the most conservative (i.e., lowest AUC and highest relative standard deviation [RSD]) was nested CV with 100 repetitions.

Conclusions: A single random split of a dataset into training and test sets may yield an unreliable estimate of model performance in radiomics machine learning studies; reporting the mean and standard deviation of performance metrics obtained from nested and/or repeated CV on the entire dataset is therefore recommended.
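The split-dependence the abstract describes can be reproduced in a few lines. The sketch below (not the authors' code; it uses synthetic data as a hypothetical stand-in for the radiomics feature matrix, and 50 splits instead of 1,000 for speed) repeatedly draws a stratified 7:3 train/test split, fits an L1-penalized (LASSO-style) logistic regression whose penalty strength is tuned by 5-fold CV on the training set, and reports the spread of test-set AUCs across splits:

```python
# Illustrative sketch of repeated random-split evaluation (assumptions:
# synthetic data, 50 splits instead of 1,000, L1 logistic regression as
# the LASSO-style classifier).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a radiomics feature matrix (binary task,
# sample size matching the 'simple' task: 109 + 58 = 167).
X, y = make_classification(n_samples=167, n_features=50, n_informative=6,
                           random_state=0)

aucs = []
for seed in range(50):
    # One random 7:3 training-test split per trial.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # 5-fold CV on the training set selects the L1 penalty strength.
    model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                                 solver="liblinear", scoring="roc_auc",
                                 max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

aucs = np.array(aucs)
print(f"test AUC: mean={aucs.mean():.3f}, sd={aucs.std():.3f}, "
      f"range=[{aucs.min():.3f}, {aucs.max():.3f}]")
```

The range printed at the end is the point of the exercise: a single split reports only one value drawn from that distribution, whereas the mean and standard deviation over many splits (or over repeated/nested CV folds) characterize it.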