Background: Owing to technological advances, the volume of data that must be interpreted has grown exponentially, and data collected in many fields inherently consist of highly correlated measurements. This problem, known as multicollinearity, degrades the performance of both statistical and machine learning algorithms. Statistical models proposed as potential remedies for this problem have not been sufficiently evaluated in the literature. A comprehensive comparison of statistical and machine learning models is therefore needed to address the multicollinearity problem.
Methods: Four statistical models (Ridge, Liu, Lasso, and Elastic Net regression) and eight widely used machine learning algorithms (CART, KNN, MLP, MARS, Cubist, SVM, Bagging, and XGBoost) are comprehensively compared on two healthcare datasets (Body Fat and Cancer) that exhibit multicollinearity. Model performance is assessed through cross-validation using the root mean square error (RMSE), mean absolute error (MAE), and R-squared criteria, as sketched in the example below.
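For illustration, the evaluation protocol can be sketched as follows. This is a minimal sketch assuming scikit-learn is available: the synthetic data (a low effective rank mimics multicollinearity) stands in for the Body Fat and Cancer datasets, the Liu estimator is a textbook implementation of Liu's (1993) formula with an illustrative fixed biasing parameter d rather than a tuned value, and MLP, MARS, Cubist, and XGBoost are omitted because they require additional libraries.

```python
# Minimal sketch of the comparison protocol, assuming scikit-learn.
# Synthetic data stands in for the Body Fat / Cancer datasets;
# the fixed Liu parameter d=0.5 is an illustrative assumption.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


class LiuRegression(BaseEstimator, RegressorMixin):
    """Liu (1993) estimator: beta(d) = (X'X + I)^(-1) (X'y + d * beta_OLS).

    Assumes standardized predictors and a roughly centered response,
    so no intercept is fitted in this sketch.
    """

    def __init__(self, d=0.5):
        self.d = d

    def fit(self, X, y):
        XtX = X.T @ X
        beta_ols = np.linalg.solve(XtX, X.T @ y)  # ordinary least squares
        self.coef_ = np.linalg.solve(
            XtX + np.eye(X.shape[1]), X.T @ y + self.d * beta_ols
        )
        return self

    def predict(self, X):
        return X @ self.coef_


# Highly correlated predictors (low effective rank) mimic multicollinearity.
X, y = make_regression(n_samples=300, n_features=20, effective_rank=5,
                       noise=10.0, random_state=42)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Liu": LiuRegression(d=0.5),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "CART": DecisionTreeRegressor(random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVM": SVR(),
    "Bagging": BaggingRegressor(random_state=42),
}

# RMSE, MAE, and R-squared, averaged over 10 cross-validation folds.
scoring = {"rmse": "neg_root_mean_squared_error",
           "mae": "neg_mean_absolute_error",
           "r2": "r2"}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    cv = cross_validate(pipe, X, y, cv=10, scoring=scoring)
    print(f"{name:>10}  RMSE={-cv['test_rmse'].mean():7.2f}  "
          f"MAE={-cv['test_mae'].mean():7.2f}  R2={cv['test_r2'].mean():.3f}")
```

Wrapping every model in a StandardScaler pipeline keeps the comparison fair, since penalized estimators such as Ridge, Liu, Lasso, and Elastic Net are sensitive to the scale of the predictors.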
Results: Statistical models outperformed machine learning models on the RMSE, MAE, and R-squared criteria in both training and testing. In particular, Liu regression often achieved the best relative performance, both among the regression methods and against the machine learning algorithms, with improvements of 7.60% to 46.08% (Body Fat) and 1.55% to 21.53% (Cancer) in training, and 1.56% to 38.08% (Body Fat) and 3.50% to 23.29% (Cancer) in testing.
Conclusions: Liu regression has been largely disregarded in the machine learning literature, yet it outperformed the most powerful and widely used machine learning algorithms in this study. It therefore appears to be a promising tool in almost all fields, especially for regression-based studies involving data with multicollinearity.