2021
DOI: 10.3390/app11052378

Machine Learning-Based Identification of the Strongest Predictive Variables of Winning and Losing in Belgian Professional Soccer

Abstract: This study aimed to identify the strongest predictive variables of winning and losing in the highest Belgian soccer division. A predictive machine learning model based on a broad range of variables (n = 100) was constructed, using a dataset consisting of 576 games. To avoid multicollinearity and reduce dimensionality, a Variance Inflation Factor filter (threshold of 5) and BorutaShap were applied, respectively. A total of 13 variables remained and were used to predict winning or losing using Extreme Gradient Boosting. …
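The VIF step from the abstract can be sketched in plain NumPy. This is an illustrative reconstruction, not the paper's exact procedure (which feeds the surviving variables onward to BorutaShap and XGBoost): each feature is regressed on the remaining ones, VIF_j = 1 / (1 − R_j²), and the worst offender is dropped until every VIF falls below the threshold of 5. The synthetic data and function name are assumptions.

```python
import numpy as np

def vif_filter(X, threshold=5.0):
    """Iteratively drop the column with the highest Variance Inflation
    Factor until all remaining VIFs fall below the threshold.
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other retained columns via least squares."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        vifs = []
        for j in cols:
            others = [c for c in cols if c != j]
            A = np.column_stack([X[:, others], np.ones(len(X))])
            y = X[:, j]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
            vifs.append(1.0 / max(1.0 - r2, 1e-12))
        if max(vifs) < threshold:
            break
        cols.pop(int(np.argmax(vifs)))  # drop the most collinear column
    return cols

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Fourth column is nearly a copy of the first -> highly collinear pair
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])
kept = vif_filter(X, threshold=5.0)
print(kept)  # one member of the near-duplicate pair is removed
```

Iterative removal (rather than dropping all high-VIF columns at once) matters because deleting one column of a collinear pair typically brings its partner's VIF back under the threshold.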


Cited by 31 publications (27 citation statements)
References 48 publications
“…To prepare the input features for analysis, categorical variables were converted into dummy variables using OneHotEncoder, a scikit-learn (version 1.0.1) preprocessing package in Python. To avoid collinearity effects between the input variables, a Variance Inflation Factor analysis was conducted (threshold = 5) [26].…”
Section: Methods
confidence: 99%
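The encoding step quoted above can be sketched as follows; the variable names are illustrative, not taken from the paper. Note that full one-hot encoding makes a category's dummy columns perfectly collinear (they sum to 1), which a subsequent VIF analysis would flag; `drop='first'` is one common remedy.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical match-context variable (not from the paper)
venue = np.array([["home"], ["away"], ["home"], ["away"]])

enc = OneHotEncoder()                      # returns a sparse matrix by default
dummies = enc.fit_transform(venue).toarray()
print(enc.categories_)                     # categories are sorted: 'away', 'home'
print(dummies)                             # one indicator column per category
```

Passing `OneHotEncoder(drop="first")` instead would emit a single `home` indicator, avoiding the built-in collinearity before the VIF filter runs.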
“…Feature selection approaches are essential components of the model design phase in achieving the optimum performance of a forecast model. The Python-based BorutaShap algorithm effectively eliminates irrelevant and largely redundant features, as shown in a study where it was employed to identify the strongest predictors of winning and losing in Belgian professional soccer [22]. Along with such selective filtering, robust data decomposition schemes such as SWT efficiently accomplish dimensionality reduction of the input variables.…”
Section: Related Work
confidence: 99%
“…It is highly compatible and supports any tree-based learner, such as RF, XGBoost, or decision tree (DT), as the base model [22], [32]. To select the most significant features, the Boruta algorithm creates shadow features (exact replicas) of each feature and shuffles the values in the shadow features to remove their correlations with the response variable [33].…”
Section: B. Wrapper-Based BorutaShap
confidence: 99%
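The shadow-feature mechanism described above can be illustrated in a single round. BorutaShap itself uses SHAP values and repeated trials; the sketch below substitutes impurity-based random-forest importances and one pass, so the threshold rule and all names are simplifying assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400
signal = rng.normal(size=(n, 2))           # two genuinely predictive features
noise = rng.normal(size=(n, 3))            # three irrelevant features
X = np.column_stack([signal, noise])
y = (signal[:, 0] + signal[:, 1] > 0).astype(int)

# Shadow features: column-wise shuffled copies, which destroys any
# correlation with the response while preserving each marginal distribution
shadows = rng.permuted(X, axis=0)
X_aug = np.column_stack([X, shadows])

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
imp = model.feature_importances_
real, shadow = imp[: X.shape[1]], imp[X.shape[1]:]

# Keep only features whose importance beats the best-performing shadow
kept = [j for j in range(X.shape[1]) if real[j] > shadow.max()]
print(kept)
```

The two signal columns should clear the max-shadow bar, while the noise columns score in the same range as their shadows; the full Boruta procedure repeats this comparison over many iterations and applies a statistical test rather than a single cutoff.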
“…Such datasets, i.e., high-dimension low-sample size (HDLSS), are very common in clinical settings [36], [37] and are known to present several statistical challenges [38][39][40][41]. Machine learning (ML) techniques offer several tools to handle these challenges and have been used extensively for high-dimensionality problems [42][43][44][45][46][47]. When the sample size is small, feature selection is a crucial data preprocessing step that allows choosing the variables that contribute the most to the target effect while minimizing possible redundancies.…”
Section: Introduction
confidence: 99%