Multiple Imputation Through XGBoost

Deng, Yongshi; Lumley, Thomas

doi:10.48550/arxiv.2106.01574

Cited by 4 publications

(5 citation statements)

References 20 publications

“…MICE with default settings (van Buuren and Groothuis-Oudshoorn, 2011) would produce unsatisfactory results unless users manually specify any potential non-linear or interaction effects in the imputation model for each incomplete variable. However, researchers often use MICE in an automated way (Deng and Lumley, 2021).…”

Section: MICE

mentioning

confidence: 99%

“…Nonparametric imputation can avoid selecting a distribution by using machine learning. missForest (Stekhoven and Bühlmann, 2012) uses a random forest and mixgb (Deng and Lumley, 2021) is based on XGBoost for the imputation.…”

Section: Introduction

mentioning

confidence: 99%

“…imputed data matrix and a previous imputed data matrix, respectively. 3.2.2. mixgb. mixgb (Deng and Lumley, 2021) is an automated and fast multiple imputation through XGBoost. It can help automatically capture complex relations among variables and tackle the computational bottleneck problem of existing imputation methods.…”

mentioning

confidence: 99%

A comparison of imputation methods using machine learning models

Suh

Song

2023

Communications for Statistical Applications and Methods

Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares the imputation techniques for datasets with mixed datatypes in various situations, such as data size, missing ratios, and missing mechanisms. To evaluate the performance of each method in mixed datasets, we propose a new imputation performance measure (IPM) that is a unified measurement applicable to numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results with imputation performances and computational times.

“…This is also true for IRMI [26] and imputeRobust [24] from R package VIM. Imputation with random forests from R package ranger [16] and especially imputation with XG-Boost using the R package mixgb [9] are outperformed by GAM methods. GAMLSS with normal distribution (NO) performs better than with assumed t-distribution (TF).…”

Section: Visual Comparison Of a Single Imputation

mentioning

confidence: 99%

Robust Multiple Imputation with GAM

Templ

2023

Preprint

Multiple imputation of missing values is a key step in data analytics and a standard process in data mining. Non-linear imputation methods come into play whenever the relationship between a response and predictors cannot be linearized. One popular class of non-linear methods is Generalized Additive Models (GAM) and an extension of GAM, namely GAMLSS, where each parameter of the distribution (e.g., mean, variance, skewness, kurtosis) can be modeled as a function of predictors. However, non-robust methods such as standard GAMs and GAMLSS can be swayed by outliers, leading to outlier-driven imputations. This applies to both representative outliers (true yet unusual values in the population) and non-representative outliers, which are mere measurement errors. Robust (imputation) methods effectively manage outliers and exhibit remarkable resistance to their influence, providing a more reliable approach to dealing with missing data. A new robust imputation algorithm is introduced. This innovative solution addresses three significant challenges with robustness. (1) It uses a robust bootstrap to manage model uncertainty when imputing a random sample, (2) it incorporates robust fitting to reinforce accuracy, and (3) it takes into account imputation uncertainty in a resilient manner. Furthermore, any complex model for any variable with missingness can be considered and run through the algorithm. For the employed real-world datasets and the conducted simulation study, the novel algorithm imputeRobust demonstrates superior performance in comparison to other prevalent methods.

“…Extreme gradient boosting (XGBoost) can also be used to use bootstrapping and predictive mean matching for the imputation of missing data [19]. When used under fully conditional specification (FCS), XGBoost imputation models are developed for each incomplete parameter.…”

Section: Extreme Gradient Boosting

mentioning

confidence: 99%

Imputing Missing Data in Hourly Traffic Counts

Shafique

2022

Sensors

Hourly traffic volumes, collected by automatic traffic recorders (ATRs), are of paramount importance since they are used to calculate average annual daily traffic (AADT) and design hourly volume (DHV). Hence, it is necessary to ensure the quality of the collected data. Unfortunately, ATRs malfunction occasionally, resulting in missing data, as well as unreliable counts. This naturally has an impact on the accuracy of the key parameters derived from the hourly counts. This study aims to solve this problem. ATR data from New South Wales, Australia was screened for irregularities and invalid entries. A total of 25% of the reliable data was randomly selected to test thirteen different imputation methods. Two scenarios for data omission, i.e., 25% and 100%, were analyzed. Results indicated that missForest outperformed other imputation methods; hence, it was used to impute the actual missing data to complete the dataset. AADT values were calculated from both original counts before imputation and completed counts after imputation. AADT values from imputed data were slightly higher. The average daily volumes when plotted validated the quality of imputed data, as the annual trends demonstrated a relatively better fit.
