2020
DOI: 10.1186/s12874-020-01080-1

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Abstract: Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions. Methods: To examine the effects of th…

Cited by 140 publications (83 citation statements)
References 18 publications
“…However, as the missing percentage grew, datasets imputed using missForest quickly became the most biased. This finding is consistent with previous evidence that implementing individual tree estimation through missForest could systematically lead to biased estimates, especially for non-normal, skewed data such as the count data that we had in our study [30]. Further, we found that the bias in imputed values became more aggravated as more missing values were introduced into the data.…”
Section: Discussion (supporting)
confidence: 92%
“…For example, Waljee et al [4] have demonstrated the superiority of the local random forest method (e.g., missForest) in imputing missing laboratory values, while Hong & Lynn [5] have pointed out that the use of imputed variables from random forest-based approaches could lead to severely biased inference in a simulation study. Studies in other settings suggest that the results generated from multiple imputation are unbiased and could more closely mimic the true data [6,7].…”
Section: Introduction (mentioning)
confidence: 99%
“…However, the performance of these methods is not well stated. It depends on the presence and relevance of possible interaction effects and on the correlation structure of the data, and it could be quite poor when data are highly skewed [42][43][44]. It would be interesting to investigate the robustness to departure from the MAR assumption for multiple imputations approaches based on recursive partitioning.…”
Section: Discussion (mentioning)
confidence: 99%
“…Moreover, the algorithm is able to manage missing values which are common in clinical studies. The "on-the-fly-imputation" algorithm (Hong and Lynn, 2020) imputes data when it grows the forest handling interactions and non-linearity in the dataset.…”
Section: Methods (mentioning)
confidence: 99%