A Survey On Missing Data in Machine Learning

Emmanuel, Tlamelo; Maupong, Thabiso; Mpoeleng, Dimane; Semong, Thabo; Mphago, Banyatsang; Tabona, Oteng

doi:10.21203/rs.3.rs-535520/v1

Cited by 7 publications

(14 citation statements)

References 106 publications

“…The common reasons for missing values (MVs) are diverse, including respondents in the household survey may refuse to report income; in industry experiments, some results are missing because of mechanical failures unrelated to the experimental process; in medical experiments, some participants drop out because of drug allergies, deaths or other reasons [1]. To sum up, these reasons can be roughly divided into four types, including (1) human mistakes when processing data, (2) machine error caused by equipment malfunction, (3) respondents' refusal to answer specific questions, (4) drop-out from studies and merging unrelated data [2][3][4]. Missing data is unavoidable, despite the fact that we are all aware that gathering as much data as possible is the ideal strategy for data analysis.…”

Section: Introduction (mentioning)

confidence: 99%

“…They outlined a few issues with these studies, including the small size of experimental datasets, and the lack of attention to missing mechanisms. Recently, Emmanuel et al [2] compiled some literature with a focus on machine learning methods. They tested with the KNN and random forest (RF) imputation techniques at the same time, however, they only employed two tiny datasets, the Iris and ID fan datasets [16].…”

Section: Introduction (mentioning)

confidence: 99%

A review on missing values for main challenges and methods

Ren, Wang, Sekhari Seklouli et al. 2023

Information Systems

“…This study aims to avoid this issue by using four imputation techniques based on kNN, sliding-windows (SW), regression (RI) and support vector machine-basis (SVMI) algorithms. Note that the mentioned imputation techniques are recently used in the literature (see (Malarvizhi & Thanamani, 2012) for kNN imputation, (Emmanuel et al, 2021) for SVMI, (Doreswamy & Manjunatha, 2017) for RI) and developed by the compilation of the missing data in general. In this paper, those methods are adapted to the right-censored data and the modelling procedure.…”

Section: Introduction (mentioning)

confidence: 99%

Regression with right-censored high-dimensional data: An application with different imputation techniques

Yılmaz, Aydın, Ahmed 2022

KJS publishes peer-review articles in Mathematics, Computer Sci

This study aims to introduce four modified linear estimators for right-censored high-dimensional data. The data of interest pose two important problems: censorship and high dimensionality. This paper is distinguished from other studies in the literature in that it handles these two problems simultaneously. The main contribution of the paper is merging the weighted ridge (WR) method with imputation techniques to obtain more efficient estimators than its alternatives. To solve the censorship problem, four imputation techniques are considered, based on the machine learning algorithms kNN, sliding windows, regression and support vector machines. The high-dimensionality problem is handled by the weighted ridge approach, which provides an estimator with less risk than its alternatives because it detects the covariates with a weak contribution via the post-selection procedure. To show the empirical performance of the introduced estimators, a simulation study is carried out and comparative results are presented. The results show that the kNN- and regression-imputation-based WR estimators perform satisfactorily in estimating the high-dimensional right-censored model.

“…For example, the training of a feedforward neural network requires complete inputs in order for the hidden layers to feed forward valid inputs during the forward pass and then update the weights appropriately during the backpropagation step (M. L. Brown & Kros, 2003). Therefore, it is not immediately obvious how one would use such a model in the second category of techniques (i.e., without interpolation of the gaps) and this remains an open problem in the machine learning community (Caiafa et al., 2021; Emmanuel et al., 2021; Sharpe & Solly, 1995). One solution is to use a Cosine Neural Network (Randolph‐Gips, 2008).…”

Section: Introduction (mentioning)

confidence: 99%

Exploring the Potential of Neural Networks to Predict Statistics of Solar Wind Turbulence

et al. 2022

Time series data sets often have missing or corrupted entries, which need to be handled in subsequent data analysis. For example, in the context of space physics, calibration issues, satellite telemetry issues, and unexpected events can make parts of a time series unusable. This causes problems for understanding the dynamics of the heliosphere and space weather environment. Various approaches exist to tackle this problem, including mean/median imputation, linear interpolation, and autoregressive modeling. Here, we study the utility of artificial neural networks (ANNs) to predict statistics of sparse time series. Our focus is not on time series prediction but on gleaning the best possible information about the statistical behavior of the system. As an example application, we focus on the structure functions of turbulent time series measured in the solar wind. Using a data set with artificial gaps, a neural network is trained to predict second‐order structure functions and then tested on an unseen data set to quantify its performance. A small feedforward ANN, with only 20 hidden neurons, can predict the large‐scale fluctuation amplitudes better than mean imputation or linear interpolation when the percentage of missing data is high. Although they perform worse than the other methods when it comes to capturing both the shape and fluctuation amplitude together, their performance is better in a statistical sense for large fractions of missing data. Caveats regarding their utility, the optimization procedure, and potential future improvements are discussed.
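The two baseline gap-filling methods named in the abstract above, mean imputation and linear interpolation, can be illustrated with a minimal sketch. This is not the cited authors' code: the function names, the toy series, and the choice to fill leading and trailing gaps with the nearest observed value are all assumptions made for illustration.

```python
import math


def mean_impute(series):
    """Replace every missing value (NaN) with the mean of the observed values."""
    observed = [x for x in series if not math.isnan(x)]
    if not observed:
        raise ValueError("series has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(x) else x for x in series]


def linear_interpolate(series):
    """Fill NaN gaps on a straight line between the nearest observed neighbours.

    Assumption: gaps at the start or end of the series, which have only one
    observed neighbour, are filled with that neighbour's value.
    """
    known = [i for i, x in enumerate(series) if not math.isnan(x)]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Edge gaps: copy the nearest observed value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: interpolate between each pair of consecutive observed points.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical toy series with a two-sample gap.
gappy = [1.0, float("nan"), float("nan"), 4.0, 5.0]
print(mean_impute(gappy))
print(linear_interpolate(gappy))
```

On this toy series, mean imputation fills the gap with a single constant and so flattens the local trend, whereas linear interpolation preserves it. This only illustrates the mechanics of the two baselines, not the statistical comparison reported in the abstract.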
