Real-time management model for frequent Big Data errors : Automatic Clean Repository For Big Data (ACR)

Snineh, Sidi Mohamed; Youssfi, Mohamed; Bouattane, Omar; Daaif, Abdelaziz; Abra, Oum El Kheir

doi:10.1109/icmcs.2018.8525920

Cited by 2 publications

(2 citation statements)

References 11 publications

“…Snineh et al. [22] proposed a solution that can be performed in real time to handle the frequent errors of Big Data flows. They proposed a repository for each given domain in their two-step model to store the metadata, cleaning and correction algorithms, and an error log.…”

Section: Related Work (mentioning)

confidence: 99%

Smartic: A smart tool for Big Data analytics and IoT

Sayeed, Ahmad, Peng (2024), F1000Res

The Internet of Things (IoT) is leading the physical and digital worlds of technology to converge. Real-time, massive-scale connections produce a large amount of versatile data, which is where Big Data comes into the picture. Big Data refers to large, diverse sets of information whose dimensions go beyond what widely used database management systems or standard data processing software tools can manage within a given limit. Almost every big dataset is dirty and may contain missing data, mistyping, inaccuracies, and many other issues that degrade Big Data analytics performance. One of the biggest challenges in Big Data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics results and unpredictable conclusions. Various missing value imputation techniques were tried, and the performance of machine learning (ML) models was compared across them. A hybrid model that integrates ML and sample-based statistical techniques for missing value imputation is proposed. The dataset with the best missing value imputation, chosen based on ML model performance, was then used for subsequent feature engineering and hyperparameter tuning. K-means clustering and principal component analysis were applied. Accuracy, the evaluated outcome, improved dramatically, with the XGBoost model reaching a root mean squared logarithmic error (RMSLE) of around 0.125. To overcome overfitting, K-fold cross-validation was implemented.
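For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how RMSLE is commonly computed, using the standard log(1 + x) form. The function name rmsle and the input lists are illustrative assumptions, not code from the Smartic paper. Note that RMSLE is an error measure, so lower values are better.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values.

    Hypothetical helper for illustration only; uses the common
    log(1 + x) definition of the metric.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Squared difference between log(1 + prediction) and log(1 + target).
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    # Mean of the squared log errors, then the square root.
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the errors are taken on a log scale, the metric penalizes relative rather than absolute differences, which is why it is often chosen for targets spanning several orders of magnitude.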

“…The results from their experiments show that anyone can retain a smart dataset efficiently from any Big Data classification problem using these proposed filters. Snineh et al. [22] proposed a solution that can be performed in real time to handle the frequent errors of Big Data flows. They proposed a repository for each given domain in their two-step model to store the metadata, cleaning and correction algorithms, and an error log.…”

Section: Related Work (mentioning)

confidence: 99%

Smartic: A smart tool for Big Data analytics and IoT

2022

The Internet of Things (IoT) is leading the physical and digital worlds of technology to converge. Real-time, massive-scale connections produce a large amount of versatile data, which is where Big Data comes into the picture. Big Data refers to large, diverse sets of information whose dimensions go beyond what widely used database management systems or standard data processing software tools can manage within a given limit. Almost every big dataset is dirty and may contain missing data, mistyping, inaccuracies, and many other issues that degrade Big Data analytics performance. One of the biggest challenges in Big Data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics results and unpredictable conclusions. We experimented with different missing value imputation techniques and compared machine learning (ML) model performance across them. We propose a hybrid model for missing value imputation that combines ML and sample-based statistical techniques. We then continued with the best imputed dataset, chosen based on ML model performance, for feature engineering and hyperparameter tuning. We used k-means clustering and principal component analysis. Accuracy, the evaluated outcome, improved dramatically, with the XGBoost model reaching a root mean squared logarithmic error (RMSLE) of around 0.125. To overcome overfitting, we used K-fold cross-validation.
