Metadata Quality in the Era of Big Data and Unstructured Content

Elouataoui, Widad; Alaoui, Imane El; Gahi, Youssef

doi:10.1007/978-3-030-91738-8_11

Cited by 10 publications

(6 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In addition, clean, accurate data can reduce the need for manual data cleaning and processing, saving time and resources. Moreover, poor-quality data can be costly for organizations [8], leading to additional data cleaning, correction, and rework costs. Organizations can avoid these costs by ensuring data quality and using their resources better.…”

Section: A the Importance Of Data Qualitymentioning

confidence: 99%

Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

Widad,

Saida,

Gahi

2023

IEEE Access

Self Cite

View full text Add to dashboard Cite

The increasing reliance on Big Data analytics has highlighted the critical role of data quality in ensuring accurate and reliable results. Consequently, organizations aiming to leverage the power of Big Data recognize the crucial role of data quality as an integral component. One notable type of data quality anomaly observed in big datasets is the presence of outlier values. Detecting and addressing these outliers have become a subject of interest across diverse domains, leading to the development of numerous anomaly detection approaches. Although anomaly detection has witnessed a proliferation of practices in recent years, a significant gap remains in addressing anomalies related to the other aspects of data quality. Indeed, while most approaches focus on identifying anomalies that deviate from the expected patterns, they do not consider irregularities in data quality, such as missing, incorrect, or inconsistent data. Moreover, most of approaches are domain-correlated and lack the capability to detect anomalies in a generic manner. Thus, we aim through this paper to address this gap in the field and provide a holistic and effective solution for Big Data quality anomaly detection. To achieve this, we suggest a novel approach that allows a comprehensive detection of Big Data quality anomalies related to six quality dimensions: Accuracy, Consistency, Completeness, Conformity, Uniqueness, and Readability. Moreover, the framework allows for sophisticated detection of generic data quality anomalies through the implementation of an intelligent anomaly detection model without any correlation to a specific field. Furthermore, we introduce and measure a new metric called "Quality Anomaly Score," which refers to the degree of anomalousness of the quality anomalies of each quality dimension and the entire dataset. Through the implementation and evaluation of our framework, the suggested framework has achieved an accuracy score of up to 99.91% and an F-score of 98.07%.

show abstract

Section: A the Importance Of Data Qualitymentioning

confidence: 99%

Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

Widad,

Saida,

Gahi

2023

IEEE Access

Self Cite

View full text Add to dashboard Cite

show abstract

“…In particular, the quality of metadata has a significant impact on the reusability of data (Elouataoui et al, 2022;Kindling and Strecker, 2022). However, the requirements of high quality metadata can be guided by repositories (Trisovic et al, 2021).…”

Section: Data Repositoriesmentioning

confidence: 99%

On the Development of a Dataset Publication Guideline: Data Repositories and Keyword Analysis in Isprs Domain

Budde,

Kullmann,

Iwaszczuk

2023

Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci.

View full text Add to dashboard Cite

Abstract. The FAIR principle (find, access, interoperability, reuse) forms a sustainable resource for scientific exchange between researchers. Currently, the implementation of this principle is an important process for future research projects. To support this process in the ISPRS community, the usage of data repositories for dataset publication has the potential to bring closer the achievement of the FAIR principle. Therefore, we (1) analysed available data repositories, (2) identified common keywords in ISPRS publications and (3) developed a tool for searching appropriate repositories. Thus, infrastructures from the field of geosciences, that can already be used, become more accessible.

show abstract

“…This process is known as Data Deduplication. Data duplication can occur for different reasons, such as data integration, where data are gathered from multiple data sources so the same information can be recorded more than once in another format [21] [22]. Also, data duplication could be related to human errors, so the same person, for example, could provide data with slightly different information intentionally or by mistake multiple times.…”

Section: Big Data Deduplicationmentioning

confidence: 99%

An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Elouataoui¹,

Alaoui²,

Mendili³

et al. 2022

IJACSA

Self Cite

View full text Add to dashboard Cite

While big data benefits are numerous, most of the collected data is of poor quality and, therefore, cannot be effectively used as it is. One pre-processing the leading big data quality challenges is data duplication. Indeed, the gathered big data are usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as Deduplication, or Entity Resolution or also Record Linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches were suggested. However, few efforts have been made to address deduplication issues in Big Data Context. Also, the existing big data deduplication approaches are not handling the case of the decreasing performance of the deduplication model during the serving. In addition, most current methods are limited to duplicate detection, which is part of the deduplication process. Therefore, we aim through this paper to propose an End-to-End Big Data Deduplication Framework based on a semi-supervised learning approach that outperforms the existing big data deduplication approaches with an F-score of 98,21%, a Precision of 98,24% and a Recall of 96,48%. Moreover, the suggested framework encompasses all data deduplication phases, including data preprocessing and preparation, automated data labeling, duplicate detection, data cleaning, and an auditing and monitoring phase. This last phase is based on an online continual learning strategy for big data deduplication that allows addressing the decreasing performance of the deduplication model during the serving. The obtained results have shown that the suggested continual learning strategy has increased the model accuracy by 1,16%. Furthermore, we apply the proposed framework to three different datasets and compare its performance against the existing deduplication models. Finally, the results are discussed, conclusions are made, and future work directions are highlighted.

show abstract

Metadata Quality in the Era of Big Data and Unstructured Content

Cited by 10 publications

References 19 publications

Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

On the Development of a Dataset Publication Guideline: Data Repositories and Keyword Analysis in Isprs Domain

An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Contact Info

Product

Resources

About