A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal

Corrales, David Camilo; Ledezma, Agapito; Corrales, Juan Carlos

doi:10.17706/jcp.10.6.396-405

Cited by 15 publications

(15 citation statements)

References 24 publications


“…Finally, the authors in [15] built a conceptual framework based on data quality issues mentioned in data mining methodologies such as CRISP-DM [8], SEMMA [39], KDD [7] and the Data Science Process [40]. Subsequently, the same authors [37] designed a data cleaning process in regression models.…”

Section: Data Quality Framework (mentioning)

confidence: 99%


From Theory to Practice: A Data Quality Framework for Classification Tasks

2018

Self Cite


Abstract: Data preprocessing is an essential step in knowledge discovery projects. Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process. In this sense, several authors consider data cleaning one of the most cumbersome and critical tasks. Failure to provide high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytics project. In this paper, we propose a framework to address data quality issues in classification tasks, DQF4CT. Our approach is composed of: (i) a conceptual framework that gives the user guidance on how to deal with data problems in classification tasks; and (ii) an ontology that represents the knowledge in data cleaning and suggests the proper data cleaning approaches. We present two case studies using real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms used in classification tasks by the authors of PAM and OD. Additionally, we evaluated DQF4CT on datasets from the Repository of Machine Learning Databases of the University of California, Irvine (UCI). Overall, 84% of the results achieved by the models trained on the datasets cleaned by DQF4CT are better than those of the models from the datasets' authors.



“…The rules were constructed based on literature reviews about data cleaning tasks [15, 52, 90–94]. The most representative rules are explained below.…”

Section: Class Name Class Attributes Instances (mentioning)

confidence: 99%

From Theory to Practice: A Data Quality Framework for Classification Tasks

2018

Self Cite


“…Rotation forest (Corrales, Ledezma, & Corrales, 2015a) refers to a technique to generate an ensemble of classifiers, in which each base classifier is trained with a different set of extracted attributes. The main heuristic is to apply feature extraction and to subsequently reconstruct a full attribute set for each classifier in the ensemble.…”

Section: Rotation Forest (mentioning)

confidence: 99%
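The rotation idea in the statement above can be illustrated with a minimal sketch. It is not the cited authors' implementation: it assumes NumPy is available, uses PCA as the feature-extraction step, and swaps in a hypothetical one-split `Stump` as the base classifier where rotation forests normally use full decision trees. The published algorithm also removes a random subset of classes before bootstrapping, which is omitted here. All function and class names are illustrative.

```python
import numpy as np

def majority(labels):
    """Most frequent label in a 1-D array."""
    values, counts = np.unique(labels, return_counts=True)
    return values[counts.argmax()]

def pca_axes(X):
    """Principal axes of X as columns, ordered by decreasing variance."""
    cov = np.atleast_2d(np.cov(X - X.mean(axis=0), rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argsort(eigvals)[::-1]]

def build_rotation(X, n_subsets, rng, sample_frac=0.75):
    """Block rotation matrix: one PCA per random, disjoint feature subset."""
    n_samples, n_features = X.shape
    if n_subsets > n_features:
        raise ValueError("n_subsets must not exceed the number of features")
    R = np.zeros((n_features, n_features))
    for subset in np.array_split(rng.permutation(n_features), n_subsets):
        # Fit PCA on a bootstrap sample restricted to this feature subset.
        size = max(2, int(sample_frac * n_samples))
        rows = rng.choice(n_samples, size=size, replace=True)
        # Keep all components, so the full attribute set is reconstructed.
        R[np.ix_(subset, subset)] = pca_axes(X[np.ix_(rows, subset)])
    return R

class Stump:
    """Single axis-aligned split; an illustrative stand-in for a decision tree."""
    def fit(self, X, y):
        fallback = majority(y)
        self.rule_ = (0, np.inf, fallback, fallback)
        best = -1.0
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                if left.all():
                    continue  # split would leave the right side empty
                l_lab, r_lab = majority(y[left]), majority(y[~left])
                acc = np.mean(np.where(left, l_lab, r_lab) == y)
                if acc > best:
                    best, self.rule_ = acc, (j, t, l_lab, r_lab)
        return self

    def predict(self, X):
        j, t, l_lab, r_lab = self.rule_
        return np.where(X[:, j] <= t, l_lab, r_lab)

def fit_rotation_ensemble(X, y, n_estimators=5, n_subsets=2, seed=0):
    """Train each base classifier on its own rotated copy of the data."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_estimators):
        R = build_rotation(X, n_subsets, rng)
        members.append((R, Stump().fit(X @ R, y)))
    return members

def predict_rotation_ensemble(members, X):
    """Majority vote over the members, each applied in its own rotated space."""
    votes = np.array([clf.predict(X @ R) for R, clf in members])
    return np.array([majority(col) for col in votes.T])
```

Calling `fit_rotation_ensemble(X, y)` on a numeric feature matrix and a label vector returns a list of `(rotation, classifier)` pairs, which `predict_rotation_ensemble` combines by voting. An axis-aligned base learner matters here: a rotation-invariant classifier would give the same predictions for every member, and the ensemble would gain no diversity from the rotations.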

Two-Level Classifier Ensembles for Coffee Rust Estimation in Colombian Crops

Corrales, Casas, Ledezma, et al. 2016

International Journal of Agricultural and Environmental Information Systems

Self Cite


Rust is a disease that leads to considerable losses in the worldwide coffee industry. Many factors contribute to the onset of coffee rust, e.g. crop management decisions and the prevailing weather. In Colombia, coffee production fell by 31% on average during the epidemic years compared with 2007. Recent research efforts focus on detecting disease incidence using simple classifiers. Authors in the computing field propose alternatives to improve the outcomes, making use of techniques that combine classifiers, called ensemble methods. Accordingly, they proposed two-level classifier ensembles for coffee rust estimation in Colombian crops using Back Propagation Neural Networks, Regression Tree M5 and Support Vector Regression. Their ensemble approach outperformed the classical approaches, both simple classifiers and ensemble methods, in terms of Pearson's Correlation Coefficient, Mean Absolute Error and Root Mean Squared Error.
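For reference, the three evaluation measures named in this abstract can be computed as in the sketch below. It uses only the Python standard library and follows the textbook definitions; the function names are illustrative and nothing here reproduces the paper's data or results.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def pearson_correlation(y_true, y_pred):
    """Pearson's r: covariance divided by the product of the spreads."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    if spread_t == 0 or spread_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_t * spread_p)
```

Lower MAE and RMSE indicate a closer fit, while a Pearson coefficient near 1 indicates that predictions track the observed values linearly. The measures are complementary: predictions that are exactly double the observed values correlate perfectly yet still carry large absolute errors.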


“…The studies present different approaches to solve issues in data quality such as: heterogeneity, outliers, noise, inconsistency, incompleteness, amount of data, redundancy and timeliness [7][8]. We conduct a systematic review based on methodology [9], for each data quality issue, drawn from four informational sources: IEEE Xplore, Science Direct, Springer Link and Google.…”

Section: Data Quality Issues In Knowledge Discovery Tasks (mentioning)

confidence: 99%

“…In this paper we present a systematic review for data quality issues in knowledge discovery tasks as: heterogeneity, outliers, noise, inconsistency, incompleteness, amount of data, redundancy and timeliness which are defined in [7][8] and a case study in agricultural diseases: the coffee rust. This paper is organized as follows.…”

Section: Introduction (mentioning)

confidence: 99%

A systematic review of data quality issues in knowledge discovery tasks

Corrales, Ledezma, Corrales 2016

Rev. ing. univ. Medellín

Self Cite


The volume of data is growing because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; nevertheless, much of the data is of poor quality. We present a systematic review of the data quality issues in knowledge discovery tasks and a case study applied to an agricultural disease, coffee rust.

