2020
DOI: 10.1109/access.2020.2991800

Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data

Abstract: The importance of knowing the descriptive properties of a dataset when tackling a data science problem is widely recognized. Having information about the redundancy, complexity and density of a problem allows us to decide which data preprocessing and machine learning techniques are most suitable. In classification problems, there are multiple metrics to describe the overlap of features between classes, class imbalance or separability, among others. However, these metrics may not scale up wel…

Cited by 26 publications (11 citation statements)
References 29 publications

“…Regardless of which data reduction method is applied, the actual benefits of its application in the data science life-cycle are alleviating storage and memory requirements, the complexity of algorithmic computation, and runtimes, in addition to the fact that they may obtain better quality data by removing conceptual redundancy [9].…”
Section: Data Reduction
Citation type: mentioning
confidence: 99%
“…For this purpose, different paradigms of classification algorithms exist, and they are used in multiple application fields [7,8]. When analyzing the publicly available tabular big data problems for binary classification, the presence of a notable conceptual redundancy of information in the data might be observed [9] (redundant instances and/or features) that leads to an unnecessary computational cost. Furthermore, it is known that the classifiers only need a set of instances that correctly represent a problem in order to generate an adequate model [10].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The reason is that there are many different kinds of data types or sources, and the unknown relationship between various variables is quite complicated and not easy to be discovered. This leads to the difficulty for reasonable interpretation on the basis of the conventional methods for univariate or multivariate analysis [9, 10].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The lack of undersampling methods for tackling the class imbalance problem in the Big Data context leaves a wide gap for developing new proposals. In this regard, reducing the size of the dataset through intelligent instance selection allows learning models to perform better while using a reduced amount of data; according to the study presented by Maillo et al. [89], a large number of instances is not essential for achieving high classification results.…”
Section: Proposal: Graph-Based Treatment of Class Imbalance in Big Data
Citation type: unclassified