Purpose
This study aims to improve data quality, overall accuracy and certainty by reducing the negative impacts of the FCM algorithm when clustering real-world data and by decreasing the inherent noise in data sets.
Design/methodology/approach
This study proposes a model, called FCM-ENS, that combines fuzzy C-means (FCM), ensemble filtering (ENS) and machine learning algorithms. The model has three main parts: noise detection, noise filtering and noise classification.
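As an illustration only, the noise detection part might be sketched as follows. The abstract does not say how FCM memberships are turned into noise flags, so the rule used here (flag an instance whose label disagrees with the majority label of its highest-membership cluster) is an assumption. The function names `fcm_memberships` and `flag_suspects`, the deterministic initialization and the toy data are also hypothetical.

```python
def fcm_memberships(points, c=2, m=2.0, iters=30):
    """Plain fuzzy C-means; returns the n x c membership matrix."""
    n, dim = len(points), len(points[0])
    # Assumption: start from evenly spaced points so the run is deterministic.
    centers = [list(points[j * (n - 1) // (c - 1)]) for j in range(c)]
    u = [[0.0] * c for _ in range(n)]
    for _ in range(iters):
        # Membership update: closer centers receive higher membership.
        for i, p in enumerate(points):
            dist = [max(sum((p[d] - ctr[d]) ** 2 for d in range(dim)) ** 0.5, 1e-12)
                    for ctr in centers]
            for j in range(c):
                u[i][j] = 1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c))
        # Center update: mean of the points weighted by u^m.
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            centers[j] = [sum(w[i] * points[i][d] for i in range(n)) / sum(w)
                          for d in range(dim)]
    return u


def flag_suspects(points, labels, c=2):
    """Assumed rule: a label that disagrees with its cluster's majority is suspect."""
    u = fcm_memberships(points, c=c)
    hard = [max(range(c), key=lambda j: row[j]) for row in u]
    majority = {}
    for j in set(hard):
        members = [labels[i] for i in range(len(points)) if hard[i] == j]
        majority[j] = max(set(members), key=members.count)
    return [i for i in range(len(points)) if labels[i] != majority[hard[i]]]


# Toy data: two well-separated groups; the label at index 3 is deliberately flipped.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
suspects = flag_suspects(points, labels)
print(suspects)  # → [3]
```

In this sketch the flagged indices would then be passed to the filtering stage rather than removed outright.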
Findings
The performance of the proposed model was tested by conducting experiments on six data sets from the UCI repository. The results show that the proposed noise detection model detected class noise effectively and that performance improved when the identified class-noisy instances were removed.
Originality/value
To the best of the authors’ knowledge, no effort has been made to improve the FCM algorithm in relation to class noise detection issues. Thus, the novelty of the present research lies in combining the FCM algorithm as a noise detection technique with ENS to reduce the negative effect of inherent noise and increase data quality and accuracy.
Real data may contain a considerable amount of noise caused by errors in data collection, transmission and storage. A noisy training data set increases the training time and complexity of the induced machine learning model, which reduces overall performance. Identifying noisy instances and then eliminating or correcting them are useful techniques in data mining research. This paper investigates the issue of misclassified instances and proposes a clustering-based and classification filtering (CLCF) algorithm as a noise detection and classification model. It applies the k-means clustering technique for noise detection, and then five different classification filtering algorithms are applied for noise filtering. It also employs two well-known techniques for noise classification, namely, removing and relabeling. To evaluate the performance of the CLCF model, several experiments were conducted on four binary data sets. The proposed technique was found to be successful in classifying class-noisy instances, which is significantly effective for decision-making systems in several domains such as medical areas. The results show that the proposed model led to a significant performance improvement compared with the performance before noise filtering.
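The filtering and noise classification steps described above could look roughly like the sketch below. It is a sketch under stated assumptions, not the CLCF implementation: the abstract does not name the five filtering classifiers, so three toy stand-ins (1-NN, 3-NN and nearest centroid) are used. The majority-vote rule, the function `filter_suspects` and the hard-coded `suspects` list (assumed to come from the clustering step) are also hypothetical.

```python
from collections import Counter


def _d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_vote(k):
    """Build a k-nearest-neighbor predictor (stand-in filter)."""
    def predict(train_x, train_y, x):
        nearest = sorted(range(len(train_x)), key=lambda i: _d2(train_x[i], x))[:k]
        return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
    return predict


def nearest_centroid(train_x, train_y, x):
    """Predict the label whose class centroid is closest (stand-in filter)."""
    best, best_d = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [p for p, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        d = _d2(centroid, x)
        if d < best_d:
            best, best_d = label, d
    return best


def filter_suspects(points, labels, suspects, mode="remove"):
    """Confirm each suspect by majority vote, then remove or relabel it."""
    filters = [knn_vote(1), knn_vote(3), nearest_centroid]
    new_x, new_y = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        if i in suspects:
            # Leave the suspect out so it cannot vote for its own label.
            rest_x = points[:i] + points[i + 1:]
            rest_y = labels[:i] + labels[i + 1:]
            votes = Counter(f(rest_x, rest_y, p) for f in filters)
            predicted, count = votes.most_common(1)[0]
            if predicted != y and count > len(filters) / 2:
                if mode == "remove":
                    continue  # drop the confirmed noisy instance
                y = predicted  # relabel with the ensemble's prediction
        new_x.append(p)
        new_y.append(y)
    return new_x, new_y


# Toy data: the label at index 3 is deliberately flipped.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
suspects = [3]  # assumed output of the clustering-based detection step

kept_x, kept_y = filter_suspects(points, labels, suspects, mode="remove")
_, relabeled = filter_suspects(points, labels, suspects, mode="relabel")
print(len(kept_x), relabeled)  # → 7 [0, 0, 0, 0, 1, 1, 1, 1]
```

Removing discards information but avoids introducing a wrong correction, while relabeling keeps the instance at the risk of a wrong new label. Both variants are shown side by side for that reason.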