The SECOM dataset contains information about a semiconductor production line, including the products that failed the in-house test line and their attributes. Like most semiconductor manufacturing data, this dataset contains missing values, imbalanced classes, and noisy features. In this work, these challenges are addressed and many different classification approaches are evaluated to perform fault diagnosis. We present an experimental evaluation that examines 288 combinations of approaches involving data pruning, data imputation, feature selection, and classification methods, to find suitable approaches for this task. Furthermore, a novel data imputation approach, namely "In-painting KNN-Imputation", is introduced and is shown to outperform the common data imputation technique. The results show the capability of each classifier, feature selection method, data generation method, and data imputation technique, with a full analysis of their respective parameter optimizations.

Big Data Cogn. Comput. 2018, 2, 30 2 of 20

... of fault detection and diagnosis. Previous work on classifying this dataset has addressed the imbalanced data and the irrelevant features. In [1,2], the challenge of imbalanced data was evaluated, and approaches for oversampling the minority distribution to balance the classes were introduced. In [3], the challenge of imbalanced data was also evaluated from an under-sampling perspective, showing that oversampling performs better on this dataset. In [2,4-6], different approaches for feature selection were proposed to address the challenge of noisy features. The challenge of missing data remains unexplored in the SECOM dataset. To the authors' knowledge, no literature thoroughly classifies the SECOM dataset after performing specialized data imputation.

In this work, the SECOM dataset is classified using many combinations of approaches.
The three challenges of data imbalance, missing data, and noisy data are handled via synthetic data generation, data imputation, and feature selection, respectively. Moreover, different classifiers are evaluated, and their performance is analyzed with respect to the task at hand. To address the challenge of missing data, a novel data imputation technique called "In-painting KNN-Imputation", inspired by image in-painting, is introduced. Finally, leveraging the feature importance delivered by the classification model, fault diagnosis is performed to show which features and measured parameters of the manufacturing process have a strong effect on device failure.

The rest of the paper is organized as follows. Section 2 provides the background knowledge, covering semiconductor manufacturing and the manufacturing processes in this industry. Section 3 discusses the methodology of our work, including the classification stages, the data preparation processes, and the procedures for constructing, evaluating, and interpreting the model for the data. The designed and executed ex...