Big data classification is the study of how to classify volumes of data that conventional data mining methods typically struggle to handle. The K-Nearest Neighbor (KNN) classifier is one of the most popular data mining techniques because of its efficiency and simplicity. The sequential KNN classifier cannot handle a huge amount of data because of its heavy computational requirements, so it is improved here with a parallel technique built on MapReduce. This model makes it possible to classify datasets that exceed terabytes in size. The parallel implementation achieves the same classification rate as the original KNN model but takes less time. In this paper, parallel KNN with MapReduce is proposed using a Hadoop cluster with multiple data nodes. First, the dataset is split into n blocks, and the mapper and reducer are executed on each node. The mapper phase calculates the Euclidean distance between each training record and the target point, and its output is a set of <Distance, Class> pairs that serve as the reducer's input. In the reducer, the k smallest distances are determined, and the class that occurs most often among them is taken as the predicted class.
The results showed a significant reduction in running time for the proposed approach compared with the traditional one. A New York crime dataset of 6.5 million records was used in this work. The Latitude and Longitude attributes were used to determine the nearest neighbors, while Patrol-BORO was used as the class label.
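The mapper and reducer steps described above can be illustrated with a minimal single-process sketch in Python. It is not the Hadoop implementation used in this work: the blocks are processed in an ordinary loop rather than on separate data nodes, a single reducer collects all pairs, and the function names (`split_blocks`, `mapper`, `reducer`, `parallel_knn`) and the record layout `(latitude, longitude, label)` are assumptions made for illustration only.

```python
import heapq
import math
from collections import Counter


def split_blocks(records, n):
    """Split the training records into n blocks (round-robin, hypothetical scheme)."""
    return [records[i::n] for i in range(n)]


def mapper(block, query):
    """Emit a (distance, class) pair for every training record in one block."""
    q_lat, q_lon = query
    for lat, lon, label in block:
        # Euclidean distance between the training record and the target point.
        yield math.hypot(lat - q_lat, lon - q_lon), label


def reducer(pairs, k):
    """Keep the k smallest distances and return the most frequent class among them."""
    nearest = heapq.nsmallest(k, pairs, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def parallel_knn(records, query, k, n_blocks):
    """Simulate the map and reduce phases sequentially in one process."""
    pairs = []
    for block in split_blocks(records, n_blocks):
        # On a real cluster each block would be mapped on its own node.
        pairs.extend(mapper(block, query))
    return reducer(pairs, k)
```

Calling `parallel_knn(records, (lat, lon), k=3, n_blocks=2)` on a list of `(latitude, longitude, label)` tuples returns the majority label among the three nearest records. Because the k nearest neighbors over the whole dataset are always among the pairs emitted by the blocks, the prediction matches that of a sequential KNN over the same records whenever the k nearest neighbors and the majority class are free of ties. Where ties occur, the result can depend on record order and therefore on the number of blocks.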