Big data classification is challenging because most known classification methods need long running times and substantial processing resources to handle the vast amount of available data. In this paper, we propose a novel big data classification method that combines the KNN classifier with ensemble learning to perform classification on big data efficiently. The proposed method picks tiny data chunks at random from a big dataset, with each chunk containing random examples described by a small number of randomly selected features. A weak KNN classifier is applied to each data chunk to classify new (unseen) data, and the majority voting rule combines the outcomes of the weak classifiers into the final classification decision. According to the time complexity analysis, the proposed method has a constant classification time. Furthermore, the proposed method, running on a single node, was found to be more efficient than existing methods, some of which run on a large cluster of nodes. Because of its speed and enhanced performance, the proposed method can be considered an ideal classifier for handling complex data types such as geospatial data, big trajectory data, and big data in general.
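The procedure summarized above (random chunks, random feature subsets, one weak KNN classifier per chunk, majority voting) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data generator, and the parameters `n_chunks`, `chunk_size`, `n_features`, and `k` are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter


def make_toy_data(n_per_class, n_dims, rng):
    """Hypothetical toy data: class 0 lies in [0, 1], class 1 in [10, 11]."""
    X, y = [], []
    for label, offset in ((0, 0.0), (1, 10.0)):
        for _ in range(n_per_class):
            X.append([offset + rng.random() for _ in range(n_dims)])
            y.append(label)
    return X, y


def build_chunks(X, y, n_chunks, chunk_size, n_features, rng):
    """Draw random row subsets, each restricted to a random feature subset."""
    chunks = []
    for _ in range(n_chunks):
        rows = rng.sample(range(len(X)), chunk_size)
        feats = rng.sample(range(len(X[0])), n_features)
        chunk_X = [[X[r][f] for f in feats] for r in rows]
        chunk_y = [y[r] for r in rows]
        chunks.append((feats, chunk_X, chunk_y))
    return chunks


def knn_predict(chunk_X, chunk_y, query, k):
    """Weak classifier: majority label among the k nearest rows of one chunk."""
    # Squared Euclidean distance is sufficient for ranking neighbours.
    dists = [
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(chunk_X, chunk_y)
    ]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def ensemble_predict(chunks, query, k=1):
    """Final decision: majority vote over the per-chunk weak classifiers."""
    votes = Counter()
    for feats, chunk_X, chunk_y in chunks:
        projected = [query[f] for f in feats]  # keep only this chunk's features
        votes[knn_predict(chunk_X, chunk_y, projected, k)] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_toy_data(n_per_class=100, n_dims=4, rng=rng)
    chunks = build_chunks(X, y, n_chunks=15, chunk_size=20, n_features=2, rng=rng)
    print(ensemble_predict(chunks, [0.5, 0.5, 0.5, 0.5]))
    print(ensemble_predict(chunks, [10.5, 10.5, 10.5, 10.5]))
```

In this sketch, the work per query is proportional to `n_chunks * chunk_size * n_features`, none of which grows with the size of the full dataset. This is one way to read the constant classification time stated above, under the assumption that these three values are fixed in advance.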
INDEX TERMS Big data; Geospatial data; Trajectory data; Classification; Ensemble learning; KNN

I. INTRODUCTION

For decades, researchers have been studying and evaluating big data methodologies and tools. This interest stems from the massive volume of data exchanged and saved by social media users, medical organizations, educational institutions, and others.

According to statistical reports, the number of users on different social media platforms has reached more than 2 billion [1]. WhatsApp, for example, has over 600 million users, who transfer and share more than half a billion photos and one hundred million videos on a daily basis [2]. Also, due to the huge advance in smartphone technology, it has become easier for users to share text and images and to write posts on such social media platforms. Some reports show that the number of posts on Twitter in 2007 was 5K; this number grew to around 500 million about six years later, in 2013 [3], which indicates the massive amount of data available on social media in general. This amount of data is not restricted to social media, as many other platforms generate and store huge data volumes [4,5,6,7].

This data needs to be processed and analyzed before it can be used to build useful knowledge discovery and machine learning applications based on big data, such as facial big data applications [8,9,10], signal big data [11], and various industry big data-based applications [12,13,14].

Volume, Variety, and Velocity are among the most distinguishing characteristics of big data, and as a result, an efficient classification/prediction system that can learn from such data is attractive. Such applications include, but are not limited to, medical [15,16], financial [17,18,19]...