Over the past decades, the automatic identification of acronym and expansion pairs from large corpora has attracted considerable research attention, particularly in text mining, entity extraction, and information retrieval. Herein, we present an improved approach for accurately recognizing acronym and expansion pairs in a large Indonesian corpus. Generally, an acronym can be either a combination of uppercase letters or a sequence of speech sounds (syllables). The proposed approach consists of four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. We introduce eight numerical features and evaluate how effectively they represent acronym and expansion pairs in terms of precision, recall, and F-measure. Furthermore, we compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms on the classification of acronym and expansion pairs. The experimental results indicate that the SVM polynomial model with eight features achieves the highest accuracy (97.93%), surpassing the SVM polynomial model with five features (90.45%), the K-NN algorithm with k = 3 and eight features (96.82%), the K-NN algorithm with k = 3 and five features (95.66%), the BERT-Base model (81.64%), and the BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of Hadoop with various numbers of data nodes when identifying the acronym and expansion pairs and obtaining their feature vectors.
The results reveal that a Hadoop cluster with more data nodes is faster than one with fewer data nodes when processing from ten million to one hundred million pairs of acronyms and expansions.

Information 2020, 11, 210 2 of 13

predicted to increase considerably as the number of smartphone users grows, with the latter expected to reach 6.1 billion in 2020 [9]. Furthermore, new online trading trends have contributed to the rapid accumulation of records in the databases of e-commerce companies, such as Alibaba and Amazon, which generate and store several terabytes of data every day [7]. Analyzing such large amounts of data requires machine learning techniques that automate the creation of analytical models based on historical data and then use these models to learn from the data [10], discover useful patterns [11], and make automated decisions with little human intervention [12]. Millions of users across the globe pose queries on Google's search engine each day, and researchers have analyzed the resulting query logs using machine learning techniques to track and predict phenomena, including the spread patterns of flu symptoms in the United States [5,6]. The web search queries are considered to be less biased a...