Twitter is an online micro-blogging platform that exposes a wealth of timely information about current events and therefore serves as a rich data source for sentiment analysis. In this paper, the sentiments of a large volume of tweets, treated as big data, are analyzed using machine learning algorithms. A multi-tier architecture for sentiment classification is proposed, comprising modules for tokenization, data cleaning, preprocessing, stemming, an updated lexicon, stopword and emoticon dictionaries, feature selection, and a machine learning classifier. Unigrams and bigrams are used as features, together with χ2 (chi-squared) feature selection and Singular Value Decomposition for dimensionality reduction, two model types (Binary and Reg), four scaling methods (no scaling, Standard, Signed, and Unsigned), and three vector representations (TF-IDF, Binary, and Int). Accuracy is used as the evaluation metric for the random forest and bagged-tree classifiers. By combining tokenization, several stages of preprocessing, and different combinations of feature vectors and classification methods, an accuracy of 84.14% is achieved. The results show that the proposed scheme gives better accuracy than existing schemes in the literature.
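The pipeline stages named above (n-gram extraction, chi-squared feature selection, SVD, scaling, and a tree-ensemble classifier) can be sketched as follows, assuming scikit-learn. This is only an illustrative arrangement of one configuration; the paper's custom lexicon, emoticon dictionary, and stemming steps are not reproduced, and `texts`, `labels`, and all parameter values are hypothetical placeholders rather than the authors' settings.

```python
# Minimal sketch of one pipeline configuration, assuming scikit-learn.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def build_pipeline():
    return Pipeline([
        # Unigram + bigram features in the TF-IDF vector format
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
        # Chi-squared feature selection
        ("chi2", SelectKBest(chi2, k=5000)),
        # Singular Value Decomposition for dimensionality reduction
        ("svd", TruncatedSVD(n_components=300)),
        # One of the scaling options (Standard scaling)
        ("scale", StandardScaler()),
        # Random forest classifier, evaluated by accuracy
        ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

# Usage with hypothetical data (texts: preprocessed tweets, labels: sentiment classes):
# X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
# pipe = build_pipeline().fit(X_train, y_train)
# print("accuracy:", accuracy_score(y_test, pipe.predict(X_test)))
```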
In network security, processing and analyzing large volumes of Packet CAPture (PCAP) data is essential for monitoring network behavior and for building intrusion detection and prevention systems, firewalls, and similar tools. Apache Spark, in combination with Hadoop Yet-Another-Resource-Negotiator (YARN), has recently evolved into a general-purpose big data processing platform. When processing raw network packets, timely inference about network security is a basic requirement. However, to the best of our knowledge, no prior work has systematically studied tuning the resources, scalability, and performance of a distributed Apache Spark cluster while processing PCAP data. The proposed work focuses on fine-tuning cluster parameters such as the number of cluster nodes, the number of cores used on each node, the total number of executors run in the cluster, the amount of main memory used on each node, and the executor memory overhead allotted on each node to handle garbage collection issues, in order to obtain the best performance. With the proposed strategy, 85 GB of data (provided by CSIR Fourth Paradigm Institute) was analyzed in just 78 seconds on a 32-node (256-core) Spark cluster, a task that would otherwise take around 30 minutes in traditional processing systems.
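The cluster parameters listed above map directly onto standard Spark configuration keys. The sketch below, assuming PySpark on YARN, shows where each knob is set; the specific values (executor count, cores, memory, overhead) and the input path are hypothetical placeholders, not the authors' tuned settings, and only the 32-node / 256-core scale is taken from the text.

```python
# Illustrative Spark-on-YARN session showing the tunable cluster parameters.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pcap-analysis")
    .master("yarn")
    # total number of executors run in the cluster (e.g. one per node on 32 nodes)
    .config("spark.executor.instances", "32")
    # number of cores utilized from each node
    .config("spark.executor.cores", "8")
    # amount of main memory used from each node
    .config("spark.executor.memory", "24g")
    # per-executor memory overhead reserved to avoid garbage-collection/OOM issues
    .config("spark.executor.memoryOverhead", "4g")
    .getOrCreate()
)

# Example usage on records extracted from PCAP files (hypothetical path and schema):
# df = spark.read.csv("hdfs:///pcap/records.csv", header=True, inferSchema=True)
# df.groupBy("dst_ip").count().show()
```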