Online social networks (OSNs) such as Twitter have recently become important and popular sources of real-time information and news dissemination, which inevitably makes Twitter a prime target of spammers. It has been shown that the security threats caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the damage caused by Twitter spam, researchers and communities have employed machine learning classification algorithms to detect it. However, most of these studies have overlooked the class imbalance problem in Twitter spam detection. In this paper, we study this problem. First, we conduct a comparative study of popular methods for handling class imbalance in order to identify the most effective approach. Then, we conduct a second comparative study on Twitter spam detection based on several classic techniques. Experimental results demonstrate that fuzzy-based ensemble learning can significantly improve classification performance on imbalanced ground-truth Twitter data.
KEYWORDS
classification, class imbalance, online social network, Twitter spam detection
INTRODUCTION
Twitter is used to exchange messages among friends. Unfortunately, spammers often use Twitter as a tool to post unsolicited messages that contain malicious links, and even to hijack trending topics. The exponential growth of Twitter has thus contributed to an increase in online spamming activities. One study shows that more than 3% of messages are most probably abused by spammers. [1] To address the security threats caused by spammers, many researchers have proposed machine learning based algorithms for Twitter spam detection. However, most of these studies have neglected a fundamental issue, the class imbalance problem, which is widely seen in real-world Twitter data. [2][3][4] The class imbalance problem has been identified as one of the ten challenging problems in data mining research. [5] This issue occurs in two different types of data sets: binary and multiclass. In binary problems, the minority (positive) class contains very few training samples, while the majority (negative) class contains very many. In multiclass problems, each class contains only a tiny fraction of the samples. These problems are especially critical in many real-world applications. For example, in Twitter spam detection, we typically have a large amount of normal Twitter data but only a small number of spam samples, which gives us imbalanced data. A previous study has shown that the detection rate for Twitter spam can decrease by about 33% on average when the class imbalance rate rises from 2 to 20. [6] Hence, a natural question in data mining research is how to improve the performance of classifiers facing imbalanced data.

Existing techniques for handling the class imbalance problem are mainly from three perspectives, includi...