Abstract. Despite its benefits to modern communities and businesses, Twitter has attracted many spammers who overwhelm legitimate users with unwanted, disruptive advertising and fake information. Detecting spammers is challenging because a huge volume of data must be analyzed while, at the same time, spammers keep learning and changing their behavior to evade anti-spammer systems. Several spam classification systems have been proposed that use various features extracted from tweet content and from user information. Nevertheless, no comprehensive study has compared and evaluated the effectiveness and efficiency of these systems, so it is not known which anti-spammer system is best, or why. This paper proposes an evaluation framework that allows researchers, developers, and practitioners to access existing user-based and content-based features, implement their own features, and evaluate the performance of their systems against other systems. Our framework helps identify the most effective and efficient spammer detection features and evaluate the impact of using different numbers of recent tweets, thereby yielding a faster and more accurate classifier model.
Keywords: Spam detection; Evaluation workbench; Feature selection; Machine learning
Introduction
Spam refers to unwanted activities, such as marketers sending members unsolicited advertisements, posting fake reviews, or stealing user information by directing users to malicious external pages [11]. As Social Network Services (SNS) have become an important mode of communication, they have attracted spammers who overwhelm users with unwanted content. Among these sites, Twitter, which was started in 2006, has grown to be one of the most popular SNS [22]. Every day, 500 million messages (called tweets) are produced by 328 million active Twitter users (called twitterers). Unlike on other popular SNS, tweets can be read by anyone, and people can follow a user without that user's consent. To attract users to their target websites, spammers post a large number of coordinated messages containing specific URLs, sometimes describing them with unrelated words [26]. Because SNS help build intrinsic trust among their users, 45% of users will click on links posted by their online friends even though they do not know those people in real life [24]. Because a tweet can contain at most 140 characters, twitterers also tend to post shortened URLs and write in abbreviated forms that rarely appear in conventional text documents or e-mails. Consequently, it is difficult for users to know the source URL and identify its content without clicking the link and loading the page. The noisy, unstructured, and informal expressions used in tweets, such as "2mo is a new daaaaay!" or "TIL DC Comics stands for Detective Comics", also make it difficult for automatic spam detection systems to accurately identify the semantic meaning of the tweets. Hence, social spam is more harmful and complex than SMS, email, or Web spam. It is bec...