Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, the popularity of Twitter has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., nonspam) labels, to understand spamming activities on Twitter. The primary focus of this paper is to analyze various aspects of spam on Twitter based on hashtags, tweet contents, and user profiles, which are useful for both tweet‐level and user‐level spam detection. First, we compare the usage of hashtags in spam and ham tweets based on frequency, position, orthography, and co‐occurrence. Second, for content‐based analysis, we analyze the variations in word usage, metadata, and near‐duplicate tweets. Third, for user‐based analysis, we investigate user profile information. In our study, we validate that spammers use popular hashtags to promote their tweets. We also observe differences in the usage of words in spam and ham tweets. Spam tweets are more likely to be emphasized using exclamation points and capitalized words. Furthermore, we observe that spammers use multiple accounts to post near‐duplicate tweets to promote their services and products. Unlike spammers, legitimate users are likely to provide more information such as their locations and personal descriptions in their profiles. In summary, this study presents a comprehensive analysis of hashtags, tweet contents, and user profiles in Twitter spamming.
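The surface cues mentioned in this abstract (hashtag frequency and position, exclamation points, capitalized words) can be illustrated with a small feature extractor. This is a minimal sketch, not the authors' analysis code: the function name `surface_features`, the regular expressions, and the choice to treat "capitalized words" as all-caps words of two or more letters are assumptions made here for illustration.

```python
import re

# Assumed, simplified patterns; real tweet tokenization is more involved.
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[A-Za-z]+")


def surface_features(tweet: str) -> dict:
    """Count simple emphasis and hashtag cues in one tweet (illustrative only)."""
    hashtags = HASHTAG_RE.findall(tweet)
    # Drop hashtags before counting words so "#WIN" is not counted as a capitalized word.
    words = WORD_RE.findall(HASHTAG_RE.sub(" ", tweet))
    # Assumption: "capitalized" means fully upper-case words of length >= 2.
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "num_hashtags": len(hashtags),
        "starts_with_hashtag": tweet.lstrip().startswith("#"),
        "num_exclamations": tweet.count("!"),
        "num_capitalized_words": len(caps),
    }


features = surface_features("#win FREE followers NOW!!! click here #follow")
```

Features like these could then be compared between the spam and ham portions of a labeled collection; the abstract does not specify how the authors computed them.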
Most existing techniques for spam detection on Twitter aim to identify and block users who post spam tweets. In this paper, we propose a Semi-Supervised Spam Detection (S3D) framework for spam detection at the tweet level. The proposed framework consists of two main modules: a spam detection module operating in real-time mode, and a model update module operating in batch mode. The spam detection module consists of four lightweight detectors: (i) a blacklisted domain detector to label tweets containing blacklisted URLs, (ii) a near-duplicate detector to label tweets that are near-duplicates of confidently pre-labeled tweets, (iii) a reliable ham detector to label tweets that are posted by trusted users and that do not contain spammy words, and (iv) a multi-classifier based detector to label the remaining tweets. The information required by the detection module is updated in batch mode based on the tweets that were labeled in the previous time window. Experiments on a large-scale dataset show that the framework adaptively learns patterns of new spam activities and maintains good accuracy for spam detection in a tweet stream.
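The four-detector cascade described in this abstract can be sketched as a chain of cheap checks that falls through to classifiers only when the earlier detectors do not fire. This is a hedged illustration, not the paper's implementation: `label_tweet`, `normalize`, the exact-match signature used as a stand-in for near-duplicate detection, and the majority vote in step (iv) are all assumptions introduced here.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def normalize(text: str) -> str:
    """Crude signature: lower-case, drop URLs, keep alphanumeric tokens.
    Assumption: stands in for whatever near-duplicate matching the paper uses."""
    text = URL_RE.sub(" ", text.lower())
    return " ".join(re.findall(r"[a-z0-9]+", text))


def label_tweet(tweet, user, blacklist, labeled_signatures,
                trusted_users, spammy_words, classifiers):
    # (i) Blacklisted domain detector.
    for url in URL_RE.findall(tweet):
        if urlparse(url).hostname in blacklist:
            return "spam"
    # (ii) Near-duplicate detector: reuse the label of a confidently labeled tweet.
    sig = normalize(tweet)
    if sig in labeled_signatures:
        return labeled_signatures[sig]
    # (iii) Reliable ham detector: trusted user and no spammy words.
    if user in trusted_users and not (set(sig.split()) & spammy_words):
        return "ham"
    # (iv) Multi-classifier detector. Assumption: simple majority vote.
    votes = [clf(tweet) for clf in classifiers]
    return "spam" if votes.count("spam") * 2 > len(votes) else "ham"


label = label_tweet(
    "FREE followers!!! Click here http://ok.example/a", "u1",
    blacklist={"bad.example"},
    labeled_signatures={"free followers click here": "spam"},
    trusted_users={"alice"},
    spammy_words={"free"},
    classifiers=[lambda t: "ham"],
)
```

In the framework as described, the blacklist, labeled signatures, trusted users, spammy words, and classifiers would be refreshed in batch mode from the previous time window's labels; the sketch simply takes them as arguments.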
The presence of a hyperlink in a tweet is a strong indication that the tweet is more informative. In this paper, we study the problem of hashtag recommendation for hyperlinked tweets (i.e., tweets containing links to Web pages). By recommending hashtags to hyperlinked tweets, we argue that the functions of hashtags, such as providing the right context to interpret the tweets, tweet categorization, and tweet promotion, can be extended to the linked documents. The proposed solution for hashtag recommendation consists of two phases. In the first phase, we select candidate hashtags through five schemes that consider similar tweets, similar documents, the named entities contained in the document, and the domain of the link. In the second phase, we formulate the hashtag recommendation problem as a learning-to-rank problem and adopt RankSVM to aggregate and rank the candidate hashtags. Our experiments on a collection of 24 million tweets show that the proposed solution achieves promising results.
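The second phase can be pictured as pooling the candidates from each scheme and sorting them by a linear score, which is how a trained linear RankSVM model is applied at ranking time. The sketch below only illustrates that scoring step under stated assumptions: the function `rank_hashtags`, the scheme names, the per-scheme scores, and the weights are hypothetical, and no RankSVM training is performed.

```python
from collections import defaultdict


def rank_hashtags(candidates_by_scheme, weights, top_k=5):
    """Pool candidates from several schemes and rank by a weighted sum.

    candidates_by_scheme: {scheme_name: {hashtag: score}} (hypothetical format)
    weights: {scheme_name: weight}, standing in for learned RankSVM weights
    """
    features = defaultdict(dict)
    for scheme, scored in candidates_by_scheme.items():
        for tag, value in scored.items():
            features[tag][scheme] = value

    def score(tag):
        # Linear scoring: missing schemes contribute zero.
        return sum(weights.get(s, 0.0) * v for s, v in features[tag].items())

    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(features, key=lambda t: (-score(t), t))[:top_k]


ranked = rank_hashtags(
    {
        "similar_tweets": {"#ai": 0.8, "#ml": 0.4},
        "named_entities": {"#ml": 1.0, "#google": 1.0},
    },
    weights={"similar_tweets": 1.0, "named_entities": 0.5},
)
```

Here a hashtag proposed by more than one scheme accumulates score from each, which is one simple way to aggregate candidates; the abstract does not state which features the authors feed to RankSVM.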