Twitter is a social networking website that has gained wide popularity around the world over the last decade. This popularity has made Twitter a common target for spammers and malicious users, who use it to spread unwanted advertisements, viruses and phishing attacks. In this article, we review recent research to determine the most effective features that have been investigated for spam detection in the literature. These features are collected to build a comprehensive data set that can be used to develop more robust and accurate spammer detection models. The new data set is tested using popular classifiers (naive Bayes, support vector machines, multilayer perceptron neural networks, decision trees, random forests and k-nearest neighbour). The prediction performance of these classifiers is evaluated and compared using different evaluation metrics. Moreover, a further analysis is carried out to identify the features that have the greatest impact on the accuracy of spam detection. Three different techniques are used and compared for this analysis: change of mean square error (CoM), information gain (IG) and the Relief-F method. The top five features identified by each technique are then used to rebuild the detection models. Experimental results show that most of the developed classifiers achieved high evaluation results on the comprehensive data set constructed in this work. Experiments also reveal the important role of some features, such as the reputation of the account, the average length of the tweet, the average number of mentions per tweet, the age of the account, and the average time between posts, in identifying spammers in the social network.
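As a rough illustration of the information gain (IG) ranking mentioned in the abstract above, the following Python sketch scores two binary features against a spam/ham label and orders them by IG. The accounts, the feature names `low_reputation` and `long_tweets`, and the binarisation of the features are all hypothetical assumptions made for this example; they are not taken from the paper's data set or method.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after
    splitting the samples by the (discrete) feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: six accounts, three spammers and three legitimate users.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
features = {
    # Assumed binary flags, invented for illustration only.
    "low_reputation": [1, 1, 1, 0, 0, 0],
    "long_tweets": [1, 0, 1, 0, 1, 0],
}

scores = {name: information_gain(vals, labels) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy case `low_reputation` separates the two classes perfectly and so ranks first, while `long_tweets` carries little information about the label. A real study would apply the same idea to continuous features, which usually requires discretisation first.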
Twitter is one of the most popular microblogging and social networking platforms, where a massive number of instant messages (i.e. tweets) are posted every day. Twitter sentiment analysis tackles the problem of analyzing users' tweets in terms of thoughts, interests and opinions in a variety of contexts and domains. Such analysis can be valuable for researchers and applications that require an understanding of people's views about a particular topic or event. The study carried out in this paper provides an overview of the algorithms and approaches that have been used for sentiment analysis on Twitter. The reviewed articles are grouped into four categories based on the approach they use. Furthermore, we discuss directions for future research on how Twitter sentiment analysis approaches can utilize theories and technologies from other fields such as cognitive science, the Semantic Web, big data and visualization.
Spam is no longer just unsolicited commercial email that wastes our time; it also consumes network traffic and mail servers' storage. Furthermore, spam has become a major component of several attack vectors, including phishing, cross-site scripting, cross-site request forgery and malware infection. Statistics show that the amount of spam containing malicious content has increased relative to spam advertising legitimate products and services. In this paper, the issue of spam detection is investigated with the aim of developing an efficient method to identify spam email based on an analysis of the content of email messages. We identify a feature set that includes a considerable number of malicious-related features. Our goal is to study the effect of these features in helping classical classifiers identify spam emails. To make the problem more challenging, we developed spam classification models based on imbalanced data, where spam emails form the rare class with only 16.5% of the total emails. Different metrics were used to evaluate the developed models. Results show a noticeable improvement in spam classification models when they are trained on a data set that includes malicious-related features.
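The reason several metrics are needed on imbalanced data can be sketched with a short Python example. It assumes a hypothetical set of 1,000 emails with the 16.5% spam share stated in the abstract above, and evaluates a trivial baseline that labels every email as legitimate. The function name and the confusion-matrix counts are illustrative assumptions, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix
    counts, treating spam as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 1,000 emails, 165 spam (16.5%) and 835 legitimate.
# Baseline that predicts "legitimate" for everything: no spam is caught.
baseline = classification_metrics(tp=0, fp=0, fn=165, tn=835)
print(baseline)
```

The baseline reaches 83.5% accuracy while its recall and F1 on the spam class are zero, which is why accuracy alone says little about a model trained on such a data set.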
This paper aims to design and implement an automatic heart disease diagnosis system using MATLAB. The Cleveland data set for heart diseases was used as the main database for training and testing the developed system. Two systems were developed and trained and tested on the Cleveland data set. The first system is based on the Multilayer Perceptron (MLP) structure of the Artificial Neural Network (ANN), whereas the second system is based on the Adaptive Neuro-Fuzzy Inference Systems (ANFIS) approach. Each system has two main modules, namely training and testing, for which 80% and 20% of the Cleveland data set were randomly selected, respectively. Each system also has an additional module, known as the case-based module, in which the user inputs values for the 13 attributes specified by the Cleveland data set in order to test whether heart disease is present or absent in a particular patient. In addition, the effects of different values for important parameters were investigated in the ANN-based and Neuro-Fuzzy-based systems in order to select the parameters that give the highest performance. The experimental work shows that the Neuro-Fuzzy system outperforms the ANN system on the training data set, with accuracies of 100% and 90.74%, respectively. On the testing data set, however, the ANN system outperforms the Neuro-Fuzzy system, with best accuracies of 87.04% and 75.93%, respectively. M. A. M. Abushariah et al.
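The random 80%/20% split described in the abstract above can be sketched as follows. The paper's system was built in MATLAB; this Python version is only an illustration, and the function name, the fixed seed, and the placeholder records (100 integers standing in for patient rows) are assumptions made for the example.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and
    testing subsets according to train_fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records: in practice each record would hold the 13
# attributes and the diagnosis label of one patient.
records = list(range(100))
train, test = split_train_test(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two subsets, so the testing accuracy is measured on cases the model did not see during training.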