A hybrid BSO-Chi2-SVM approach to Arabic text categorization

Belkebir, Riadh; Guessoum, Ahmed

doi:10.1109/aiccsa.2013.6616437

Cited by 22 publications

(19 citation statements)

References 13 publications


“…Some researchers have investigated these approaches to classify Arabic content from websites [12]. For instance, techniques like weighting Arabic words from websites have been used to predict Arabic spammers on websites.…”

Section: B Spam Detection and Machine Learning - mentioning

confidence: 99%

Detection of Abusive Accounts with Arabic Tweets

Abozinadah¹,

Mbaziira²,

Jones³

2015

IJKE


Abstract: Twitter is one of the most popular sources for disseminating news and propaganda in the Arab region. Spammers are now creating abusive accounts to distribute adult content in Arabic tweets, which is prohibited by Arabic norms and cultures. Arab governments are facing a massive challenge to detect these accounts. This paper evaluates different machine learning algorithms for detecting abusive accounts with Arabic tweets, using Naïve Bayes (NB), Support Vector Machine (SVM), and Decision Tree (J48) classifiers. We are not aware of another existing data set of abusive accounts with Arabic tweets, and this is the first study to investigate this issue. The data set for this analysis was collected based on the top five Arabic swearing words. The results show that the Naïve Bayes (NB) classifier with 10 tweets and 100 features has the best performance, with a 90% accuracy rate.

Index Terms: Arabic text classification, machine learning, pornographic spam, social network abuse.

Manuscript received March 12, 2015; revised June 9, 2015. The authors are with Computer Science Department, George Mason University, Fairfax, VA 22030 USA (e-mail: eabozina@gmu.edu, ambaziir@gmu.edu, jjonesu@gmu.edu).

I. INTRODUCTION

Twitter is a micro-blogging provider where users compose messages of not more than 140 characters. These messages are called tweets, and may contain text, pictures, videos or hyperlinks. The usernames in Twitter start with a prefix (@). Twitter users create their social networks through followers and following relationships. Tweets will be posted on the user's and the followers' timelines and can be found by Twitter's search engine. The tweets can be forwarded to the user's followers by clicking "Retweet". At the same time, the tweet can be replied to by including the username prefixed by @ in the tweet. The tweets' topics can be indexed using hashtags for each topic. All hashtags in Twitter are preceded with the hash (#) symbol and can also be searched through Twitter's search engine.

Since the 2011 Arab Spring, the number of Twitter users in Arab nations has been escalating. Twitter has registered five million active users in Arab countries, who send on average 17 million tweets a day.

Twitter, like other social media, is a popular medium for disseminating news and propaganda. Consequently, spammers are exploiting Twitter's popularity in the Middle East to disseminate malicious content. These mal-actors have opened up Twitter accounts to launch spamming campaigns targeting Arabic speakers within the 22 nations in the Middle East. Some of the Arab nations have attempted, but failed, to censor Internet traffic to block malicious URLs and contents from abusive social media accounts. These attempts have failed because spam detection tools trained in the English language are being implemented on Arabic spam [4], [5]. Spammers are exploiting this loophole to launch successful spam campaigns.

In the meantime, the number of abusive accounts has been increasing over time by exploiting the simplicity of using emails as a verificati...


“…Using FS, the discriminating power of each term is computed, and only the top-scoring ones are used to build the classifier. Several FS methods are used in the literature of Arabic TC research, like Cross Validation [3], Chi Square (CHI) [5,6,16,55-58], Information Gain (IG) [7,45,55], Document Frequency (DF) [45,55], Mutual Information (MI) [45], Correlation Coefficient (CC) [45], Binary Particle Swarm Optimization-K-Nearest-Neighbor (BPSO-KNN) [9], Semi-Automatic Categorization Method (SACM) and Automatic Categorization Method (ACM) [59]. On the other hand, [60] selected features randomly and [15] didn't apply FS at all.…”

Section: A Feature Selection (FS) - mentioning

confidence: 99%

“…Chi Square (CHI) is used in the experiments of this research as a FS metric for selecting the most discriminating features in the dataset. CHI has proved to record high accuracy in classifying both English [7,6,16,61-66] and Arabic [5,6,16,55-58] texts. The CHI FS metric measures the lack of independence between a term and a class.…”

Section: A Feature Selection (FS) - mentioning

confidence: 99%
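To make the quoted description of CHI concrete, here is a minimal sketch of the standard chi-square term/class statistic computed from a 2x2 contingency table, followed by a top-k selection step. It is not taken from the cited paper or from the citing article: the function names `chi_square` and `top_k_terms`, the toy documents, and the tie-breaking rule are all assumptions made for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Term or class is absent (or universal): treat as independent.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def top_k_terms(docs, labels, target, k):
    """Rank terms by chi-square against the class `target` and keep k.

    docs is a list of token sets, labels the matching class labels.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        a = b = c = d = 0
        for doc, label in zip(docs, labels):
            has_term = term in doc
            in_class = label == target
            if has_term and in_class:
                a += 1
            elif has_term:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus (placeholder tokens, not real data).
docs = [{"goal", "match"}, {"goal", "team"},
        {"vote", "election"}, {"vote", "party"}]
labels = ["sport", "sport", "politics", "politics"]
selected = top_k_terms(docs, labels, "sport", 2)
```

Note that the score is symmetric: a term that appears only outside the class scores as high as one that appears only inside it, because both are equally far from independence.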

“…After deciding on the terms to be selected for building the classifier, the terms will be represented in the categorization system using one of the various representations or weights used in the literature of TC. [3,5,9,14,56,59], Term Frequency (TF) [14,15,55,57,58], Document Frequency (DF) [55], Weighted IDF [14], Normalized Frequency [7,16,60-64], Boolean [6,55,61,62,64] and other methods like Cosine coefficient, Dice coefficient and Jaccard coefficient [68]. In this research, Normalized frequency is used as a weighting scheme for term representation in the Vector Space Model.…”

Section: A Feature Selection (FS) - mentioning

confidence: 99%
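The statement above names normalized frequency as the term weight but does not define it. One common reading, assumed here, divides each term's count in a document by the count of the most frequent selected term in that same document, so weights fall in [0, 1]. The sketch below illustrates that reading only; the function name `normalized_frequency` and the sample tokens are hypothetical, and the citing paper may use a different normalization.

```python
from collections import Counter


def normalized_frequency(tokens, vocabulary):
    """Build one document vector for the Vector Space Model.

    Assumption: weight = count(term) / max count over the selected
    vocabulary within this document. Other normalizations (for example
    dividing by document length) are equally plausible.
    """
    counts = Counter(tokens)
    max_count = max((counts[t] for t in vocabulary), default=0)
    if max_count == 0:
        # None of the selected terms occur in this document.
        return [0.0] * len(vocabulary)
    return [counts[t] / max_count for t in vocabulary]


# Hypothetical document and selected feature list.
vector = normalized_frequency(
    ["goal", "goal", "team", "match"],
    ["goal", "team", "vote"],
)
```

Terms outside the selected vocabulary ("match" in the sample) are ignored, which mirrors the idea that only features kept by the selection step are represented.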


Arabic Text Categorization Using Logistic Regression

Al-Tahrawi¹

2015

IJISA


Several Text Categorization (TC) techniques and algorithms have been investigated in the limited research literature of Arabic TC. In this research, Logistic Regression (LR) is investigated in Arabic TC. To the best of our knowledge, LR was never used for Arabic TC before. Experiments are conducted on the Aljazeera Arabic News (Alj-News) dataset. Arabic text preprocessing is applied to this dataset to handle the special nature of Arabic text. Experimental results of this research show that the LR classifier is competitive with the state-of-the-art Arabic TC algorithms; it has recorded a precision of 96.5% on one category and above 90% for 3 of the five categories of the Alj-News dataset. Regarding the overall performance, LR has recorded a macro-average precision of 87%, recall of 86.33% and F-measure of 86.5%.
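As a reminder of what the macro-averaged figures in this abstract mean, the sketch below computes per-class precision, recall and F-measure and then takes their unweighted means over classes. It assumes the macro F-measure is the mean of per-class F-measures, which the abstract does not state; `macro_scores` and the sample labels are hypothetical and are not data from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F-measure).

    Each class contributes equally, regardless of how many documents
    it has. Undefined ratios (zero denominators) are counted as 0.0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f_measures = [], [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n


# Hypothetical gold labels and predictions for four documents.
macro_p, macro_r, macro_f = macro_scores(
    ["a", "a", "b", "b"],
    ["a", "b", "b", "b"],
)
```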


“…The authors in [25] compared three different approaches of Arabic TC: Artificial Neural Networks (ANN), SVMs and BSOCHI-SVM on the Open Source Arabic Corpora (OSAC). Two stemming approaches were used: light and root-based stemming.…”

Section: Related Work - mentioning

confidence: 99%

Polynomial Neural Networks versus Other Arabic Text Classifiers

Al-Tahrawi¹

2016

JSW


Many Text Classification (TC) algorithms have been proposed for Arabic TC. Polynomial Neural Networks (PNNs) were used recently in English TC, and have proved to be competitive with the state-of-the-art text classifiers in this field. Lately, they were proposed for classifying Arabic documents. In this research paper, an experimental study that directly compares PNNs against five well-known TC classification algorithms is conducted on the Aljazeera-News Arabic dataset. All experiments use the same TC settings, like preprocessing, Feature Selection (FS) and reduction criteria, feature weighting and classifier performance evaluation measures. These algorithms are: SVM (Support Vector Machines), NB (Naive Bayes), kNN (k-Nearest-Neighbor), LR (Logistic Regression) and RBF (Radial Basis Function networks). Results reached in this study reveal that PNNs are competitive classifiers in the field of Arabic TC.

