Voting-based Classification for E-mail Spam Detection

Al-Shboul, Bashar; Hakh, Heba; Faris, Hossam; Aljarah, Ibrahim; Alsawalqah, Hamad

doi:10.5614/itbj.ict.res.appl.2016.10.1.3

Cited by 16 publications

(8 citation statements)

References 22 publications


“…This step involved converting email messages into a format that could be processed by a machine learning algorithm. Email spam features are obtained from three different methods, namely, the Heuristic approach, Term frequency (TF) analysis, and behavior approach [27]. In the first approach, emails are mined to discover and generate patterns and rules, while in the TF analysis; every word in an e-mail is specified as a feature.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Low Time Complexity Model for Email Spam Detection using Logistic Regression

Mrisho, Ndibwile, Sam. 2021. IJACSA


Spam emails have recently become a concern on the Internet. Machine learning techniques such as Neural Networks, Naïve Bayes, and Decision Trees have frequently been used to combat these spam emails. Despite their efficiency, time complexity in high-dimensional datasets remains a significant challenge. Due to a large number of features in high-dimensional datasets, the intricacy of this problem grows exponentially. The existing approaches suffer from a computational burden when thousands of features are used (high time complexity). To reduce time complexity and improve accuracy in high-dimensional datasets, extra steps of feature selection and parameter tuning are necessary. This work recommends the use of a hybrid logistic regression model with a feature selection approach and parameter tuning that could effectively handle a high-dimensional dataset. The model employs the Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction method to mitigate the drawbacks of Term Frequency (TF) and obtain an equal feature weight. Using publicly available datasets (Enron and Lingspam), we compared the model's performance to that of other contemporary models. The proposed model achieved a low level of time complexity while maintaining a high spam detection rate of 99.1%.
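The difference between plain TF weighting and TF-IDF weighting mentioned in this abstract can be illustrated with a minimal sketch. It is not the cited paper's implementation: the toy e-mails, the function names, and the smoothed IDF variant log((1 + N) / (1 + df)) + 1 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each word in one tokenized e-mail."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def inverse_document_frequency(corpus):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant, assumed here)."""
    n_docs = len(corpus)
    # Document frequency: in how many e-mails each word appears at least once.
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tf_idf(corpus):
    """One {word: weight} dict per e-mail; words common across e-mails are down-weighted."""
    idf = inverse_document_frequency(corpus)
    return [
        {w: tf * idf[w] for w, tf in term_frequency(tokens).items()}
        for tokens in corpus
    ]


# Toy corpus, invented for illustration only.
emails = [
    "win free money now".split(),
    "meeting agenda for monday".split(),
    "free meeting room now".split(),
]
weights = tf_idf(emails)

# "win" and "free" have the same TF in the first e-mail, but "free" also
# occurs in the third e-mail, so TF-IDF gives it a lower weight than "win".
print(weights[0]["win"], weights[0]["free"])
```

With plain TF, every word in the first e-mail would receive the same weight of 0.25; the IDF factor is what separates words that are specific to one message from words that are spread across the corpus.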


“…This data set is available freely for research purposes (https://github.com/erayon/Email-spam-filter-naive-bayes-classifier-scikit-learntext-classification/tree/master/CSDMC2010_SPAM/CSDMC2010_SPAM, accessed 10 January 2019). This data set has been used in earlier research studies (Al-Shboul et al, 2016; Hijawi et al, 2017; Liu and Moh, 2016; Mercer, 2013, 2016). Characteristics of the data set are shown in Table 4.…”

Section: Data Set (mentioning)

confidence: 99%

A feature-centric spam email detection model using diverse supervised machine learning algorithms

Zamir, Khan, Mehmood, et al. 2020


Purpose: This research study proposes a feature-centric spam email detection model (FSEDM) based on content, sentiment, semantic, user and spam-lexicon feature sets. The purpose of this study is to exploit the role of sentiment features along with other proposed features to evaluate the classification accuracy of machine learning algorithms for spam email detection.

Design/methodology/approach: Existing studies primarily exploit a content-based feature engineering approach; however, only a limited number of features is considered. In this regard, this research study proposes a feature-centric framework (FSEDM) based on existing and novel features of the email data set, which are extracted after pre-processing. Afterwards, diverse supervised learning techniques are applied to the proposed features in conjunction with feature selection techniques such as information gain, gain ratio and Relief-F to rank the most prominent features and classify the emails as spam or ham (not spam).

Findings: Analysis and experimental results indicate that the proposed model with sentiment analysis is a competitive approach for spam email detection. Using the proposed model, a deep neural network applied with sentiment features outperformed other classifiers, with classification accuracy of up to 97.2%.

Originality/value: This research is novel in that no previous research focuses on sentiment analysis in conjunction with other email features for the detection of spam emails.
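Information gain, one of the feature selection techniques named in this abstract, ranks a feature by how much knowing its value reduces the entropy of the class label. The sketch below is a generic illustration, not the cited study's code; the binary features and labels are invented, and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(label) minus the expected entropy of the label given the feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data, invented for illustration: four e-mails, two binary features.
labels = ["spam", "spam", "ham", "ham"]
features = {
    "contains_free": [1, 1, 0, 0],   # perfectly separates spam from ham
    "has_attachment": [1, 0, 1, 0],  # carries no information about the label
}

# Rank features from most to least informative.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
print(ranking)
```

In this toy case the first feature has an information gain of 1 bit and the second has 0 bits, so a ranking-based selector would keep the first and could discard the second.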


“…Vote ensemble utilizes several combination algorithms to make its predictions. These combination rules include: Average Probabilities, Minimum Probabilities, Maximum Probabilities, Product of Probabilities, and Majority Voting [24]. It creates a series of classifiers and then predicts based on either the mode or mean of the base classifiers.…”

Section: Vote Ensemble (mentioning)

confidence: 99%

“…It creates series of classifiers and then predicts based on either the mode or mean of the base classifiers. Majority Voting has been used majorly for prediction as the output of the ensemble or classification is the label with the highest number of votes from the base classifiers [24]-[28]. It can also be weighted [30], that is, assigning more weight on classifiers which are more likely correct [31].…”

Section: Vote Ensemble (mentioning)

confidence: 99%
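The combination rules listed in the two statements above (average, minimum, maximum and product of probabilities, plus plain and weighted majority voting) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used in any of the cited papers: the `combine` function, its parameters, and the three base-classifier outputs are hypothetical.

```python
from collections import Counter
from math import prod


def combine(probabilities, rule="average", weights=None):
    """Combine per-classifier class probabilities into one predicted label.

    probabilities: list of dicts {label: probability}, one per base classifier.
    rule: "average", "minimum", "maximum", "product" or "majority".
    weights: optional per-classifier vote weights, used only by "majority".
    """
    if rule == "majority":
        weights = weights or [1.0] * len(probabilities)
        votes = Counter()
        for p, w in zip(probabilities, weights):
            # Each classifier votes for its most probable label.
            votes[max(p, key=p.get)] += w
        return max(votes, key=votes.get)

    reducers = {
        "average": lambda xs: sum(xs) / len(xs),
        "minimum": min,
        "maximum": max,
        "product": prod,
    }
    reduce_fn = reducers[rule]
    labels = probabilities[0].keys()
    scores = {label: reduce_fn([p[label] for p in probabilities]) for label in labels}
    return max(scores, key=scores.get)


# Hypothetical outputs of three base classifiers for one e-mail.
predictions = [
    {"spam": 0.90, "ham": 0.10},  # very confident "spam"
    {"spam": 0.40, "ham": 0.60},  # mild "ham"
    {"spam": 0.45, "ham": 0.55},  # mild "ham"
]

for rule in ("average", "minimum", "maximum", "product", "majority"):
    print(rule, combine(predictions, rule))

# Weighted majority voting: trust the first classifier three times as much.
print("weighted", combine(predictions, "majority", weights=[3, 1, 1]))
```

The example is chosen so that the rules disagree: the probability-based rules follow the one confident classifier and predict "spam", unweighted majority voting follows the two mild "ham" votes, and weighting the first classifier more heavily flips the majority back to "spam".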

Stacked Ensemble for Bioactive Molecule Prediction

Petinrin, Saeed. 2019. IEEE Access


Bioactive molecular compounds are essential for drug discovery. The biological activity of these compounds needs to be predicted as this is used to determine the drug-target ability. As ineffective drugs are discarded after production, leading to resource and time wastage, it is important to predict bioactive molecules with models having high predictive performance. This study utilizes the stacked ensemble, which uses the predictions of multiple base classifiers as features to train a meta classifier that makes the final prediction. Using three datasets, DS1, DS2, and DS3, obtained from the MDL Drug Data Report (MDDR) database, the performance of the stacked ensemble was compared to three other ensembles (adaboost, bagging, and vote ensemble) based on different evaluation criteria and also a statistical method, Kendall's W test. The accuracy of the Stacked ensemble was 96.7002%, 98.2260% and 94.9007% for the three datasets respectively, although Vote had the best accuracy using dataset DS2, which consists of structurally homogeneous bioactive molecules. Also, using Kendall's W test to rank the ensembles, the Stacked ensemble was ranked best with datasets DS1 and DS3, with both having a mean average of 4.00 and an overall level of agreement, W, of 0.986 and 1.000 respectively. Using dataset DS2, it was ranked after Vote and Adaboost with a mean average of 2.33 and an overall level of agreement, W, of 0.857. The Stacked ensemble is recommended for the prediction of heterogeneous bioactive molecules during drug discovery and can also be implemented in other research areas.

Index Terms: Bioactive molecule prediction, chemoinformatics, drug discovery, ensemble, stacked ensemble.
