The Answer is in the Text: Multi-Stage Methods for Phishing Detection Based on Feature Engineering

Gualberto, Éder S.; Sousa, Rafael T. de; Vieira, Thiago P. De B.; Costa, João Paulo; Duque, Cláudio Gottschalg

doi:10.1109/access.2020.3043396

Cited by 33 publications

(25 citation statements)

References 53 publications


“…The TF-IDF [9] is a method often deployed in information retrieval, text mining, and in recent years it has been applied also in phishing email detection [10] [5]. We applied TF-IDF in the emails' clean text to extract the text-based features.…”

Section: Methods 1: Tf-idf (mentioning)

confidence: 99%

“…With 10 features their approach reached 99.95% accuracy with the XGBoost algorithm. An extension of their previous work [5] was presented in [10], where the authors developed a multi-stage approach to discover the purpose of the email (phishing or benign). The text-based features were extracted via the TF-IDF technique and two methods were employed to process the features further.…”

Section: Related Work (mentioning)

confidence: 99%

“…The second method accomplished the best performance with the XGBoost algorithm (100% accuracy and 100% F1-score) using the SpamAssassin 3 and Nazario datasets [28]. Both [5] and [10] attain comparable results on the same dataset that contains 4150 benign and 2279 phishing emails; however, in [5] the evaluation was performed using cross-validation, while in [10] the dataset was separated into 70% for training and 30% for testing (namely, 1261 benign and 678 phishing emails). Moreover, the evolution of phishing emails has not been considered in any of these researches as the deployed phishing emails are outdated.…”

Section: Related Work (mentioning)

confidence: 99%

“…In general the major limitations that were identified in the literature are: limited evaluation metrics or metrics that are inappropriate to measure the classifiers performance (e.g., in case of imbalanced data) [12] [25] [30] [11], the phishing emails that have been deployed for evaluation purposes are old [25] [15] [31] [5] [10], and in some works the considered textual features are not robust [26] [27]. Moreover, although NLP and ML have been utilized in phishing email detection for several years, the literature misses proofs regarding which NLP method works better for phishing email detection.…”

Section: Related Work (mentioning)

confidence: 99%
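
The point about metrics that are inappropriate for imbalanced data can be illustrated with a minimal sketch. The label counts below (95 benign, 5 phishing) and the always-benign classifier are hypothetical and do not come from any of the cited works; the helper `accuracy_and_f1` is likewise an illustrative name.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Hypothetical imbalanced test set: 95 benign (0) and 5 phishing (1) emails.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that labels every email as benign.
y_pred = [0] * 100

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Under these assumptions accuracy looks high although no phishing email is detected, which is why accuracy alone can mislead on imbalanced data.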

“…The Term Frequency-Inverse Document Frequency (TF-IDF) [9] is a well-known method to measure the significance of a word in a document. In the last couple of years it has been the most used NLP technique in the phishing email detection field, where it was deployed as a weighting factor of the words that appear in the email corpus [10] [5] [11] [12]. Word2Vec [13] is a popular method for the creation of word embeddings, namely vector representations of a word, which has seen a few applications in the phishing email detection for the identification of word associations between different emails of an email corpus [14] [15].…”

Section: Introduction (mentioning)

confidence: 99%
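
As a rough illustration of TF-IDF used as a word-weighting factor, the sketch below computes one common variant, tf = count / document length and idf = log(N / df). This is an assumption: the statements above do not say which variant (smoothing, normalisation) the cited works use, and the function name `tf_idf` and the sample emails are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical cleaned email bodies.
emails = [
    "verify your account now",
    "meeting notes for your team",
    "your invoice is attached",
]
weights = tf_idf(emails)
print(weights[0])
```

In this variant a term that occurs in every email (here "your") receives weight zero, while a term confined to one email (such as "verify") receives a positive weight.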


A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection

Bountakas; Koutroumpouchos; Xenakis

2021

Proceedings of the 16th International Conference on Availability, Reliability and Security


Phishing is the most-used malicious attempt in which attackers, commonly via emails, impersonate trusted persons or entities to obtain private information from a victim. Even though phishing email attacks have been a known cybercriminal strategy for decades, their usage has expanded over the last couple of years due to the COVID-19 pandemic, during which attackers exploit people's consternation to lure victims. Therefore, further research is needed in the phishing email detection field. Recent phishing email detection solutions that extract representational text-based features from the email's body have proved to be an appropriate strategy to tackle these threats. This paper proposes a comparison approach for the combined usage of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both of which comprised emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination in the balanced dataset proved to be Word2Vec with the Random Forest algorithm, while in the imbalanced dataset it was Word2Vec with the Logistic Regression algorithm.

