Phishing is the most-used malicious attempt in which attackers, commonly via emails, impersonate trusted persons or entities to obtain private information from a victim. Even though phishing email attacks are a known cybercriminal strategy for decades, their usage has been expanded over last couple of years due to the COVID-19 pandemic, where attackers exploit people's consternation to lure victims. Therefore, further research is needed in the phishing email detection field. Recent phishing email detection solutions that extract representational text-based features from the email's body have proved to be an appropriate strategy to tackle these threats. This paper proposes a comparison approach for the combined usage of Natural Language Processing (TF-IDF, Word2Vec, and BERT) and Machine Learning (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) methods for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both of which were comprised of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination in the balanced dataset proved to be the Word2Vec with the Random Forest algorithm, while in the imbalanced dataset the Word2Vec with the Logistic Regression algorithm.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.