Detecting Malicious URLs Using Lexical Analysis

Mamun, Mohammad Saiful Islam; Rathore, Muhammad Ahmad; Lashkari, Arash Habibi; Stakhanova, Natalia; Ghorbani, Ali A.

doi:10.1007/978-3-319-46298-1_30

Cited by 127 publications

(73 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…Several studies analyzed the importance and impact of the features used for learning [2], [14], [15], [17], [18], [20], [22], [26], [28], [29], [37], [40], [44], [56], [64]. Table 2 summarizes them from four aspects: 1) feature ranking methods, 2) top five features, and 3) dataset ratio and 4) dataset sources.…”

Section: B Feature Importancementioning

confidence: 99%

An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs

et al. 2020

View full text Add to dashboard Cite

We perform an in-depth, systematic benchmarking study and evaluation of phishing features on diverse and extensive datasets. We propose a new taxonomy of features based on the interpretation and purpose of each feature. Next, we propose a benchmarking framework called 'PhishBench,' which enables us to evaluate and compare the existing features for phishing detection systematically and thoroughly under identical experimental conditions, i.e., unified system specification, datasets, classifiers, and evaluation metrics. PhishBench is a first in the field of benchmarking phishing related research and incorporates thorough and systematic evaluation and feature comparison. We use PhishBench to test methods published in the phishing literature on new and diverse datasets to check their robustness and scalability. We study how dataset characteristics, e.g., varying legitimate to phishing ratios and increasing the size of imbalanced datasets, affect classification performance. Our results show that the imbalanced nature of phishing attacks affects the detection systems' performance and researchers should take this into account when proposing a new method. We also found that retraining alone is not enough to defeat new attacks. New features and techniques are required to stop attackers from fooling detection systems. INDEX TERMS Feature engineering, feature taxonomy, framework, phishing email, phishing URL, phishing website.

show abstract

Section: B Feature Importancementioning

confidence: 99%

An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs

et al. 2020

View full text Add to dashboard Cite

show abstract

“…It is mostly infected and commonly used for malware detection and analysis [182]. The uniform resource locator (URL) dataset [165] contains instances of Internet traffic. It was mainly proposed to blacklist malicious URLs.…”

Section: B Com M Only Used Security Datasetsmentioning

confidence: 99%

A Survey on Machine Learning Techniques for Cyber Security in the Last Decade

et al. 2020

View full text Add to dashboard Cite

Pervasive growth and usage of the Internet and mobile applications have expanded cyberspace. The cyberspace has become more vulnerable to automated and prolonged cyberattacks. Cyber security techniques provide enhancements in security measures to detect and react against cyberattacks. The previously used security systems are no longer sufficient because cybercriminals are smart enough to evade conventional security systems. Conventional security systems lack efficiency in detecting previously unseen and polymorphic security attacks. Machine learning (ML) techniques are playing a vital role in numerous applications of cyber security. However, despite the ongoing success, there are significant challenges in ensuring the trustworthiness of ML systems. There are incentivized malicious adversaries present in the cyberspace that are willing to game and exploit such ML vulnerabilities. This paper aims to provide a comprehensive overview of the challenges that ML techniques face in protecting cyberspace against attacks, by presenting a literature on ML techniques for cyber security including intrusion detection, spam detection, and malware detection on computer networks and mobile networks in the last decade. It also provides brief descriptions of each ML method, frequently used security datasets, essential ML tools, and evaluation metrics to evaluate a classification model. It finally discusses the challenges of using ML techniques in cyber security. This paper provides the latest extensive bibliography and the current trends of ML in cyber security.

show abstract

“…A security database [5], [11], [12] that stores the known phishing attacks provides an ideal testbed for machine learning-based URL classification task [7] with a relatively closed environment. Various deep learning methods such as convolutional neural network (CNN) [1] and long short-term memory (LSTM) are proposed, as well as the LSTM-based generative adversarial network (GAN) [8] for exploit the class imbalance issue in the field of the phishing detection.…”

Section: Related Workmentioning

confidence: 99%

Integrating Deep Learning with First-Order Logic Programmed Constraints for Zero-Day Phishing Attack Detection

Cho

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Considering the fatality of phishing attacks that are emphasized by many organizations, the inductive learning approach using reported malicious URLs has been verified in the field of deep learning. However, the deep learning-based method mainly focused on the fitting of a classification task via historical URL observation shows a limitation of recall due to the characteristics of zero-day attack. In order to model the nature of a zero-day phishing attack in which URL addresses are generated and discarded immediately, an approach that utilizes the expert knowledge is promising. We introduce the integration method of deep learning and logic programmed domain knowledge to inject the real-world constraints. We design neural and logic classifiers and propose the joint learning method of each component based on the traditional neuro-symbolic integration. Extensive experiments on three real-world datasets consisting of 222,541 URLs showed the highest recall among the latest deep learning methods, despite the hostile class-imbalanced condition. We demonstrate that the optimized weighting between neural and logic component has an effect of improving the recall over 3% compared to the existing methods.

show abstract

Detecting Malicious URLs Using Lexical Analysis

Cited by 127 publications

References 18 publications

An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs

An In-Depth Benchmarking and Evaluation of Phishing Detection Research for Security Needs

A Survey on Machine Learning Techniques for Cyber Security in the Last Decade

Integrating Deep Learning with First-Order Logic Programmed Constraints for Zero-Day Phishing Attack Detection

Contact Info

Product

Resources

About