Cyber phishing is the theft of personal information in which phishers, also known as attackers, lure users into surrendering sensitive data such as credentials, credit card and bank account information, financial details, and other behavioral data. Phishers pose as reliable sources in order to gain access to, or control of, users' systems. Phishing detection is becoming a crucial research area, attracting increased focus as the number of phishing attacks grows alongside the rapid growth of e-commerce and internet transactions. Because phishers keep devising new techniques, phishing detection has become a primary concern of developers, and researchers have little choice but to explore every possible detection technique. Detection mechanisms therefore span a wide variety of techniques, since no one can be sure which techniques phishers will come up with next. Phishing detection thus remains an interesting yet challenging problem.
We focus on URL (Uniform Resource Locator)-based phishing detection techniques, since the URL is a significant criterion for preventing phishing attacks without accessing the webpage directly. Our hypothesis is that phishers create fake websites with as little content as possible, showing only a few words on the webpage. When a webpage shows so little content, detection approaches based on content or visual similarity cannot retrieve enough features from it. To overcome this limitation, we focus on URL-based detection, which extracts features by analyzing the URL alone, without accessing the webpage.
Since previous works extract features from specific special characters, we assume that the distribution of non-alphanumeric (NAN) characters has a strong impact on phishing URL detection.
Our contribution is a new feature, the entropy of NAN characters, which we compare with previously used features. Note that these previous features do not come from one specific work but have been applied in several works. We also emphasize feature engineering, because the choice of features (NAN characters in our work) has the greatest effect on performance. As it is difficult to obtain exactly the same datasets used by previous works, we work on our own datasets and compare the previous features with our contributed feature on them. We work on two datasets (balanced and imbalanced) and perform feature selection and hyperparameter tuning. We achieve a ROC_AUC of 96% on the balanced dataset and 89% on the imbalanced dataset, outperforming the 87% and 84% obtained on the balanced and imbalanced datasets, respectively, with the previously used features. Finally, we summarize our findings and offer suggestions for improving phishing detection.
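As an illustration of the entropy of NAN characters, the sketch below computes a Shannon entropy over the non-alphanumeric characters of a URL. The abstract does not state the exact definition, so the function name `nan_entropy`, the base-2 logarithm, and the rule that every character that is neither a letter nor a digit counts as a NAN character are assumptions made here for illustration only.

```python
import math
from collections import Counter


def nan_entropy(url: str) -> float:
    """Shannon entropy, in bits, of the non-alphanumeric characters in `url`.

    Assumption (not taken from the paper): the NAN characters of the URL are
    treated as an empirical distribution and the entropy uses log base 2.
    Returns 0.0 when the URL contains no NAN characters.
    """
    # Keep only characters that are neither letters nor digits.
    # Note: str.isalnum() is Unicode-aware, so non-ASCII letters are excluded too.
    nan_chars = [ch for ch in url if not ch.isalnum()]
    if not nan_chars:
        return 0.0

    total = len(nan_chars)
    counts = Counter(nan_chars)

    # H = sum over symbols of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical example URLs, used only to show how the function is called.
    for example in ("http://example.com/login",
                    "http://example.com/a-b_c?x=1&y=2"):
        print(example, round(nan_entropy(example), 3))
```

Under these assumptions, a URL whose NAN characters are all the same symbol scores 0, and the score grows as the URL mixes more distinct symbols in more even proportions. The resulting value can be used as one numeric column alongside other URL features.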