2021
DOI: 10.1016/j.engappai.2021.104347

Towards benchmark datasets for machine learning based website phishing detection: An experimental study

citations
Cited by 55 publications
(40 citation statements)
references
References 22 publications
“…Most of the datasets are not suitable for replication studies because the URLs used to build the dataset (i.e., short-lived websites) cannot be accessed easily and many studies used self-collected datasets using different sources. To address this problem, recently Hannousse and Yahiouche [33] designed a construction scheme of reproducible datasets, which are also extensible. Their strategy creates a balanced datasets because many available datasets are imbalanced and it was reported that imbalanced datasets may reduce the performance between 5.9 and 42% in term of F1 score [21].…”
Section: Results (mentioning)
confidence: 99%
“…Their strategy creates a balanced datasets because many available datasets are imbalanced and it was reported that imbalanced datasets may reduce the performance between 5.9 and 42% in term of F1 score [21]. Hannousse and Yahiouche [33] created a sample dataset by using their set of guidelines to demonstrate the applicability of the approach and showed that Random Forest algorithm works best on this dataset, however, they did not apply deep learning algorithms and planned to analyze them in future work.…”
Section: Results (mentioning)
confidence: 99%
“…We agree that machine learning is a promising approach for the detection of cybersecurity attacks including XSS. However, three main obstacles hinder their usage in practice: lack of interpretability of their predictions, their susceptibility to adversarial attacks and lack of benchmarks [176,177]. The effective adoption of machine learning for cybersecurity attack detection requires to properly mitigate these three problems.…”
Section: Limitations Of Attack Detection Techniques (mentioning)
confidence: 99%
“…Additionally, benchmarking is a long standing problem for machine learning based proposals. Benchmarks enable fair comparison of existing solutions and the development of more robust models [177]. In [133], the authors cooperate by proposing a new oversampling algorithm for XSS datasets.…”
Section: Limitations Of Attack Detection Techniques (mentioning)
confidence: 99%
“…People's life has become more dependent on cyberspace, data on the cloud, social media networking, web transactions, e-healthcare, e-business, e-learning and education, and e-government services over the last decade, particularly during the coronavirus disease 2019 (COVID-19) epidemic [7]. Active cyber-space users have surpassed 4.66 billion (or 59.5% of the global population) through the channels and services as reported by World Digital Population Report 2021 [8]. Eventually, a lot of sensitive data has been transmitted and stored via cloud computing, which has given hackers many opportunities to impersonate trustworthy enterprises and services to intrude on computer-based systems and mobile platforms illegally using social engineering mimics [9].…”
Section: Introduction (mentioning)
confidence: 99%