2021
DOI: 10.1145/3446905

On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection

Abstract: Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded experimental results and insights. In this article, we perform extensive experiments to measure the performance gap that occurs when datasets are de-duplicated. Our…

Cited by 51 publications (31 citation statements)
References 74 publications
“…rough this process, it classifies samples and regresses a binary division into classification, continuation, or numerical types. As a result of this study, the DT algorithm was used the fourth most [32,37,39,44,46,47,53,56,58,59,62,64,65,68,71,72,75,76,78,85,86,92,96,98,101,111,120–134].…”
Section: 31 (mentioning)
confidence: 99%
“…What's more, attackers continue to update their fraud techniques to bypass protection software as well as well-trained machine learning models in order to victimize users and businesses. In front of the increasing difficulty of Android malware defenses, it is non-trivial to construct a robust and transparent defense model or system only by traditional machine learning techniques [191].…”
Section: Introduction (mentioning)
confidence: 99%
“…Many other researchers [7,30,44] also point out the misalignment between code comments and natural user queries, and report it as a threat to the validity of their approaches. As mentioned in [31,57], improving the quality of the training data is still a research opportunity for machine learning, including DL-based code search models. Considering that there are still plenty of comments close to actual user queries and naturally paired with high-quality code snippets, a promising solution is to filter out the noisy ones.…”
Section: Introduction (mentioning)
confidence: 99%