2019
DOI: 10.1145/3291061

A Close Look at a Daily Dataset of Malware Samples

Abstract: The number of unique malware samples is growing out of control. Over the years, security companies have designed and deployed complex infrastructures to collect and analyze this overwhelming number of samples. As a result, a security company can collect more than 1M unique files per day only from its different feeds. These are automatically stored and processed to extract actionable information derived from static and dynamic analysis. However, only a tiny amount of this data is interesting for security resear…

Cited by 49 publications (32 citation statements)
References 20 publications
“…As such, we wanted our dataset to mimic those that are regularly analyzed by security companies. To satisfy this requirement, for our malicious dataset we downloaded fresh samples submitted to VirusTotal, i.e., samples observed for the first time on the same day in which we downloaded and analyzed them (this is what Ugarte-Pedrero et al [98] call the catch of the day). This was the only criteria we used for our selection.…”
Section: Sample Selection (mentioning)
confidence: 99%
“…We gathered 237,288 hashes from VT. We attempted to balance the presence of families and variants to avoid over-representing certain prolific families. Malware feeds like VT usually distribute large numbers of variants of the same polymorphic families [68]. The prevalence of a given malware family in such a feed is not necessarily related to its freshness, impact, or prevalence in the wild.…”
Section: A Dataset Composition (mentioning)
confidence: 99%
“…By leveraging this sample selection method, we avoided the over-representation of prominent polymorphic families (such as worms or file infectors like Virut or Allaple) [68] or identical variants of the same family. The removal of these variants balanced the representation of every family in the dataset.…”
Section: A Dataset Composition (mentioning)
confidence: 99%
“…Cybersecurity is a highly dynamic field; with threat behavior constantly evolving [7], Trend reports published by antivirus companies show that the number of unique malicious executable files has risen from less than one million to over one billion between 2008 and 2014 [8,9]. Security companies that analyse malware now routinely collecting over one million unique files per day [3]. Threat sophistication and anomaly distributions also vary significantly by context; for example certain industries may be highly attractive as targets for hackers, and may therefore exhibit higher frequencies or specific types of malware.…”
Section: State of the Art (mentioning)
confidence: 99%