“…As these numbers show, any approach that fails to consider packed benign samples when designing and evaluating a malware detection approach ultimately results in a substantial number of false positives on real-world data. This is especially a concern for machine-learning-based approaches, which, in the absence of reliable and fresh ground truth, frequently rely on labels from anti-malware products available on VirusTotal [22,43,86,88,97]. Given the disagreement of anti-malware products in labeling samples [42,48,65,99], a common practice is to sanitize a dataset, for example, by considering decisions from a selected set of anti-malware products, or, as another example, by using a voting-based consensus.…”
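The two sanitization strategies mentioned in the excerpt (restricting to a selected set of anti-malware products, and voting-based consensus) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the quoted work: the function name `consensus_label`, the engine names, the verdict encoding (`True` for a detection, `False` for clean, `None` for no verdict), and the vote thresholds are all hypothetical.

```python
from typing import Iterable, Mapping, Optional


def consensus_label(
    verdicts: Mapping[str, Optional[bool]],
    trusted_engines: Optional[Iterable[str]] = None,
    min_malicious_votes: int = 3,   # hypothetical threshold, not from the source
    max_benign_votes: int = 0,      # hypothetical threshold, not from the source
) -> Optional[str]:
    """Return 'malicious', 'benign', or None when the sample is ambiguous.

    verdicts maps an engine name to True (flagged), False (clean),
    or None (engine produced no verdict).
    """
    # Strategy 1: keep only decisions from a selected set of engines.
    if trusted_engines is not None:
        allowed = set(trusted_engines)
        verdicts = {name: v for name, v in verdicts.items() if name in allowed}

    # Ignore engines that returned no verdict at all.
    votes = [v for v in verdicts.values() if v is not None]
    if not votes:
        return None

    # Strategy 2: voting-based consensus over the remaining engines.
    detections = sum(votes)  # True counts as 1, False as 0
    if detections >= min_malicious_votes:
        return "malicious"
    if detections <= max_benign_votes:
        return "benign"
    return None  # in between: drop the sample from the dataset


# Hypothetical scan results for three samples.
sample_a = {"EngineA": True, "EngineB": True, "EngineC": False,
            "EngineD": True, "EngineE": None}
sample_b = {"EngineA": False, "EngineB": False, "EngineC": False}

label_a = consensus_label(sample_a)
label_b = consensus_label(sample_b)
label_a_restricted = consensus_label(sample_a, trusted_engines={"EngineA", "EngineC"})
```

A filter of this kind only aggregates the engines' verdicts, so the resulting labels are at best as reliable as the products that produced them.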