<div align="center"><span lang="EN-US">Malware is a nuisance for smartphone users. The impact is detrimental to smartphone users if the smartphone is infected by malware. Malware identification is not an easy process for ordinary users due to its deeply concealed dangers in application package kit (APK) files available in the Android Play Store. In this paper, the challenges of creating malware datasets are discussed. Long before a malware classification process and model can be built, the need for datasets with representative features for most types of <span>malwares has to be addressed systematically. Only after a quality data set is available can a quality classification model be obtained using machine learning (ML) or deep learning (DL) algorithms. The entire malware classification process is</span> a full pipeline process and sub processes. The authors purposefully focus on the process of building quality malware datasets, not on ML itself, because implementing ML requires another effort after the reliable dataset is fully built. The overall step in creating the malware dataset starts with the extraction of the Android Manifest from the APK file set and ends with the labeling method for all the extracted APK files. The key contribution of this paper is on how to generate datasets systematically from any APK file.</span></div>