A method is developed for assessing the practical persistence of obfuscating transformations of programs based on the calculation of the similarity index for the original, obfuscated and deobfuscated programs. Candidates are proposed for similarity indices, which are based on such program characteristics as the control flow graph, symbolic execution time and degree of coverage for symbolic execution. The control flow graph is considered as the basis for building other candidates for program similarity indicators. On its basis, a new candidate is proposed for the similarity index, which, when calculated, finds the Hamming distance between the adjacency matrices of control flow graphs of compared programs. A scheme for estimating (analyzing) the persistence of obfuscating transformations is constructed, according to which for the original, obfuscated and deobfuscated programs, the characteristics of these programs are calculated and compared in accordance with the chosen comparison model. The developed scheme, in particular, is suitable for comparing programs based on similarity indices. This paper develops and implements one of the key units of the constructed scheme - a block for obtaining program characteristics compiled for the x86/x86 64 architecture. The developed unit allow to find the control flow graph, the time for symbolic execution and the degree of coverage for symbolic execution. Some results of work of the constructed block are given.
The problem of constructing an algorithm for comparing two executable files is considered. The algorithm is based on the construction of similarity features vector for a given pair of programs. This vector is then used to decide on the similarity or dissimilarity of programs using machine learning methods. Similarity features are built using algorithms of two types: universal and specialized. Universal algorithms do not take into account the format of the input data (values of fuzzy hash functions, values of compression ratios). Specialized algorithms work with executable files and analyze machine code (using disassemblers). A total of 15 features were built: 9 features of the first type and 6 of the second. Based on the constructed training set of similar and dissimilar program pairs, 7 different binary classifiers were trained and tested. To build the training set, coreutils programs were used. The results of the experiments showed high accuracy of models based on random forest and k nearest neighbors. It was also found that the combined use of features of both types can improve the accuracy of classification.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.