This article presents an algorithm that automatically analyzes a log message, transforms it into a fixed-length feature vector, and accumulates the resulting vectors into a single dataset. The resulting dataset is intended for use in machine-learning-based anomaly detection systems. An additional requirement is that the algorithm must handle the variety of protocols used to collect log messages in a computer system. These goals were achieved by developing a software package that collects and parses log messages in order to isolate and encode their features. The package can collect log messages from several sources: syslog, SNMP, SQL, and text and binary files. The article examines the data that can be extracted from the log messages of a computing system, describes the support for Lua scripts used for data enrichment, and defines the resulting list of features. It proposes a method for encoding text data extracted from log messages and an algorithm for transforming an arbitrary log message into a feature vector of fixed dimension. It also provides a methodology for forming a dataset for subsequent machine-learning training of an anomaly detection system in a computing system, and gives an example of a dataset storage structure.
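The idea of mapping an arbitrary log message to a vector of fixed dimension can be illustrated with a minimal sketch. The abstract does not specify the encoding method, so the example below swaps in a generic technique, feature hashing, purely for illustration. The names `VECTOR_SIZE`, `SEVERITIES`, `hash_bucket`, and `encode_message`, the choice of 16 slots, and the sample messages are all hypothetical and are not taken from the software package described here.

```python
import hashlib
import re
from typing import List

# Hypothetical fixed dimension of every feature vector (an assumption).
VECTOR_SIZE = 16

# Standard syslog severity keywords, ordered from most to least severe.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def hash_bucket(token: str, size: int) -> int:
    """Map a token to a stable bucket index in [0, size)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def encode_message(severity: str, text: str, size: int = VECTOR_SIZE) -> List[float]:
    """Encode one log message as a vector of exactly `size` numbers.

    Slot 0 holds the severity index (-1.0 if unknown), slot 1 holds the
    message length, and the remaining slots hold hashed token counts.
    This layout is an illustrative assumption, not the article's scheme.
    """
    vector = [0.0] * size
    vector[0] = float(SEVERITIES.index(severity)) if severity in SEVERITIES else -1.0
    vector[1] = float(len(text))
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        vector[2 + hash_bucket(token, size - 2)] += 1.0
    return vector


# Accumulate vectors from several (made-up) messages into one dataset.
messages = [
    ("err", "disk sda1 is full"),
    ("info", "user admin logged in from console"),
]
dataset = [encode_message(severity, text) for severity, text in messages]
```

Because every message, whatever its length or source, yields a vector of the same size, the rows of `dataset` can be stacked directly into a table for a learning algorithm. Hash collisions between tokens are a known cost of this shortcut; a larger vector size reduces them.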