Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
However, labeled data sets are necessary for training supervised data mining methods (e.g. classification algorithms) and provide the basis for evaluating the performance of supervised as well as unsupervised data mining algorithms.
Objective. Large training data sets with high variance can increase the robustness of anomaly-based intrusion detection methods. Therefore, we intend to build a generative model which allows us to generate realistic flow-based network traffic. The generated data can be used to improve the training of anomaly-based intrusion detection methods as well as for their evaluation. To that end, we propose an approach that is able to learn the characteristics of collected network traffic and generates new flow-based network traffic with the same underlying characteristics.
Approach and Contributions. Generative Adversarial Networks (GANs) [4] are a popular method to generate synthetic data by learning from a given set of input data. GANs consist of two networks, a generator network G and a discriminator network D. The generator network G is trained to generate synthetic data from noise. The discriminator network D is trained to distinguish generated synthetic data from real world data. The generator network G is trained by the output signal gradient of the discriminator network D. G and D are trained iteratively until the generator network G is able to fool the discriminator network D. GANs achieve remarkably good results in image generation [5,6,7,8]. Furthermore, GANs have also been used for generating text [9] or molecules [10]. This work uses GANs to generate complete flow-based network traffic with all typical attributes. To the best of our knowledge, this is the first work that uses GANs for this purpose. GANs can only process continuous input attributes. This poses a major challenge since flow-based network data consist of continuous and categorical attributes. Consequently, we analyze different preprocessing strategies to transform categorical attributes of flow-based network data into continuous attributes. The first method simply ...
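The alternating training of G and D described above can be illustrated with a deliberately tiny sketch. It is not the architecture used in this work: it assumes a single continuous attribute, a linear generator G(z) = a*z + b, a logistic discriminator D(x) = sigmoid(w*x + c), and a hypothetical "real" distribution (a Gaussian with mean 4.0). All names (`train_toy_gan`, `sigmoid`) and hyperparameters are illustrative assumptions. The generator update uses the gradient that flows back through the discriminator output, mirroring the description in the text.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (assumed form: G(z) = a*z + b)
    w, c = 0.0, 0.0  # discriminator parameters (assumed form: sigmoid(w*x + c))
    for _ in range(steps):
        # Hypothetical real data: one continuous attribute around 4.0.
        real = [rng.gauss(4.0, 0.5) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += -(1.0 - d) * x
            gc += -(1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            gw += d * x
            gc += d
        w -= lr * gw / batch
        c -= lr * gc / batch

        # Generator step: minimise -log D(G(z)); the gradient with respect
        # to the generated sample is passed back through the discriminator.
        ga = gb = 0.0
        for z in noise:
            x = a * z + b
            d = sigmoid(w * x + c)
            dx = -(1.0 - d) * w
            ga += dx * z
            gb += dx
        a -= lr * ga / batch
        b -= lr * gb / batch
    return (a, b), (w, c)


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print("generator:", a, b)
    print("discriminator:", w, c)
```

On this toy problem one would hope that the generator offset `b` drifts toward the mean of the real data, but plain GAN training is not guaranteed to converge, so the printed values should be read as a diagnostic rather than a result. The sketch also makes the stated limitation concrete: every quantity passed to G and D is a real number, which is why categorical flow attributes need to be mapped to continuous values first.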