Datasets are not Enough: Challenges in Labeling Network Traffic

Guerra, J.E. Castillo; Catania, Carlos; Veas, Eduardo

doi:10.48550/arxiv.2110.05977

Cited by 3 publications

(7 citation statements)

References 57 publications


“…The characteristics of obtained data are one of the challenges to tackle to achieve successful deployment of ML-based NIDS. Researchers often use benchmark datasets which contain features unobtainable in real-time [51] [25]. Flow-based features provide a useful overview of the activity of a network [1] [24].…”

Section: Methods 31 Flow-based Data (mentioning)

confidence: 99%

Towards Deployment Shift Inhibition Through Transfer Learning in Network Intrusion Detection

Pawlicki

Kozik

Choraś

2022

Proceedings of the 17th International Conference on Availability, Reliability and Security


Currently, machine learning sees growing adoption in numerous domains, including critical applications like cybersecurity. However, to fully enjoy the benefits of artificial intelligence, the end user must overcome some high barriers to entry. The deployment of machine-learning-based Network Intrusion Detection Systems requires the collection of labelled data to train the intelligent components. This is an expensive and laborious process, which necessitates expert knowledge in cyberattacks and computer networks. Even when using data collected and labelled on premises, phenomena like concept drift can cause the model to underperform, a concept known as deployment shift. This paper evaluates the use of transfer learning techniques to curb the effects of deployment shift in machine-learning-based network intrusion detection.


“…An essential aspect of network traffic classification is identifying applications used within the network. However, this task can be challenging due to the limited availability of datasets [ 1 – 3 ]. To advance this field, it is crucial to provide comprehensive and up-to-date datasets.…”

Section: Objective (mentioning)

confidence: 99%

ITC-net-audio-5: an audio streaming dataset for application identification in network traffic classification

Nikbakht

Teimouri

2024

BMC Res Notes


Objectives: An essential aspect of network traffic classification is application identification. This involves capturing and analyzing the traffic patterns of applications. There are few publicly available datasets that specifically capture streaming data from network-based applications. Therefore, our objective is to generate an up-to-date dataset with a focus on audio streaming data. This dataset can be a valuable resource for identifying audio streaming applications in the field of network traffic classification.

Data description: The dataset contains network traffic captured during audio streaming communications on five trending applications: Google Meet, Skype, Telegram, WhatsApp, and SoundCloud. It includes 500 files in PCAP format captured by the Wireshark and PCAPdroid tools during voice calls and online music playback. Using these tools concurrently helps avoid capturing background traffic.


“…Realistic and labelled datasets are a necessity when developing data-driven capabilities for both threat hunting and intrusion detection [1], [34], [42]. Datasets used to build such hunting or detection capabilities come with a large set of requirements from different sources: R1) datasets must contain modern attack data that is representative of current trends [20], [28]; R2) datasets need to be representative and accurate [20]; R3) datasets must provide all the relevant behavioural patterns for malicious and normal activities, and network traces [8], [29];…”

Section: Introduction (mentioning)

confidence: 99%

“…(Both the source code of LADEMU and the generated dataset can be found here: https://github.com/FFI-no/Paper-LADEMU) R4) datasets must capture the stages and strategies involved in the attacks to defend against Advanced Persistent Threats (APTs) [1]; R5) datasets must contain ground truth of the datapoints; to develop capabilities to detect APTs, or perform kill-chain detection, the labels must be fine-grained and indicate the different stages of an attack/campaign [8], [20]. Satisfying these requirements is far from easy.…”

Section: Introduction (mentioning)

confidence: 99%


LADEMU: a modular & continuous approach for generating labelled APT datasets from emulations

Gjerstad

Kadiric

Grov

et al.

2022

2022 IEEE International Conference on Big Data (Big Data)


Development and evaluation of data-driven capabilities for both threat hunting and intrusion detection require high-quality and up-to-date datasets. The generation of such datasets poses multiple challenges, which has led to a general lack of suitable datasets for this domain. One such difficulty is the ability to correctly label each datapoint at a suitable level of granularity. In this paper, we argue that the challenges faced when labelling datasets can to some degree be decoupled from realistic emulations of up-to-date attacks and benign behaviours. We propose a modular labelling approach that can be combined with existing emulation platforms that provide the necessary details used for labelling. A proof-of-concept implementation is provided with our LADEMU (Labelled APT Datasets from EMUlations) tool, which is integrated with the Mitre CALDERA emulation platform and uses the GHOSTS framework for benign behaviour. LADEMU captures both host and network logs and labels them at a sufficient level of detail to separate the various attack steps. This provides dataset support for the development of data-driven APT, multi-step and kill-chain capabilities. As a case, LADEMU is used to generate a labelled dataset from an intelligence-driven emulation plan of an advanced persistent threat (APT) group.
