Efficient software packet processing on heterogeneous and asymmetric hardware architectures

Koromilas, Lazaros; Vasiliadis, Giorgos; Manousakis, Ioannis; Ioannidis, Sotiris

doi:10.1145/2658260.2658265

Cited by 14 publications

(7 citation statements)

References 26 publications


“…Thus, they work on a per-packet base or using simple state machines, and they are easily amenable to parallelization. However, since pattern matching is costly (e.g., a core can cope with only ~100 Mbit/s), NIDS scalability is achieved with a large number of GPU cores as in the case of MIDeA and Kargus [2], with NPUs as in Koromilas [4] and DPI-S [3], or finally with FPGAs as in Das [10] and Jaic [11]. Fig.…”

Section: Years Of High Speed Traffic Processing (mentioning)

confidence: 99%

“…3a). This is offered by solutions such as PF_RING ZC 4 where custom per-packet load balancing can be coded and applied on the aggregate traffic received from the so called "DNA cluster", i.e., a group of NICs. In this case, all packets received from the NICs are passed to the DNA cluster process, which (i) timestamps and (ii) forwards them to the correct processing engine.…”

Section: A Packet Acquisition and Per-flow Load-balancing (mentioning)

confidence: 99%
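The dispatch step described in the statement above (timestamp each packet, then forward it to the right processing engine) can be illustrated with a minimal sketch. This is not PF_RING ZC code: the names `FlowKey`, `engine_for`, and `dispatch` are hypothetical, and the CRC32-based hash is an assumed stand-in for whatever load-balancing function a real deployment would use.

```python
import time
import zlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


def engine_for(flow: FlowKey, n_engines: int) -> int:
    """Pick an engine index for a flow.

    The two endpoints are sorted before hashing so that both directions
    of a connection land on the same engine (assumed requirement for
    per-flow state tracking).
    """
    a = (flow.src_ip, flow.src_port)
    b = (flow.dst_ip, flow.dst_port)
    lo, hi = sorted((a, b))
    canonical = f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{flow.proto}".encode()
    return zlib.crc32(canonical) % n_engines


def dispatch(flow: FlowKey, payload: bytes, queues: list) -> int:
    """Timestamp a packet and append it to the chosen engine's queue."""
    ts = time.monotonic_ns()          # (i) timestamp
    idx = engine_for(flow, len(queues))
    queues[idx].append((ts, flow, payload))  # (ii) forward
    return idx
```

Sorting the endpoints is one simple way to keep a flow's two directions together; a hardware RSS setup would need a symmetric hash key to get the same effect.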

“…Software developers have explored multicore CPUs, Graphical Processing Units (GPUs), Network Processing Units (NPUs), and FPGA architectures. This is testified by seminal [1] and more recent works [2], [3], [4] successfully scaling and optimizing multi-core Network Intrusion Detection Systems (NIDS), where a large set of rules have to be checked on a per-packet base. Fewer efforts have been instead devoted in the area of Statistical Traffic Analyzers (STAs) which instead aim to collect both basic statistics (e.g., TCP RTT or congestion events) and more articulated indexes (e.g., performance for video streaming and Webpage load time).…”

Section: Introduction (mentioning)

confidence: 99%


Traffic Analysis with Off-the-Shelf Hardware: Challenges and Lessons Learned

Trevisan, Finamore, Mellia et al. 2017

IEEE Commun. Mag.


In recent years, the progress in both hardware and software enabled user-space applications to capture packets at 10 Gbit/s line rate. However, processing packets at such rates with software running on Commercial Off-The-Shelf (COTS) hardware is still far from being trivial. In the literature, this challenge has been extensively studied for Network Intrusion Detection Systems (NIDS), where operations are per-packet and easier to parallelize also thanks to hardware acceleration. Conversely, the scalability of Statistical Traffic Analyzers (STA) is intrinsically more complex as it implies tracking per-flow state to collect statistics. This challenge received less attention so far, and it is the focus of this work. We discuss the design choices to enable a STA to collect hundreds of per-flow metrics at a multi 10 Gbit/s line rate. We leverage a handful of hardware advancements proposed over the last years (e.g., RSS queues, NUMA architecture), and we provide insights on the trade-offs they imply when combined with state of the art packet capture libraries and multi-process paradigm. We outline the principles to achieve an optimized STA, and we apply them to engineer DPDKStat, a solution combining the Intel DPDK framework with the traffic analyzer Tstat. Using traces collected from real networks, we demonstrate that DPDKStat achieves 40 Gbit/s of aggregated rate with a single COTS PC.


“…It uses heterogeneous earliest finish time (HEFT) scheduling algorithm, which is the best among greedy algorithms, and automatically calibrates the performance model by observing task completion times. Koromilas et al [32] tackles asymmetric scheduling problem of network packet processing workloads running on both integrated GPUs and discrete GPUs. Differently from above work, our framework targets a complex system where the performance of heterogeneous processors have interdependencies to each other and IO as well as computation has critical impacts to the performance.…”

Section: Related Work (mentioning)

confidence: 99%

NBA (network balancing act)

Kim, Jang, Lee et al. 2015

Proceedings of the Tenth European Conference on Computer Systems


We present the NBA framework, which extends the architecture of the Click modular router to exploit modern hardware, adapts to different hardware configurations, and reaches close to their maximum performance without manual optimization. NBA takes advantage of existing performance-excavating solutions such as batch processing, NUMA-aware memory management, and receive-side scaling with multi-queue network cards. Its abstraction resembles Click but also hides the details of architecture-specific optimization, batch processing that handles the path diversity of individual packets, CPU/GPU load balancing, and complex hardware resource mappings due to multi-core CPUs and multi-queue network cards. We have implemented four sample applications: an IPv4 and an IPv6 router, an IPsec encryption gateway, and an intrusion detection system (IDS) with Aho-Corasick and regular expression matching. The IPv4/IPv6 router performance reaches the line rate on a commodity 80 Gbps machine, and the performance of the IPsec gateway and the IDS reaches above 30 Gbps. We also show that our adaptive CPU/GPU load balancer reaches near-optimal throughput in various combinations of sample applications and traffic conditions.


“…Both algorithms have been shown to achieve great performance in graphics processors [9,19], while at the same time both have many optimized CPU implementations to compare with. After benchmarking of various implementations, we found the open source password cracker John The Ripper [16] to achieve the best performance among many others.…”

Section: Brute-force Unpacking (mentioning)

confidence: 99%

GPU-assisted malware

Vasiliadis, Polychronakis, Ioannidis 2014

Int. J. Inf. Secur.

Self Cite


Malware writers constantly seek new methods to increase the infection lifetime of their malicious software. To that end, techniques such as code unpacking and polymorphism have become the norm for hindering automated or manual malware analysis and evading virus scanners. In this paper, we demonstrate how malware can take advantage of the ubiquitous and powerful graphics processing unit (GPU) to increase its robustness against analysis and detection. We present the design and implementation of brute-force unpacking and runtime polymorphism, two code armoring techniques based on the general-purpose computing capabilities of modern graphics processors. By running part of the malicious code on a different processor architecture with ample computational power, these techniques pose significant challenges to existing malware detection and analysis systems, which are tailored to the analysis of CPU code. We also discuss how upcoming GPU features can be used to build even more robust and evasive malware, as well as directions for potential defenses against GPU-assisted malware.
