Abstract: Performance problems in parallel programs manifest as a lack of scalability. These scalability issues are often very difficult to debug. They can stem from synchronization overhead, poor thread scheduling decisions, or contention for hardware resources such as shared caches. Traditional profiling tools attribute program cycles to different functions, but do not give immediate insight into the issues limiting scalability. Profiling information is very program-specific and is usually processed manually by a human expert in a time-consuming and cumbersome process. Our experience in tuning the performance of parallel applications led us to discover that performance tuning can be considerably simplified, and even to some degree automated, if profiling measurements are organized according to several intuitive performance factors common to most parallel programs. In this work we present these factors and propose a hierarchical framework composing them. We present three case studies where analyzing profiling data according to the proposed principle led us to improve the performance of three parallel programs by a factor of 6-20×. Our work lays the foundation for new ways of organizing and visualizing profiling data in performance tuning tools.
Network Function (NF) deployments suffer from poor link goodput, because popular NFs such as firewalls process only packet headers while receiving and transmitting complete packets. As a result, packet payloads needlessly consume link bandwidth. We introduce PayloadPark, which improves goodput by temporarily parking packet payloads in the stateful memory of dataplane programmable switches. PayloadPark forwards only packet headers to NF servers, thereby saving bandwidth between the switch and the NF server. PayloadPark is a transparent in-network optimization that complements existing approaches for optimizing NF performance on end-hosts. We prototyped PayloadPark on a Barefoot Tofino ASIC using the P4 language. Our prototype, when deployed on a top-of-rack switch, can service up to 8 NF servers using less than 40% of the on-chip memory resources. The prototype improves goodput by 10-26% for a Firewall → NAT NF chain and reduces PCIe bandwidth load by 2-58%. With workloads that have datacenter network traffic characteristics, PayloadPark provides a 13% goodput gain with a Firewall → NAT → LB NF chain, without latency penalty. In this scenario, we can further increase the goodput gain to 28% by using packet recirculation.
CCS CONCEPTS • Networks → Programmable networks; Middle boxes / network appliances.
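The parking idea described above can be illustrated with a minimal Python sketch. It is not the PayloadPark implementation, which is written in P4 for a Tofino switch and whose details the abstract does not give. The `ToySwitch` class, the 42-byte header length (Ethernet + IPv4 + UDP), the 1-byte flag, and the 4-byte tag are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

HEADER_LEN = 42  # assumed: 14 B Ethernet + 20 B IPv4 + 8 B UDP


@dataclass
class ToySwitch:
    """Hypothetical model of a switch that parks payloads in its own memory."""
    capacity: int                                  # number of payload slots (assumed)
    parked: Dict[int, bytes] = field(default_factory=dict)
    next_tag: int = 0

    def to_nf(self, packet: bytes) -> bytes:
        """Park the payload and emit flag + tag + header toward the NF server.

        If no slot is free (or there is no payload), the packet is sent
        whole with flag 0 so that the optimization stays transparent.
        """
        header, payload = packet[:HEADER_LEN], packet[HEADER_LEN:]
        if not payload or len(self.parked) >= self.capacity:
            return b"\x00" + packet
        tag = self.next_tag
        self.next_tag = (self.next_tag + 1) % 2**32
        self.parked[tag] = payload
        return b"\x01" + tag.to_bytes(4, "big") + header

    def from_nf(self, frame: bytes) -> bytes:
        """Reattach the parked payload to the header returned by the NF."""
        if frame[0] == 0:
            return frame[1:]                       # packet was never parked
        tag = int.from_bytes(frame[1:5], "big")
        header = frame[5:]
        return header + self.parked.pop(tag)       # frees the slot


switch = ToySwitch(capacity=1)
packet = bytes(HEADER_LEN) + b"x" * 1000           # 1042-byte packet
frame = switch.to_nf(packet)                       # what crosses the NF link
restored = switch.from_nf(frame)                   # what leaves the switch
```

In this toy model the 1042-byte packet crosses the switch-to-NF link as a 47-byte frame (1-byte flag, 4-byte tag, 42-byte header), and the original packet is reassembled on the way back. The fallback path for a full table reflects the general constraint that switch memory is limited; how the real system handles that case is not stated in the abstract.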