Certain workloads such as in-memory databases are inherently hard to scale out and rely on cache-coherent scale-up non-uniform memory access (NUMA) systems to keep up with the ever-increasing demand for compute resources. However, many parallel programming frameworks such as OpenMP do not make efficient use of large scale-up NUMA systems, as they do not sufficiently consider data locality. In this work, we present PGASUS, a C++ framework for NUMA-aware application development that provides integrated facilities for NUMA-aware task parallelism and data placement. The framework is based on an extensive review of parallel programming languages and frameworks, incorporating the best practices of the field. In a comprehensive evaluation, we demonstrate that PGASUS provides average performance improvements of 1.56× and peak performance improvements of up to 4.67× across a wide range of workloads.
KEYWORDS: non-uniform memory access, programming model, scale-up computing
INTRODUCTION

The ever-increasing demand for compute resources necessitates continuous improvements in computer technology. Even though accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) are commonly used in many data-intensive applications, the majority of workloads still rely on the flexibility and versatility of multicore central processing units (CPUs).1 Many of these CPU-based workloads can be adapted to scale out across multiple systems to provide sufficient compute resources. Still, certain workloads such as in-memory databases2 or de novo genome assembly3 are inherently hard to scale out and therefore require as many resources as possible in a single scale-up system.

The most basic multi-CPU systems employed uniform memory access (UMA) architectures, in which multiple multicore CPUs are attached to a shared memory subsystem through facilities such as a front-side bus (FSB). All x86-based systems until the introduction of the SledgeHammer and Nehalem microarchitectures in 2009 were built with this memory architecture. From a software developer's perspective, UMA systems align conveniently with the shared-memory programming model. Unfortunately, sharing the memory subsystem among all multicore CPUs severely limits the scalability of multiprocessor systems, both in the number of multicore CPUs and in the amount of memory that can be accommodated in a single system.

Non-uniform memory access (NUMA) systems avoid this bottleneck, as each multicore CPU is equipped with dedicated memory controllers. Memory attached to other multicore CPUs can still be accessed transparently through inter-CPU interconnects such as Ultra Path Interconnect (UPI), Infinity Fabric (IF), and Power with A-bus, X-bus, OpenCAPI, and NVLink (PowerAXON).
However, remote memory access operations incur increased latencies and reduced bandwidth, especially on systems with more than four multicore CPUs where fully meshed connectivity among CPUs is no longer feasible. State-of-the-art NUMA systems support up to 32 multicore C...