Modern multicore processors with an on-chip memory controller form the basis of NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and reaches the other parts via the memory controllers of other processors, over the cross-processor interconnect. As a consequence, a processor's memory controller must satisfy two kinds of requests: those generated by the local cores and those that arrive via the interconnect from other processors. Conversely, a core (or rather the core's cache) can obtain data from multiple sources: data can be supplied by the local memory controller or by a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time; it may be more advantageous to allocate data on a remote processor (and then fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and thereby overload the on-chip memory controller).
Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study showing that many scientific loop-parallel programs include multiple, mutually incompatible data access patterns; as a result, these programs encounter a high fraction of costly remote memory accesses. Matching the data distribution of a program to an individual data access pattern is possible; however, it is difficult to find a data distribution that matches all access patterns. Directives as included in, e.g., OpenMP provide a way to distribute the computation, but the induced data partitioning does not take into account the placement of data in the processors' memory. To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling. Using the primitives together with simple program-level transformations eliminates mutually incompatible access patterns from OpenMP-style parallel programs. The result is an improvement of up to 3.3X over the default setup, and the programs obtain a speedup of up to 33.6X over single-core execution (19X on average) on a 4-processor 32-core machine.
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA), in which accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages and libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or even worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB; more importantly, the optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations.