Scalable RDMA performance in PGAS languages

Farreras, Montse; Almási, George; Caşcaval, Călin; Cortés, Toni

doi:10.1109/ipdps.2009.5161025

Cited by 20 publications

(8 citation statements)

References 14 publications

“…We incur negligible performance loss or marginal performance gain for all benchmarks ( tered until MPI_Finalize) slightly improves performance (1.12%) for CG. On a system with a higher deregistration cost, such as Myrinet/GM [7], we expect a larger performance improvement. Figure 11(f) displays the scenario we have shown in section 1.…”

Section: NAS Parallel Benchmarks (mentioning)

confidence: 98%

Scalable memory registration for high performance networks using helper threads

Liu, Cameron, Nikolopoulos, et al. 2011

Proceedings of the 8th ACM International Conference on Computing Frontiers

Remote DMA (RDMA) enables high performance networks to reduce data copying between an application and the operating system (OS). However, RDMA operations in some high performance networks require communication memory to be explicitly registered with the network adapter and pinned by the OS. Memory registration and pinning limit the flexibility of the memory system and reduce the amount of memory that user processes can allocate. These issues become more significant on multicore platforms, since registered memory demand grows linearly with the number of processor cores. In this paper we propose a new memory registration/deregistration strategy to reduce registered memory on multicore architectures for HPC applications. We hide the cost of dynamic memory management by offloading all dynamic memory registration and deregistration requests to a dedicated memory management helper thread. We investigate design policies and performance implications of the helper thread approach. We evaluate our framework with the NAS parallel benchmarks, for which our registration scheme significantly reduces the registered memory (23.62% on average and up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. We show that our system enables the execution of problem sizes that could not complete under existing memory registration strategies.
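The helper-thread idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `RegistrationHelper`, its `submit`/`close` methods, and the `fake_register`/`fake_deregister` callbacks are hypothetical names, and the callbacks only stand in for whatever registration call a real network library would provide. The sketch shows the application thread queueing requests and a dedicated thread serving them.

```python
import queue
import threading
from concurrent.futures import Future


class RegistrationHelper:
    """Toy helper thread that serves registration requests off the caller's path."""

    def __init__(self, register, deregister):
        # Hypothetical callbacks standing in for a real network library.
        self._ops = {"register": register, "deregister": deregister}
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Runs on the helper thread: drain requests until a None sentinel arrives.
        while True:
            item = self._requests.get()
            if item is None:
                return
            op, args, future = item
            try:
                future.set_result(self._ops[op](*args))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, op, *args):
        # Returns immediately; the caller waits on the future only when needed.
        future = Future()
        self._requests.put((op, args, future))
        return future

    def close(self):
        self._requests.put(None)
        self._thread.join()


registered = {}  # handle -> (addr, length); simulates the adapter's state


def fake_register(addr, length):
    handle = len(registered) + 1
    registered[handle] = (addr, length)
    return handle


def fake_deregister(handle):
    del registered[handle]


helper = RegistrationHelper(fake_register, fake_deregister)
pending = helper.submit("register", 0x1000, 4096)
handle = pending.result()  # block only at the point the handle is required
helper.submit("deregister", handle).result()
helper.close()
```

The single request queue serializes registration and deregistration, which keeps the simulated adapter state consistent without extra locking; a real design would also need the policies the abstract mentions, which this sketch does not attempt.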

“…They also batch deregistrations to reduce the average cost. Farreras et al [7] proposed a pin-down cache for Myrinet. They delay deregistration and cache registration information for future accesses to the same memory region.…”

Section: Related Work (mentioning)

confidence: 99%
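The pin-down cache described in this statement (delay deregistration, reuse cached registration information for the same region) can be sketched roughly as follows. This is an illustration under stated assumptions, not the design from the cited paper: the class name `PinDownCache`, exact-match `(addr, length)` keys, LRU eviction, and the `fake_register`/`fake_deregister` callbacks are all hypothetical.

```python
from collections import OrderedDict


class PinDownCache:
    """Toy pin-down cache: registrations stay alive after use and are reused."""

    def __init__(self, capacity, register, deregister):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # max number of cached registrations
        self.register = register      # hypothetical: (addr, length) -> handle
        self.deregister = deregister  # hypothetical: handle -> None
        self.entries = OrderedDict()  # (addr, length) -> handle, oldest first

    def acquire(self, addr, length):
        key = (addr, length)
        if key in self.entries:
            # Hit: skip registration and mark the region as recently used.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Miss: register now, and deregister only when capacity is exceeded.
        handle = self.register(addr, length)
        self.entries[key] = handle
        while len(self.entries) > self.capacity:
            _, old_handle = self.entries.popitem(last=False)
            self.deregister(old_handle)
        return handle


calls = {"register": 0, "deregister": 0}


def fake_register(addr, length):
    calls["register"] += 1
    return ("handle", addr, length)


def fake_deregister(handle):
    calls["deregister"] += 1


cache = PinDownCache(2, fake_register, fake_deregister)
first = cache.acquire(0x1000, 4096)
again = cache.acquire(0x1000, 4096)  # same region: served from the cache
cache.acquire(0x2000, 4096)
cache.acquire(0x3000, 4096)          # over capacity: oldest region is evicted
```

Exact-match keys keep the sketch short; a practical cache would also have to handle overlapping regions and memory that is freed while still registered.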

Scalable memory registration for high performance networks using helper threads

Liu, Cameron, Nikolopoulos, et al. 2011

Proceedings of the 8th ACM International Conference on Computing Frontiers

“…On the other hand, the IBM XLUPC compiler and runtime system uses a shared variable directory (SVD) to share the location of shared variables. The runtime system employs a local cache to reduce SVD accesses and allow RDMA accesses [11]. This is designed for large scale system and does not particularly address multi- and many-core systems that have lower latency.…”

Section: Related Work (mentioning)

confidence: 99%

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Serres, Anbar, Merchant, et al. 2011

2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PHD Forum

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity in many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism while accessing shared memory to convert shared addresses to physical addresses. This mechanism is already expensive in terms of performance in distributed memory environments, but it becomes a major bottleneck in machines with shared memory support, where the access latencies are significantly lower. Multi- and manycore processors exhibit even lower latencies for shared data due to on-chip cache space utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead may easily become the dominant factor in the overall data access time for such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting multi-dimensional arrays used in most scientific and image processing applications. Relative costs and the implementation details for UPC are evaluated with different workloads (matrix multiplication, Random Access benchmark and Sobel edge detection) on two different platforms: a manycore system, the TILE64 (a 64 core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from the programmers. Accordingly, this improves UPC productivity, as it will reduce the manual optimization effort required to minimize the address translation overhead.

“…For Myrinet networks GASNet provides a conduit for the GM driver [5], as does the IBM APGAS runtime [18]. GM is a legacy low-level messaging system for Myrinet network which was replaced by MX.…”

Section: Related Work (mentioning)

confidence: 99%

Asynchronous PGAS runtime for Myrinet networks

Farreras, Almási 2010

Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

PGAS languages aim to enhance productivity for large scale systems. The IBM Asynchronous PGAS runtime (APGAS) supports various high productivity programming languages including UPC, X10 and CAF. The runtime has been designed for scalability and performance portability, and it includes optimized implementations for LAPI and Blue Gene DCMF communication subsystems. This paper presents an optimized implementation of the IBM APGAS runtime for Myrinet networks, on top of the MX communication library. It explains the challenges of implementing a one-sided communication model (APGAS) on top of a two-sided communication API such as MX. We show that our implementation outperforms the Berkeley GASNet runtime in terms of latency and bandwidth. We also demonstrate scalability of various HPC benchmarks up to 1024 processes.
