SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21) 2021
DOI: 10.1137/1.9781611976830.14

A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver

Abstract: Sparse triangular solve (SpTRSV) is used in conjunction with sparse LU factorization for solving sparse linear systems, either as a direct solver or as a preconditioner. As GPUs have become a first-class compute citizen, designing an efficient and scalable SpTRSV on multi-GPU HPC systems is imperative. In this paper, we leverage the advantage of GPU-initiated data transfers of NVSHMEM to implement and evaluate a multi-GPU SpTRSV. We create a novel producer-consumer paradigm to manage the computation and communication in SpTRSV and im…
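
A minimal sketch of the kind of GPU-initiated, producer-consumer exchange NVSHMEM enables (an illustration of the general technique, not the paper's actual implementation): the producer PE pushes a solved entry into the consumer PE's symmetric buffer with a device-side put, orders it with a fence, and raises a flag on which the consumer spin-waits. The single-value payload, fixed PE roles, and buffer names are assumptions made for brevity.

```cuda
// Sketch only: one producer PE (0) hands one solved value to one consumer PE (1)
// using GPU-initiated NVSHMEM puts and a flag. Real multi-GPU SpTRSV schedules
// many such dependencies along the elimination DAG.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void producer(double *x_remote, int *ready_remote, int peer) {
    double solved = 42.0;                      // placeholder for a solved unknown x_i
    nvshmem_double_p(x_remote, solved, peer);  // one-sided put initiated from the GPU
    nvshmem_fence();                           // order the data put before the flag
    nvshmem_int_p(ready_remote, 1, peer);      // signal that the value has arrived
}

__global__ void consumer(const double *x, int *ready) {
    nvshmem_int_wait_until(ready, NVSHMEM_CMP_EQ, 1);  // spin until producer signals
    printf("consumer received x = %f\n", *x);
}

int main() {
    nvshmem_init();                            // per-PE device selection omitted for brevity
    int pe = nvshmem_my_pe();

    // Symmetric (NVSHMEM-visible) allocations, one per PE.
    double *x  = (double *) nvshmem_malloc(sizeof(double));
    int *ready = (int *)    nvshmem_malloc(sizeof(int));
    cudaMemset(ready, 0, sizeof(int));
    nvshmem_barrier_all();

    // Kernels containing NVSHMEM synchronization are normally launched with
    // nvshmemx_collective_launch; plain launches keep this sketch short.
    if (pe == 0)      producer<<<1, 1>>>(x, ready, 1);
    else if (pe == 1) consumer<<<1, 1>>>(x, ready);
    cudaDeviceSynchronize();

    nvshmem_free(x);
    nvshmem_free(ready);
    nvshmem_finalize();
    return 0;
}
```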

Cited by 5 publications (5 citation statements) · References 32 publications
“…SuperLU: For the multi-GPU sparse triangular solve (SpTRSV), we leverage the advantage of GPU-initiated data transfers of NVSHMEM. The new multi-GPU SpTRSV implementation using two CUDA streams achieves a 3.7× speedup when using twelve GPUs (two nodes of the Summit supercomputer at ORNL) relative to our implementation on a single GPU, and up to 6.1× compared to NVIDIA's cuSOLVER (cuSPARSE csrsv2) over the range of one to eighteen GPUs [164]. In the new v7.0.0 release of SuperLU_DIST, we released the 3D factorization algorithm, where MPI ranks are arranged as a 3D process grid Px × Py × Pz.…”
Section: Recent Progress (mentioning)
confidence: 87%
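A small host-side sketch of the Px × Py × Pz arrangement mentioned in the quote, using MPI's Cartesian-topology helpers; the 2 × 3 × 2 shape is an illustrative assumption, and this is not SuperLU_DIST's actual setup code.

```c++
// Sketch: arrange MPI ranks as a Px x Py x Pz 3D process grid and recover each
// rank's (px, py, pz) coordinates. Requires Px*Py*Pz ranks (12 with this shape).
#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int dims[3]    = {2, 3, 2};   // Px, Py, Pz (illustrative)
    int periods[3] = {0, 0, 0};   // non-periodic in every direction
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &grid);

    int rank, coords[3];
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 3, coords);   // flat rank -> (px, py, pz)
    printf("rank %d -> (px=%d, py=%d, pz=%d)\n",
           rank, coords[0], coords[1], coords[2]);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```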
“…They can sometimes scale to thousands and even millions of cores [57-59]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system). We use GEMM [60] and STREAM [61] as two representative benchmarks to show the implications as the data size grows.…”
Section: Traditional HPC Workload Bookends (mentioning)
confidence: 99%
“…Traditional HPC workloads are designed for distributed-memory systems. They can sometimes scale to thousands and even millions of cores [57-59]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system).…”
Section: Application Case Studies (mentioning)
confidence: 99%
“…Traditional HPC workloads are designed for distributed-memory systems. They can sometimes scale to thousands and even millions of cores [39-41]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system).…”
Section: B. Application Characteristics (mentioning)
confidence: 99%