This paper investigates the efficient use of half-precision floating-point (FP16) arithmetic on GPUs to accelerate LU decomposition in double (FP64) precision. Motivated by the goal of improving computational efficiency, we introduce two novel algorithms: Pre-Pivoted LU (PRP) and Mixed-precision Panel Factorization (MPF). Deployed in both hybrid CPU-GPU setups and native GPU-only configurations, PRP identifies pivot lists through an LU decomposition computed in reduced precision, then reorders the matrix rows in FP64 before executing LU decomposition without pivoting. Two variants of PRP, hPRP and xPRP, are introduced, differing in whether the pivot lists are computed in full half precision or in mixed half-single precision. The MPF algorithm produces an FP64 LU factorization while internally using hPRP for panel factorization, achieving accuracy on par with standard DGETRF at superior speed. The study further explores the auxiliary functions required for the native-mode implementation of the PRP variants and MPF.
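The core PRP idea described above (discover the pivot order cheaply in reduced precision, then factor the FP64 matrix without pivoting after reordering) can be illustrated with a minimal NumPy sketch. This is not the paper's GPU implementation: FP32 stands in for FP16 (NumPy's LAPACK path has no half-precision LU), and the no-pivot elimination is a plain textbook loop.

```python
import numpy as np
from scipy.linalg import lu_factor

def prp_lu(A):
    """Pre-Pivoted LU sketch: pivots found in reduced precision,
    then no-pivot Gaussian elimination on the reordered FP64 matrix."""
    # Step 1: pivot discovery in reduced precision (FP32 stands in for FP16).
    _, ipiv = lu_factor(A.astype(np.float32))  # partial-pivoting LU, low precision
    # Convert LAPACK's sequential-swap ipiv into a row permutation.
    perm = np.arange(A.shape[0])
    for i, p in enumerate(ipiv):
        perm[[i, p]] = perm[[p, i]]
    # Step 2: reorder rows in FP64 and factor WITHOUT pivoting.
    U = A[perm].astype(np.float64)
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]       # multipliers
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
        U[k + 1:, k] = 0.0                           # exact zeros below diagonal
    return perm, L, U
```

For a well-conditioned matrix, the low-precision pivot list matches the FP64 one almost everywhere, so the no-pivot factorization of the pre-permuted matrix satisfies `A[perm] ≈ L @ U`.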
Parker and Lê introduced random butterfly transforms (RBTs) as a preprocessing technique to replace pivoting in dense LU factorization. Unfortunately, their FFT-like recursive structure restricts the dimensions of the matrix, and on multi-node systems, efficient management of the communication overhead restricts the matrix's distribution even further. To remove these limitations, we generalized the RBT to arbitrary matrix sizes by truncating the dimensions of each layer in the transform. We extended Parker's theoretical analysis to the generalized RBT, showing in particular that, in exact arithmetic, Gaussian elimination with no pivoting succeeds with probability 1 after transforming a matrix with full-depth RBTs. Furthermore, we experimentally show that these generalized transforms improve performance over Parker's formulation by up to 62% while retaining the ability to replace pivoting. The generalized RBT is available in the SLATE numerical software library.
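The mechanism behind RBT preprocessing can be sketched with a minimal depth-1, two-sided transform in NumPy. This is an illustrative toy, not SLATE's implementation or the paper's generalized (truncated) variant: a single butterfly layer B = (1/√2)[[R0, R1], [R0, −R1]] with random diagonal blocks R0, R1 is applied on both sides, which randomizes the matrix enough that no-pivot elimination avoids the zero leading pivot in the example below.

```python
import numpy as np

def butterfly(n, rng):
    """One random butterfly layer (n even): B = 1/sqrt(2) [[R0, R1], [R0, -R1]]
    with random diagonal R0, R1 whose entries are scalings near 1."""
    h = n // 2
    r0 = np.exp(rng.uniform(-0.05, 0.05, h))
    r1 = np.exp(rng.uniform(-0.05, 0.05, h))
    B = np.zeros((n, n))
    B[:h, :h] = np.diag(r0)
    B[:h, h:] = np.diag(r1)
    B[h:, :h] = np.diag(r0)
    B[h:, h:] = -np.diag(r1)
    return B / np.sqrt(2)

def rbt_transform(A, rng):
    """Depth-1 two-sided RBT: returns U^T A V together with (U, V)."""
    n = A.shape[0]
    U = butterfly(n, rng)
    V = butterfly(n, rng)
    return U.T @ A @ V, (U, V)
```

To solve A x = b, one factors the transformed matrix U^T A V (now amenable to no-pivot LU), solves U^T A V y = U^T b, and recovers x = V y; the butterflies are cheap to apply and invert because each layer is built from diagonal blocks.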
SLATE (Software for Linear Algebra Targeting Exascale) is a distributed, dense linear algebra library targeting both CPU-only and GPU-accelerated systems, developed over the course of the Exascale Computing Project (ECP). While it began with several documents setting out its initial design, significant design changes occurred throughout its development. In some cases, these were anticipated: an early version used a simple consistency flag that was later replaced with a full-featured consistency protocol. In other cases, performance limitations and software and hardware changes prompted a redesign. Sequential communication tasks were parallelized; host-to-host MPI calls were replaced with GPU device-to-device MPI calls; more advanced algorithms such as Communication Avoiding LU and the Random Butterfly Transform (RBT) were introduced. Early choices that turned out to be cumbersome, error prone, or inflexible have been replaced with simpler, more intuitive, or more flexible designs. Applications have been a driving force, prompting a lighter-weight queue class, nonuniform tile sizes, and more flexible MPI process grids. Of paramount importance has been building a portable library that works across several different GPU architectures (AMD, Intel, and NVIDIA) while keeping a clean and maintainable codebase. Here we explore the evolving design choices and their effects, both in terms of performance and software sustainability.