Stretching Jacobi: Two-Stage Pivoting in Block-Based Factorization

Thuerck, Daniel

doi:10.1109/ia349570.2019.00014

Cited by 2 publications

(3 citation statements)

References 21 publications


“…Due to the higher bandwidth of registers, strictly following this idiom can lead to high speedups. The prime use case for this idiom are batched computations on small portions of data held in registers, e.g., batched GEMM or batched matrix factorizations [5,48]. In those examples, each thread inside a warp loads, e.g., a row of the warps' matrix and uses shuffle whenever it needs access to the row stored by another thread.…”

Section: The Warp Register Cache Idiom (mentioning)

confidence: 99%
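The idiom quoted in the citation statement above (one matrix row per thread, shuffles for cross-row access) can be illustrated with a minimal host-side sketch. This is a plain Python simulation, not GPU code and not the cited authors' implementation: `shfl` is a hypothetical stand-in for a warp shuffle such as CUDA's `__shfl_sync`, `lu_in_registers` is an illustrative name, and unpivoted LU is chosen only as a simple example of a small matrix factorization.

```python
# Host-side simulation of the warp register cache idiom (assumption: one
# simulated "lane" per matrix row; no real GPU registers are involved).

def shfl(regs, src_lane):
    """Hypothetical shuffle: every lane reads the value held by `src_lane`."""
    return [regs[src_lane] for _ in regs]


def lu_in_registers(matrix):
    """Unpivoted LU of a small square matrix, one row per simulated lane.

    Returns the rows with L (unit diagonal implied) below the diagonal
    and U on and above it.
    """
    n = len(matrix)
    # rows[lane][j] models register j of that lane.
    rows = [[float(x) for x in r] for r in matrix]
    for k in range(n - 1):
        # Broadcast pivot row k, one register (column) at a time.
        pivot = [shfl([rows[lane][j] for lane in range(n)], k) for j in range(n)]
        for lane in range(n):  # on a GPU these lanes would run in lockstep
            if lane > k:
                factor = rows[lane][k] / pivot[k][lane]
                rows[lane][k] = factor  # store the L entry in place
                for j in range(k + 1, n):
                    rows[lane][j] -= factor * pivot[j][lane]
    return rows
```

In this sketch no lane ever indexes another lane's row directly; all cross-row reads go through `shfl`, which is the restriction the quoted passage describes.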

“…The warp register cache idiom has become mainstream in the CUDA community [5,16,48,50], having been generalized into "task-parallel programming for warps" [8].…”

Section: Related Work (mentioning)

confidence: 99%

“…Section 2). While adhering to these two principles limits the expressiveness of CUDA code, the resulting compute kernels can offer a 2-3× speedup over other implementations, e.g., those that use a shared memory approach [48]. Our key insight is that these restrictions also make the implemented compute kernels easier to port to different architectures, while maintaining high performance.…”

Section: Introduction (mentioning)

confidence: 99%


Flynn’s Reconciliation

Thuerck, Weber, Bifulco

2021

ACM Trans. Archit. Code Optim.

Self Cite


A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.

