Proceedings of the ACM International Conference on Supercomputing 2021
DOI: 10.1145/3447818.3460364
FT-BLAS

Abstract: Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance…
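The abstract is truncated above, but its hybrid strategy pairs instruction duplication for the memory-bound Level-1/2 routines with algorithm-based fault tolerance (ABFT) checksums for the compute-bound Level-3 routines. As a rough illustration of the duplication half only, here is a hypothetical scalar C sketch (ft_daxpy is an invented name; FT-BLAS itself implements this with duplicated AVX-512 instructions in hand-tuned assembly):

/* Hypothetical sketch of duplication-based soft-error checking for a
 * memory-bound routine (y = alpha*x + y). Each element is computed twice;
 * a mismatch signals a transient fault, triggering recomputation. */
void ft_daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double r1 = alpha * x[i] + y[i];   /* primary computation    */
        double r2 = alpha * x[i] + y[i];   /* duplicated computation */
        while (r1 != r2) {                 /* copies disagree: a soft error hit one */
            r1 = alpha * x[i] + y[i];      /* recompute until both copies agree */
            r2 = alpha * x[i] + y[i];
        }
        y[i] = r1;
    }
}

The appeal of this scheme for Level-1/2 routines is that they are bound by memory bandwidth, so the doubled arithmetic largely hides behind time already spent waiting on loads and stores.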

Cited by 8 publications (1 citation statement)
References 57 publications

“…GEMM employs a series of architecture-aware optimization strategies, such as cache- and register-level data re-use, prefetching, and vectorization, that improve the hardware utilization of a program from a marginal <1% to a near-optimal efficacy (>90%) [83,230,207]. To leverage the highly optimized GEMM subroutine, the order of data in memory for spin configurations S_r^α(t) […]. A strategy that fuses the memory footprint of the element-wise operation with the compute-bound GEMM operation to hide the memory latency is a sound solution that benefits a series of GEMM-based scientific computing and machine-learning applications [263,264]. Therefore, we delve into the black box of GEMM kernels, enabling memory-bandwidth-efficient computations for "Daxpy"…”
Section: GEMM Variant
Confidence: 99%
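The fusion idea in this excerpt can be made concrete with a toy sketch (my own illustration under simplifying assumptions, not the kernels from the cited works): apply the element-wise, Daxpy-like update to each output tile immediately after the GEMM loop finishes it, while the tile is still resident in cache, instead of making a second full pass over C.

#include <stddef.h>

/* Toy sketch of operator fusion for square n x n matrices, assuming
 * n is a multiple of BLK. After a BLK x BLK tile of C += A*B is done,
 * the element-wise update C += alpha*D runs on the still-cached tile,
 * so C is not re-read from main memory in a separate pass. */
#define BLK 64

void gemm_fused_axpy(size_t n, const double *A, const double *B,
                     double *C, double alpha, const double *D)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK) {
            /* Blocked GEMM: accumulate the (ii, jj) tile of C */
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t i = ii; i < ii + BLK; i++)
                    for (size_t k = kk; k < kk + BLK; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLK; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            /* Fused epilogue: element-wise update on the hot tile */
            for (size_t i = ii; i < ii + BLK; i++)
                for (size_t j = jj; j < jj + BLK; j++)
                    C[i * n + j] += alpha * D[i * n + j];
        }
}

A production kernel would instead fuse the epilogue into a vectorized, register-blocked micro-kernel; the point here is only the memory-traffic argument: the fused version streams D once and touches each tile of C while it is cache-resident, rather than paying a full extra read-modify-write sweep over C.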