2021
DOI: 10.1016/j.jpdc.2021.02.013
TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Cited by 19 publications (9 citation statements)
References 10 publications
“…The challenge emerges when any one of the dimensions is small relative to the other two; in this case, the operational intensity approaches O(1), requiring highly efficient data movement to avoid becoming memory-bound. Such "tall-and-skinny" matrices are difficult to process efficiently on GPUs [24]. While operational intensity can sometimes be addressed by processing multiple inputs simultaneously via batching, this may not be an option for latency-sensitive inference operations where input must be processed as soon as it is received.…”
Section: B. Workload Taxonomy
Citation type: mentioning
confidence: 99%
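
As a worked check on the O(1) claim in the excerpt above (a sketch under assumed symbols: A is m x k, B is k x n, and s is the number of bytes per matrix element; none of these names come from the excerpt), the operational intensity of one tall-and-skinny multiply C = AB is roughly

\[
I(m,n,k) \;=\; \frac{2mnk}{s\,(mk + kn + mn)}
\;\approx\; \frac{2mnk}{s\,mk} \;=\; \frac{2n}{s}
\qquad (n \ll m \text{ and } n \ll k),
\]

so with n = 2 in double precision (s = 8 bytes) the intensity is about 0.5 flop/byte, typically far below the compute-to-bandwidth ratio of current GPUs, which is why such multiplies stay memory-bound unless data movement is kept close to the lower bound.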
“…One challenge in designing the algorithm is to simultaneously maximize coalesced global memory access, minimize bank conflicts when accessing shared memory, and minimize thread divergence. We use a dynamic data-thread assignment strategy [17][18][19][20][21] to optimize both the access and computation of coefficients.…”
Section: Iterative Processing Kernel (IPK)
Citation type: mentioning
confidence: 99%
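
To make the coalescing constraint in that excerpt concrete, here is a minimal CUDA sketch, not the TSM2X kernel: it only shows why a one-row-per-thread mapping over a column-major tall matrix yields coalesced global loads. The kernel name, the row-sum computation, and the launch shape are illustrative assumptions.

// Minimal illustrative sketch (not the TSM2X kernel). A is a column-major
// m x k matrix with m >> k; one row is assigned to each thread, so the
// threads of a warp read consecutive addresses within every column.
__global__ void coalesced_row_sum(const double *A, double *out, int m, int k)
{
    // Adjacent threads handle adjacent rows.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    double acc = 0.0;
    for (int j = 0; j < k; ++j) {
        // Column-major indexing: element (row, j) lives at A[j * m + row].
        // Threads row, row+1, ... touch addresses one element apart, so each
        // warp's load coalesces into a few wide memory transactions.
        acc += A[(size_t)j * m + row];
    }
    out[row] = acc;  // One write per thread, also coalesced.
}

A typical launch would be coalesced_row_sum<<<(m + 255) / 256, 256>>>(dA, dOut, m, k). The dynamic data-thread assignment mentioned in the excerpt goes further and adapts how many elements each thread owns to the input sizes; this sketch does not attempt that.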
“…As the type of processor that contributes most of the computing parallelism in many current and future HPC systems, Graphics Processing Units (GPUs), equipped with thousands of low-power cores, offer high computational power and energy efficiency. Many applications and libraries have been designed and optimized for GPU accelerators [1,3,8,9,13,25,34,36,42,43]. Because GPUs are designed for highly parallelizable computations while CPUs are more efficient at serial computations, CPUs and GPUs linked through fast interconnects [30,31] are usually combined into heterogeneous systems that can efficiently handle a large spectrum of scientific computing workloads.…”
Section: Introduction
Citation type: mentioning
confidence: 99%