Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2013
DOI: 10.1145/2503210.2503292

An improved parallel singular value algorithm and its implementation for multicore hardware

Abstract: The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, continuing through kernel optimization and tuning, and finishing with our programming model. All of these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solv…

Citations: cited by 32 publications (34 citation statements)
References: 58 publications (87 reference statements)
“…In [15] a row-major data layout has been proposed to improve DRAM's bandwidth efficiency and reduce bank conflicts in FPGA's BRAM banks. Also, tile-aware memory layouts have been previously proven effective for multi-core [36] and GPU implementations [37] of linear algebra algorithms, directly affecting their cache performance, bandwidth efficiency, and the degree of parallelism. In this paper, we introduce a general and flexible form called 4D-tiling (subsection IV-A) allowing for optimization of performance and energy efficiency under given constraints such as on-die SPM and DRAM bandwidth usage.…”
Section: B Implementation Challenges Of Modern Convnets
citation type: mentioning
confidence: 99%
“…In the experiments, we employed square symmetric matrices for SEVP, and both square and rectangular matrices for the SVD, with random entries uniformly distributed in (0, 1), and dimensions of up to 10000 in steps of 500. We reiterate that the optimal bandwidth w depends not only on the implementation of the first stage, but also on that of the second stage, for which there exist multiple algorithms and tuned implementations, depending on the target architecture [9,18,19,10], the problem size, etc. For this reason, we decided to test the algorithms using six bandwidths: w = {32, 64, 96, 128, 192, 256}.…”
Section: Experimental Evaluation
citation type: mentioning
confidence: 99%
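The parameter grid described in the statement above can be sketched as follows. This is only an illustration, not the citing authors' code: the names `DIMENSIONS`, `BANDWIDTHS`, `random_matrix`, and `sweep` are hypothetical, and the starting dimension of 500 is an assumption, since the excerpt gives only the upper bound and the step. Note also that `random.random()` samples from [0, 1) rather than the open interval (0, 1).

```python
import random
from itertools import product

# Assumption: the sweep starts at 500; the excerpt states only "up to 10000 in steps of 500".
DIMENSIONS = range(500, 10001, 500)
# The six bandwidths w listed in the excerpt.
BANDWIDTHS = (32, 64, 96, 128, 192, 256)


def random_matrix(rows, cols, rng):
    """Return a rows x cols list of lists with entries drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def sweep():
    """Enumerate every (dimension, bandwidth) pair of the experiment grid."""
    return list(product(DIMENSIONS, BANDWIDTHS))
```

Each pair returned by `sweep()` would correspond to one run of the two-stage reduction; the solver itself is outside the scope of this sketch.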
“…This is in contrast with LAPACK, where one tall panel (block of columns) is eliminated at a time, making it difficult to achieve cache efficiency and apply multithreading. In the course of the PLASMA project, tile algorithms have been developed for a wide range of algorithms, including: Cholesky, LU and QR factorizations [11,14,16], as well as reductions to band forms for solving the singular value problem or the eigenvalue problem [23,31].…”
Section: Plasma
citation type: mentioning
confidence: 99%
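The tile-oriented storage that the statements above refer to can be illustrated with a minimal sketch. It assumes a matrix held as a flat row-major list whose dimensions are exact multiples of the tile size; the function names `to_tile_layout` and `tile_index` are hypothetical and are not taken from PLASMA or from the cited paper.

```python
def to_tile_layout(a, n_rows, n_cols, tile):
    """Repack a flat row-major matrix so that each tile x tile block is contiguous."""
    if n_rows % tile or n_cols % tile:
        raise ValueError("this sketch assumes dimensions divisible by the tile size")
    out = []
    for ti in range(0, n_rows, tile):        # first row of the current tile
        for tj in range(0, n_cols, tile):    # first column of the current tile
            for i in range(ti, ti + tile):   # copy the tile one row at a time
                start = i * n_cols + tj
                out.extend(a[start:start + tile])
    return out


def tile_index(i, j, n_cols, tile):
    """Map element (i, j) to its position in the tiled layout."""
    tiles_per_row = n_cols // tile
    tile_id = (i // tile) * tiles_per_row + (j // tile)
    return tile_id * tile * tile + (i % tile) * tile + (j % tile)
```

Because every tile occupies one contiguous run of memory, a kernel operating on a single tile touches a compact region, which is the property the statements associate with better cache behavior and finer-grained parallelism.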