The one-sided block Jacobi (OSBJ) method is known to be an efficient method for computing the singular value decomposition on a parallel computer. In this paper, we focus on the most recent variant of the OSBJ method, the one with parallel dynamic ordering and variable blocking, and present both theoretical and experimental analyses of the algorithm. In the first part of the paper, we provide a detailed theoretical analysis of its convergence properties. In the second part, based on preliminary performance measurements on the Fujitsu FX10 and SGI Altix ICE parallel computers, we identify two performance bottlenecks of the algorithm and propose new implementations to resolve them. Experimental results show that they are effective and achieve speedups of up to 1.8 and 1.4 times in total execution time on the FX10 and the Altix ICE, respectively. Comparison with the ScaLAPACK SVD routine PDGESVD shows that our OSBJ solver is efficient when solving small to medium sized problems (n < 10000) using a modest number (< 100) of computing nodes.

However, from the viewpoint of high performance computing, this bi-diagonalization-based approach has two drawbacks. First, the bi-diagonalization step has only fine-grained parallelism and requires O(n) interprocessor communications. This often causes a performance bottleneck. Second, half of the computational work in the bi-diagonalization step is done in the form of level-2 BLAS (matrix-vector multiplication). As level-2 BLAS is a memory-intensive operation and cannot use cache memory efficiently, this tends to lower the performance, especially when the matrix size is large.

An alternative to the bi-diagonalization-based method is the one-sided block Jacobi (OSBJ) method [5][6][7][8][9][10][11][12], which has recently attracted attention. In this method, one first partitions the input matrix logically into ℓ block columns as A = [A_1, A_2, …, A_ℓ].‡

‡ In the main loop, about ℓ/(ℓ + 1) of the total computational work is performed in the form of level-3 BLAS. See
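To make the block-column idea concrete, the following is a minimal serial sketch of the core OSBJ iteration: the columns of A are partitioned into ℓ block columns, and each pair of blocks is repeatedly orthogonalized by diagonalizing its Gram matrix, so that the pairwise update is performed as a dense matrix product (level-3 BLAS). This is only an illustration under simplifying assumptions; it does not reflect the paper's parallel dynamic ordering or variable blocking, and the function name `osbj_svd` and its parameters are hypothetical.

```python
import numpy as np

def osbj_svd(A, ell=4, tol=1e-14, max_sweeps=60):
    """Singular values of A via a serial one-sided block Jacobi sketch.

    Partitions the columns of A into `ell` block columns and repeatedly
    orthogonalizes each pair of blocks by diagonalizing the Gram matrix
    of the pair. On convergence the columns of A are mutually orthogonal,
    and their norms are the singular values.
    """
    A = np.array(A, dtype=float)
    blocks = np.array_split(np.arange(A.shape[1]), ell)
    norm2 = np.linalg.norm(A) ** 2           # scale for the stopping test
    for _ in range(max_sweeps):
        off = 0.0
        for i in range(ell):
            for j in range(i + 1, ell):      # simple cyclic pair ordering
                cols = np.concatenate([blocks[i], blocks[j]])
                Aij = A[:, cols]
                G = Aij.T @ Aij              # Gram matrix of the block pair
                off += np.linalg.norm(G - np.diag(np.diag(G)))
                _, V = np.linalg.eigh(G)     # small dense eigenproblem
                A[:, cols] = Aij @ V         # level-3 BLAS update of columns
        if off <= tol * norm2:               # all pairs nearly orthogonal
            break
    return np.sort(np.linalg.norm(A, axis=0))[::-1]
```

Note that the one-sided updates touch only the two block columns involved, which is what allows independent pairs to be processed in parallel in the actual algorithm; here the pairs are simply visited in a fixed cyclic order rather than by dynamic ordering.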