The one-sided block Jacobi (OSBJ) method is known to be an efficient method for computing the singular value decomposition on a parallel computer. In this paper, we focus on the most recent variant of the OSBJ method, the one with parallel dynamic ordering and variable blocking, and present both theoretical and experimental analyses of the algorithm. In the first part of the paper, we provide a detailed theoretical analysis of its convergence properties. In the second part, based on preliminary performance measurements on the Fujitsu FX10 and SGI Altix ICE parallel computers, we identify two performance bottlenecks of the algorithm and propose new implementations to resolve them. Experimental results show that they are effective and achieve speedups of up to 1.8 and 1.4 times in total execution time on the FX10 and the Altix ICE, respectively. Comparison with the ScaLAPACK SVD routine PDGESVD shows that our OSBJ solver is efficient when solving small to medium sized problems (n < 10000) using a modest number (< 100) of computing nodes.

However, from the viewpoint of high performance computing, this bi-diagonalization-based approach has two drawbacks. First, the bi-diagonalization step has only fine-grained parallelism and requires O(n) interprocessor communications. This often causes a performance bottleneck. Second, half of the computational work in the bi-diagonalization step is done in the form of level-2 BLAS (matrix-vector multiplication). As level-2 BLAS is a memory-intensive operation and cannot use cache memory efficiently, this tends to lower the performance, especially when the matrix size is large.

An alternative to the bi-diagonalization-based method is the one-sided block Jacobi (OSBJ) method [5][6][7][8][9][10][11][12], which has recently attracted attention. In this method, one first partitions the input matrix logically into ℓ block columns as A = [A_1, A_2, …, A_ℓ].‡

‡ In the main loop, about ℓ/(ℓ + 1) of the total computational work is performed in the form of level-3 BLAS. See
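To make the block-column idea concrete, the following is a minimal serial sketch of the core OSBJ iteration: the columns of A are partitioned into ℓ block columns, and each pair of blocks is repeatedly orthogonalized by diagonalizing its Gram matrix, so that the pairwise update is performed as a dense matrix product (level-3 BLAS). This is only an illustration under simplifying assumptions; it does not reflect the paper's parallel dynamic ordering or variable blocking, and the function name `osbj_svd` and its parameters are hypothetical.

```python
import numpy as np

def osbj_svd(A, ell=4, tol=1e-14, max_sweeps=60):
    """Singular values of A via a serial one-sided block Jacobi sketch.

    Partitions the columns of A into `ell` block columns and repeatedly
    orthogonalizes each pair of blocks by diagonalizing the Gram matrix
    of the pair. On convergence the columns of A are mutually orthogonal,
    and their norms are the singular values.
    """
    A = np.array(A, dtype=float)
    blocks = np.array_split(np.arange(A.shape[1]), ell)
    norm2 = np.linalg.norm(A) ** 2           # scale for the stopping test
    for _ in range(max_sweeps):
        off = 0.0
        for i in range(ell):
            for j in range(i + 1, ell):      # simple cyclic pair ordering
                cols = np.concatenate([blocks[i], blocks[j]])
                Aij = A[:, cols]
                G = Aij.T @ Aij              # Gram matrix of the block pair
                off += np.linalg.norm(G - np.diag(np.diag(G)))
                _, V = np.linalg.eigh(G)     # small dense eigenproblem
                A[:, cols] = Aij @ V         # level-3 BLAS update of columns
        if off <= tol * norm2:               # all pairs nearly orthogonal
            break
    return np.sort(np.linalg.norm(A, axis=0))[::-1]
```

Note that the one-sided updates touch only the two block columns involved, which is what allows independent pairs to be processed in parallel in the actual algorithm; here the pairs are simply visited in a fixed cyclic order rather than by dynamic ordering.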