Abstract. We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on GPU's shared address space, but applicable also to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single GPU setting needs a CPU for the controlling purposes only, while utilizing GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card. Key words. Jacobi (H)SVD, parallel pivot strategies, graphics processing units AMS subject classifications. 65Y05, 65Y10, 65F151. Introduction. Graphics processing units have become a widely accepted tool of parallel scientific computing, but many of the established algorithms still need to be redesigned with massive parallelism in mind. Instead of multiple CPU cores, which are fully capable of simultaneously processing different operations, GPUs are essentially limited to many concurrent instructions of the same kind-a paradigm known as SIMT (single-instruction, multiple-threads) parallelism.SIMT type of parallelism is not the only reason for the redesign. Modern CPU algorithms rely on (mostly automatic) multi-level cache management for speedup. GPUs instead offer a complex memory hierarchy, with different access speeds and patterns, and both automatically and programmatically managed caches. Even more so than in the CPU world, a (less) careful hardware-adapted blocking of a GPU algorithm is the key technique by which considerable speedups are gained (or lost).After the introductory paper [29], here we present a family of the full block [21] and the block-oriented [20] one-sided Jacobi-type algorithm variants for the ordinary (SVD) and the hyperbolic singular value decomposition (HSVD) of a matrix, targeting both a single and the multiple GPUs. The blocking of our algorithm follows the levels of the GPU memory hierarchy; namely, the innermost level of blocking tries to maximize the amount of computation done inside the fastest (and smallest) memory of the registers and manual caches. The GPU's global RAM and caches are considered by the mid-level, while inter-GPU communication and synchronization are among the issues addressed by the outermost level of blocking.At each blocking level an instance of either the block-oriented or the full block Jacobi (H)SVD is run, orthogonalizing pivot columns or block-columns by conceptually the same algorithm at the lower level. Thus, the overall structure of the algorithm is hierarchical (or recursive) in nature, and ready to fit not only the cur...