2015
DOI: 10.1137/130948811

GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging

Abstract: We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, except that elements with duplicate column indices are aggregated on the fly. The main kernel merges small numbers of sparse rows at once using subwarps of threads to realize an early compression effect which reduces the overhead of global memory accesses. The performance is compared with a parallel CPU implementation a…
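The kernel summarized in the abstract builds each output row by repeatedly merging small groups of sorted sparse rows, summing entries that share a column index as it goes. Below is a minimal serial C++ sketch of that pairwise merge-with-aggregation step under an assumed (column index, value) row layout; it illustrates the idea only and is not the paper's subwarp-based GPU kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One sparse row stored as (column index, value) pairs sorted by column,
// as in a CSR matrix. The type name and layout are illustrative assumptions.
using SparseRow = std::vector<std::pair<std::uint32_t, double>>;

// Merge two sorted sparse rows; entries with the same column index are
// summed immediately ("aggregated on the fly") rather than kept as
// duplicates. In row-wise SpGEMM the inputs would be rows of B already
// scaled by the matching nonzeros of A.
SparseRow mergeRows(const SparseRow& a, const SparseRow& b) {
    SparseRow out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first) {
            out.push_back(a[i++]);
        } else if (b[j].first < a[i].first) {
            out.push_back(b[j++]);
        } else {  // duplicate column index: combine the two contributions
            out.emplace_back(a[i].first, a[i].second + b[j].second);
            ++i;
            ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}
```

Merging the contributing rows in small groups, rather than expanding all partial products first, keeps intermediate results compressed; this is the early compression effect that, per the abstract, reduces the overhead of global memory accesses.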

Cited by 75 publications (64 citation statements) | References 31 publications
“…We derive the scattering map by automated segmentation of the µCT data and assigning known scattering coefficients of several tissue types (lung, bone, skin, fat, and remaining soft tissue) [24]. Subsequently, we reconstruct an absorption map from the optical raw data which is particularly important for well-perfused organs such as the heart and the liver [13,20].…”
Section: Discussion
confidence: 99%
“…There has been a flurry of activity in developing algorithms and implementations of SpGEMM for Graphics Processing Units (GPUs). Among those, the algorithm of Gremse et al [26] uses the row-wise formulation of SpGEMM. By contrast, Dalton et al [18] uses the data-parallel ESC (expansion, sorting, and contraction) formulation, which is based on outer products.…”
Section: Notation
confidence: 99%
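For context on the excerpt above: in the row-wise formulation, row i of C = AB is the sum of the rows of B selected and scaled by the nonzeros of row i of A (Gustavson's formulation). The following is a minimal serial C++ sketch under assumed CSR-like types, not code from either cited implementation; the merge-based RMerge algorithm realizes the same formulation but replaces the accumulator with iterative row merging on GPU subwarps.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Illustrative CSR-like representation: each row is a list of
// (column index, value) pairs sorted by column; dimensions are implicit.
using Row = std::vector<std::pair<std::uint32_t, double>>;
using SparseMatrix = std::vector<Row>;

// Row-wise (Gustavson-style) SpGEMM: row i of C = A*B is the sum of the
// rows B[k] scaled by A(i,k), taken over the nonzeros k of row i of A.
// Here an ordered map accumulates duplicate column indices; merge-based
// approaches replace this accumulator with iterative row merging.
SparseMatrix spgemmRowWise(const SparseMatrix& A, const SparseMatrix& B) {
    SparseMatrix C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) {
        std::map<std::uint32_t, double> acc;
        for (const auto& [k, aik] : A[i])       // nonzeros of row i of A
            for (const auto& [j, bkj] : B[k])   // corresponding row of B
                acc[j] += aik * bkj;
        C[i].assign(acc.begin(), acc.end());    // already sorted by column
    }
    return C;
}
```

The ESC formulation attributed to Dalton et al. in the excerpt instead expands all partial products, sorts them by row and column, and contracts duplicates in a final pass.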
“…The single precision and double precision absolute performance of the SpGEMM algorithms that compute C = A² are shown in Figures 8 and 9, respectively. Four GPU methods from CUSP v0.4.0, cuSPARSE v6.5, RMerge [16] and bhSPARSE are evaluated on three GPUs: nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon R9 290X. One CPU method in Intel MKL v11.0 is evaluated on Intel Xeon E5-2630 CPU.…”
Section: Performance Comparison for Matrix Squaring
confidence: 99%
“…Previous GPU SpGEMM methods [2,12,13,14,15,16] have proposed a few solutions for the above problems and demonstrated relatively good time and space complexity. However, the experimental results showed that they either only work best for fairly regular sparse matrices [12,13,16], or bring extra high memory overhead for matrices with some specific sparsity structures [2,14,15]. Moreover, in the usual sense, none of these methods can constantly outperform well optimized SpGEMM approach [17] for multicore CPUs.…”
Section: Introduction
confidence: 99%