2010 First International Conference on Networking and Computing
DOI: 10.1109/ic-nc.2010.39
Parallel Matrix-Matrix Multiplication Based on HPL with a GPU-Accelerated PC Cluster

Cited by 4 publications (4 citation statements)
References 6 publications
“…We evaluate the energy savings of MM due to overclocking only, undervolting only, and the combination of overclocking and undervolting, considering both power consumption and execution time. We used a matrix multiplication application (cuBLAS-MM) as it is a key sub-routine for many scientific applications like HPL and ScaLAPACK [23], [24]. For instance, MM constitutes more than 90% of the computation cost in HPL [23].…”
Section: Evaluation, 4.1 Experimental Setup
confidence: 99%
“…We used a matrix multiplication application (cuBLAS-MM) as it is a key sub-routine for many scientific applications like HPL and ScaLAPACK [23], [24]. For instance, MM constitutes more than 90% of the computation cost in HPL [23]. Our proposed method can easily be integrated into these applications to save a considerable amount of energy.…”
Section: Evaluation, 4.1 Experimental Setup
confidence: 99%
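The claim above, that matrix multiplication accounts for more than 90% of HPL's computation cost, can be illustrated with a simple flop-count model of blocked right-looking LU factorization, the kernel HPL benchmarks. The function and the per-kernel cost formulas below are illustrative approximations, not taken from the cited papers:

```python
# Hypothetical flop-count model for blocked right-looking LU factorization.
# At each step, a b-column panel is factored, a triangular solve forms the
# U panel, and a rank-b DGEMM updates the trailing matrix.
def lu_flops(n, b):
    gemm = other = 0
    for k in range(0, n, b):
        m = n - k          # order of the remaining trailing matrix
        t = m - b          # order of the matrix updated by DGEMM
        other += m * b * b         # approximate panel-factorization cost
        if t > 0:
            other += b * b * t     # triangular solve for the U panel
            gemm += 2 * t * t * b  # rank-b trailing-matrix update (DGEMM)
    return gemm, other

gemm, other = lu_flops(10000, 128)
print(f"DGEMM share of LU flops: {gemm / (gemm + other):.1%}")
```

For problem sizes typical of HPL runs (n much larger than the block size b), the DGEMM share computed by this model exceeds 90%, consistent with the statement quoted above.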
“…In addition, [6] and [8] give an introduction to programming GPUs using CUDA, NVIDIA's language for programming heterogeneous systems that include conventional CPUs and GPUs.…”
Section: Introduction
confidence: 99%
“…Therefore, it is important to overlap computation and communication. In our present work [1], we examined an efficient implementation of Linpack. Our approach is based on the hybrid MPI-OpenMP with thread-to-thread communication (Hybrid TC) model introduced by [9].…”
Section: Introduction
confidence: 99%
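The overlap of computation and communication mentioned above can be sketched with a minimal double-buffering loop: while the current panel is being processed, the next one is already in flight. This is a hedged stand-in, not the paper's implementation; a Python background thread plays the role of a non-blocking MPI transfer, and the names `fetch` and `run` are hypothetical:

```python
import threading
import queue
import numpy as np

def fetch(step, out_q):
    # Stand-in for a non-blocking receive of the next panel (e.g. MPI_Irecv).
    out_q.put(np.full((4, 4), step, dtype=np.float64))

def run(steps=4):
    q = queue.Queue(maxsize=1)
    total = 0.0
    # Post the first transfer before entering the compute loop.
    t = threading.Thread(target=fetch, args=(0, q))
    t.start()
    for step in range(steps):
        panel = q.get()            # wait for the panel currently in flight
        t.join()
        if step + 1 < steps:
            # Start the next transfer so it overlaps this step's compute.
            t = threading.Thread(target=fetch, args=(step + 1, q))
            t.start()
        total += float(panel.sum())  # stand-in for the trailing-matrix update
    return total

print(run())  # 16 elements per panel * (0+1+2+3) = 96.0
```

The key design point, as in the Hybrid TC model cited above, is that the communication for step k+1 is issued before the computation for step k begins, so transfer latency is hidden behind useful work.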