“…KNL and KNM Phis have higher memory capacity than GPU, which allows them to run even those codes which cannot run on GPU. 62,93 The strength of GPU lies in use of massive multithreading and high memory bandwidth. Also, texture units bring large speedup for graphics applications.…”
Section: Discussion
“…In general, GPU provides higher performance than Phi which, in turn, provides higher performance than CPU. Note that some works do not perform vectorization on Phi 16,32,33,41,76,93 whereas others perform only sequential execution on CPU. 45,99 Table 8 shows the factors causing performance bottleneck on Phi.…”
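The snippet above notes that several works skip vectorization on Phi, which matters because wide-SIMD hardware only pays off when loops are expressed in a vectorizable form. As a hedged, software-only illustration (the function names below are illustrative, not from any surveyed work), the following NumPy sketch contrasts a scalar loop with its vectorized equivalent:

```python
import numpy as np

def saxpy_scalar(a, x, y):
    """Scalar loop: one multiply-add per iteration; SIMD lanes stay idle."""
    out = np.empty_like(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    """Vectorized form: the whole-array expression maps onto SIMD units."""
    return a * x + y

x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
# Both forms compute the same result; only the second exploits vector units.
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_vectorized(2.0, x, y))
```

On Phi the analogous step would be compiler auto-vectorization or `#pragma simd`; the Python version only illustrates the scalar-versus-vector distinction.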
Section: Comparative Evaluation and Collaborative Execution
Summary
Intel's Xeon Phi combines the parallel processing power of a many‐core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer‐architecture and high‐performance computing.
“…However, loop interchange and blocking exploit data reuse and achieve much better performance than the basic and transposed methods, which are shown in Figure 2(c) and Figure 2(d), respectively. MMM speedup has been the major goal of many studies [8], [11]-[15] and remains an active topic today. BLAS [13], [16], the basic linear algebra subprograms library, provides a standard blocking method for matrix multiplication.…”
Today’s hardware platforms have parallel processing capabilities, and many parallel programming models have been developed, so efficient implementations of compute-intensive applications on these platforms must be investigated. Dense matrix-matrix multiplication is an important kernel used in many applications, and it is computationally intensive, especially for large matrix sizes. To improve the performance of this kernel, we implement it on the graphics processing unit (GPU) platform using the tiling technique with different tile sizes. Our experimental results show that tiling reduces execution time by 56.89% (i.e., 2.32× faster) compared with the straightforward (STF) implementation, and that a tile size of 32 achieves the highest speed compared with tile sizes of 8 and 16.
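The blocking pattern behind this result can be sketched in plain Python. This is a CPU-side illustration only (the abstract describes a CUDA GPU kernel; `matmul_tiled` and its `tile` parameter are illustrative names, not the authors' code): each tile of A and B is loaded once and reused across a whole tile of C, which on a GPU corresponds to staging tiles in shared memory.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked (tiled) matrix multiplication.

    Each (tile x tile) sub-block of A and B is reused for a whole tile of C,
    mirroring the shared-memory tiling idea used in GPU matmul kernels.
    NumPy slicing clamps at array edges, so dimensions need not be
    multiples of the tile size.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

A = np.random.rand(64, 48)
B = np.random.rand(48, 96)
# The tiled result matches the reference product.
assert np.allclose(matmul_tiled(A, B, tile=16), A @ B)
```

The tile size (8, 16, or 32 in the paper) trades off on-chip memory footprint against reuse per tile, which is why the paper sweeps several values.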
“…Moreover, the proposed architecture can be a competitor to Graphics Processing Units (GPUs) when the matrix size is below 1000. In the literature, GPU performance ranged from 26.4 GFLOPS [20] to 350 GFLOPS [21], but this performance dropped sharply for intermediate matrix sizes; in [21], performance fell from 350 GFLOPS to under 50 GFLOPS when the matrix size was below 1000. Moreover, the circulant matrix multiplication outperformed the other FPGA-based approaches for matrix sizes N > 50, such as [22], which claimed the fastest result for 180 × 180 matrix multiplication.…”
High dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes data flow for the computation of any sequence of matrix operations, removing the need for data movement for intermediate results, together with the individual matrix operations' performance in direct or transposed form (the transpose matrix operation only requires a data addressing modification). The allowed matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication, matrix-by-vector multiplication, and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. In addition, a design environment is also developed, permitting assistance, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N × N matrices, the architecture requires N ALU-RAM blocks and performs in O(N²) time, requiring N² + 7 and N + 7 clock cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex7 FPGA device, the computation for 500 × 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
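The circulant structure that this architecture exploits can be illustrated in software. As a hedged sketch (this is the standard FFT diagonalization of circulant matrices, not the paper's FPGA data path; all names below are illustrative): a circulant matrix is fully defined by its first column c, and its product with a vector is the circular convolution of c and x.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply a circulant matrix (defined by its first column c) by x.

    Circulant matrices are diagonalized by the DFT, so
    C @ x = ifft(fft(c) * fft(x)), i.e. a circular convolution.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_dense(c):
    """Build the dense circulant matrix C with C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

c = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([5.0, 6.0, 7.0, 8.0])
# The FFT-based product matches the explicit dense multiplication.
assert np.allclose(circulant_matvec(c, x), circulant_dense(c) @ x)
```

The regular, shift-only addressing of circulant rows is what lets a hardware unit avoid data movement between chained operations; the FFT identity above is simply the mathematical reason the structure is so cheap to exploit.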