2015 Computational Electromagnetics International Workshop (CEM)
DOI: 10.1109/cem.2015.7237429

Comparative benchmarking: matrix multiplication on a multicore coprocessor and a GPU

Cited by 10 publications (11 citation statements)
References 7 publications
“…KNL and KNM Phis have higher memory capacity than GPUs, which allows them to run even those codes that cannot run on a GPU [62,93]. The strength of GPUs lies in the use of massive multithreading and high memory bandwidth. Texture units also bring large speedups for graphics applications.…”
Section: Discussion (citation type: mentioning)
Confidence: 99%
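
To make the "massive multithreading" concrete: a naive CUDA matrix-multiplication kernel assigns one thread per output element, so an n x n product launches n^2 concurrent threads. The sketch below is an illustrative assumption about that pattern, not code from the cited works; the kernel name and launch configuration are hypothetical.

// Minimal sketch: one thread computes one element of C = A * B.
// All matrices are n x n, stored row-major in device memory.
__global__ void matmul_naive(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];  // row of A dot column of B
        C[row * n + col] = acc;
    }
}

// Illustrative launch: a 2-D grid covering all n*n output elements.
//   dim3 block(16, 16);
//   dim3 grid((n + 15) / 16, (n + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, n);

The high memory bandwidth the excerpt mentions matters because each thread in this naive form streams 2n operand loads per output element; tiled variants trade that bandwidth pressure for shared-memory reuse.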
“…In general, the GPU provides higher performance than the Phi, which in turn provides higher performance than the CPU. Note that some works do not perform vectorization on the Phi [16,32,33,41,76,93], whereas others perform only sequential execution on the CPU [45,99]. Table 8 shows the factors causing performance bottlenecks on the Phi.…”
Section: Comparative Evaluation and Collaborative Execution (citation type: mentioning)
Confidence: 99%
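
As an illustration of what skipping vectorization costs on the Phi: the inner product at the heart of matrix multiplication only engages the Phi's 512-bit SIMD units when the compiler can vectorize it. The fragment below is a minimal sketch, assuming contiguous, non-aliased operands (B pre-transposed so a column is contiguous); the function name is hypothetical.

// Minimal sketch: one output element as a unit-stride dot product.
// __restrict__ tells the compiler the arrays do not alias, which is what
// lets it auto-vectorize the loop; built scalar, the same code leaves the
// SIMD lanes idle, which is the gap the excerpt points at.
void matmul_dot(const float *__restrict__ a,   // one row of A, length n
                const float *__restrict__ bt,  // one column of B, stored contiguously
                float *__restrict__ c, int n) {
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += a[k] * bt[k];
    *c = acc;
}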
“…However, loop interchange and blocking exploit data reuse and achieve much better performance than the basic and transposed methods, as shown in Figures 2(c) and 2(d), respectively. MMM speedup has been the major goal of many studies [8], [11]-[15] and is still an active topic today. BLAS [13], [16], the Basic Linear Algebra Subprograms, provides a standard blocked method for matrix multiplication.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
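
For context on why loop interchange and blocking win: the standard blocked kernel keeps a small tile of each operand resident in cache and orders the loops so the innermost one is unit-stride. A minimal sketch follows; the tile size BS and the function name are illustrative assumptions, not values from the cited study.

// Minimal sketch: cache blocking plus an i-k-j loop order so the innermost
// loop streams through contiguous rows of B and C, reusing each cached tile
// many times. Assumes row-major n x n matrices and C zero-initialized.
#define BS 64  // illustrative tile size; tuned to the cache in practice

void matmul_blocked(const float *A, const float *B, float *C, int n) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                // Multiply one BS x BS tile pair; the i-k-j order keeps the
                // inner loop unit-stride in both B and C.
                for (int i = ii; i < ii + BS && i < n; ++i)
                    for (int k = kk; k < kk + BS && k < n; ++k) {
                        float a = A[i * n + k];  // reused across the j loop
                        for (int j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

The interchange (k before j) is what turns the column-wise stride of the basic i-j-k form into unit-stride accesses, and the blocking is what lets each loaded tile of B serve BS rows of A before it is evicted.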
“…Moreover, the proposed architecture can compete with Graphics Processing Units (GPUs) when the matrix size is below 1000. In the literature, reported GPU performance ranges from 26.4 GFLOPS [20] to 350 GFLOPS [21], but it drops sharply for intermediate matrix sizes; in [21], performance fell from 350 GFLOPS to under 50 GFLOPS when the matrix size was below 1000. In addition, the circulant matrix multiplication outperformed the other FPGA-based approaches for matrix sizes N > 50, including [22], which claimed the fastest result for 180 × 180 matrix multiplications.…”
Citation type: mentioning
Confidence: 99%
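
For background on why circulant structure helps hardware implementations: a circulant matrix is fully determined by its first column, since row i is that column rotated by i, so the whole multiply needs only O(n) coefficient storage and a modular index. The sketch below shows that structure in software; it is an assumption for illustration, not the FPGA design from the citing paper.

// Minimal sketch: y = C x for a circulant matrix C defined by its first
// column c, using the identity C[i][j] = c[(i - j) mod n]. Only the n
// coefficients of c are stored; the regular rotate-and-accumulate access
// pattern is what an FPGA pipeline can exploit.
void circulant_matvec(const float *c, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += c[(i - j + n) % n] * x[j];  // row i of C is c rotated by i
        y[i] = acc;
    }
}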