“…KNL and KNM Phis have higher memory capacity than GPU, which allows them to run even those codes which cannot run on GPU. 62,93 The strength of GPU lies in use of massive multithreading and high memory bandwidth. Also, texture units bring large speedup for graphics applications.…”
Section: Discussion
“…In general, GPU provides higher performance than Phi which, in turn, provides higher performance than CPU. Note that some works do not perform vectorization on Phi 16,32,33,41,76,93 whereas others perform only sequential execution on CPU. 45,99 Table 8 shows the factors causing performance bottleneck on Phi.…”
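The snippet above notes that several works skip vectorization on Phi, which matters because wide-SIMD hardware only pays off when loops are expressed in a vectorizable form. As a hedged, software-only illustration (the function names below are illustrative, not from any surveyed work), the following NumPy sketch contrasts a scalar loop with its vectorized equivalent:

```python
import numpy as np

def saxpy_scalar(a, x, y):
    """Scalar loop: one multiply-add per iteration; SIMD lanes stay idle."""
    out = np.empty_like(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    """Vectorized form: the whole-array expression maps onto SIMD units."""
    return a * x + y

x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
# Both forms compute the same result; only the second exploits vector units.
assert np.allclose(saxpy_scalar(2.0, x, y), saxpy_vectorized(2.0, x, y))
```

On Phi the analogous step would be compiler auto-vectorization or `#pragma simd`; the Python version only illustrates the scalar-versus-vector distinction.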
Section: Comparative Evaluation and Collaborative Execution
Summary
Intel's Xeon Phi combines the parallel processing power of a many‐core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer‐architecture and high‐performance computing.
“…However, loop interchange and blocking exploit data reuse and achieve much better performance than the basic and transposed methods, which are shown in Figure 2(c) and Figure 2(d), respectively. MMM speedup has been the major goal of many studies [8], [11]-[15] and remains an active topic today. BLAS [13], [16], the basic linear algebra subprograms library, provides a standard blocking method for matrix multiplication.…”
Today’s hardware platforms have parallel processing capabilities, and many parallel programming models have been developed, so efficient implementations of compute-intensive applications on these platforms must be investigated. Dense matrix-matrix multiplication is an important kernel used in many applications, and it is computationally intensive, especially for large matrix sizes. To improve the performance of this kernel, we implement it on the graphics processing unit (GPU) platform using the tiling technique with different tile sizes. Our experimental results show that tiling reduces execution time by 56.89% (i.e., 2.32× faster) compared with the straightforward (STF) implementation, and that a tile size of 32 achieves the highest speed compared with tile sizes of 8 and 16.
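The blocking pattern behind this result can be sketched in plain Python. This is a CPU-side illustration only (the abstract describes a CUDA GPU kernel; `matmul_tiled` and its `tile` parameter are illustrative names, not the authors' code): each tile of A and B is loaded once and reused across a whole tile of C, which on a GPU corresponds to staging tiles in shared memory.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked (tiled) matrix multiplication.

    Each (tile x tile) sub-block of A and B is reused for a whole tile of C,
    mirroring the shared-memory tiling idea used in GPU matmul kernels.
    NumPy slicing clamps at array edges, so dimensions need not be
    multiples of the tile size.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

A = np.random.rand(64, 48)
B = np.random.rand(48, 96)
# The tiled result matches the reference product.
assert np.allclose(matmul_tiled(A, B, tile=16), A @ B)
```

The tile size (8, 16, or 32 in the paper) trades off on-chip memory footprint against reuse per tile, which is why the paper sweeps several values.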
“…Moreover, the proposed architecture can be a competitor to Graphics Processing Units (GPUs) when the matrix size is below 1000. In the literature, GPU performance ranged from 26.4 GFLOPS [20] to 350 GFLOPS [21], but this performance dropped sharply for intermediate matrix sizes; in [21], performance fell from 350 GFLOPS to under 50 GFLOPS when the matrix size was below 1000. Moreover, the circulant matrix multiplication outperformed the other FPGA-based approaches for matrix sizes N > 50, such as [22], which claimed the fastest result for 180 × 180 matrix multiplication.…”
High dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes data flow for the computation of any sequence of matrix operations, removing the need for data movement for intermediate results, together with the individual matrix operations' performance in direct or transposed form (the transpose matrix operation only requires a data addressing modification). The allowed matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication, matrix-by-vector multiplication, and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. In addition, a design environment is also developed, permitting assistance, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N × N matrices, the architecture requires N ALU-RAM blocks and performs in O(N²) time, requiring N² + 7 and N + 7 clock cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex7 FPGA device, the computation for 500 × 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
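The circulant structure that this architecture exploits can be illustrated in software. As a hedged sketch (this is the standard FFT diagonalization of circulant matrices, not the paper's FPGA data path; all names below are illustrative): a circulant matrix is fully defined by its first column c, and its product with a vector is the circular convolution of c and x.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply a circulant matrix (defined by its first column c) by x.

    Circulant matrices are diagonalized by the DFT, so
    C @ x = ifft(fft(c) * fft(x)), i.e. a circular convolution.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_dense(c):
    """Build the dense circulant matrix C with C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

c = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([5.0, 6.0, 7.0, 8.0])
# The FFT-based product matches the explicit dense multiplication.
assert np.allclose(circulant_matvec(c, x), circulant_dense(c) @ x)
```

The regular, shift-only addressing of circulant rows is what lets a hardware unit avoid data movement between chained operations; the FFT identity above is simply the mathematical reason the structure is so cheap to exploit.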