An improved parallel singular value algorithm and its implementation for multicore hardware

Haidar, Azzam; Kurzak, Jakub; Łuszczek, Piotr

doi:10.1145/2503210.2503292

Cited by 32 publications

(34 citation statements)

References 58 publications

(87 reference statements)

“…In [15] a rowmajor data layout has been proposed to improve DRAM's bandwidth efficiency and reduce bank conflicts in FPGA's BRAM banks. Also, tile-aware memory layouts have been previously proven effective for multi-core [36] and GPU implementations [37] of linear algebra algorithms, directly affecting their cache performance, bandwidth efficiency, and the degree of parallelism. In this paper, we introduce a general and flexible form called 4D-tiling (subsection IV-A) allowing for optimization of performance and energy efficiency under given constraints such as on-die SPM and DRAM bandwidth usage.…”

Section: B Implementation Challenges Of Modern Convnets (mentioning)

confidence: 99%

Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

Azarkhish, Rossi, Loi, et al. 2018

IEEE Trans. Parallel Distrib. Syst.

High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC), to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our codesign approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC), each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8% of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall, 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy efficiency, which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy-efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.

“…In the experiments, we employed square symmetric matrices for SEVP, and both square and rectangular matrices for the SVD, with random entries uniformly distributed in (0, 1), and dimensions of up to 10000 in steps of 500. We reiterate that the optimal bandwidth w depends not only on the implementation of the first stage, but also on that of the second stage, for which there exist multiple algorithms and tuned implementations, depending on the target architecture [9,18,19,10], the problem size, etc. For this reason, we decided to test the algorithms using six bandwidths: w = {32, 64, 96, 128, 192, 256}.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD

et al. 2018

We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case we revisit the reduction to symmetric band form while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular-band output. In both cases, we describe algorithmic variants of the standard Level-3 BLAS-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introduce an effective look-ahead strategy into the corresponding reduction procedure.

“…This is in contrast with LAPACK, where one tall panel (block of columns) is eliminated at a time, making it difficult to achieve cache efficiency and apply multithreading. In the course of the PLASMA project, tile algorithms have been developed for a wide range of algorithms, including: Cholesky, LU and QR factorizations [11,14,16], as well as reductions to band forms for solving the singular value problem or the eigenvalue problem [23,31].…”

Section: Plasma (mentioning)

confidence: 99%

Porting the PLASMA Numerical Library to the OpenMP Standard

YarKhan, Kurzak, Łuszczek, et al. 2016

Int J Parallel Prog

Self Cite

PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which makes it possible to transition PLASMA from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such a transition.
