Vector Processors for Energy-Efficient Embedded Systems

Dabbelt, Daniel; Colin, Stéphane; Love, Eric; Mao, Howard; Karandikar, Sagar; Asanović, Krste

doi:10.1145/2934495.2934497

Cited by 17 publications

(30 citation statements)

References 7 publications

“…In terms of logic area, a Hwacha instance with four lanes uses 0.354 mm² [6], or 1098 kGE, which is 19% smaller than the equivalent Ara instance. The trend is also valid for equivalent instances with eight and sixteen lanes.…”

Section: A Methodology (mentioning)

confidence: 98%

“…These multipliers make up for 9% of the area difference. Moreover, unlike Ara, these specific Hwacha instances do not support mixed-precision arithmetic [6], and its support would incur into a 4% area overhead [34].…”

Section: A Methodology (mentioning)

confidence: 99%

“…Its flexibility becomes its demise when the model is applied to highly regular applications. In such a case, each core will tend to run the very same instructions, wasting energy by redundant fetch and decode operations [6].…”

Section: A MIMD (mentioning)

confidence: 99%

“…The lack of a banked cache effectively limits Hwacha's memory bandwidth to 128 bit/cycle, starving the FMA units and severely limiting the achievable performance. Table I brings the performance achieved by Ara and the published results for Hwacha [6] side by side. For a fair comparison, the roofline performance boundaries are identical between the compared architectures.…”

Section: Performance Comparison With Hwacha (mentioning)

confidence: 99%

Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI

Cavalcante, Schuiki, Zaruba, et al. 2020. IEEE Trans. VLSI Syst.

In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GLOBALFOUNDRIES 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 × 256 double precision matrix multiplication on sixteen lanes. Ara runs at 1.2 GHz in the typical corner (TT/0.80 V/25 °C), achieving a performance up to 34 DP−GFLOPS. In terms of energy efficiency, Ara achieves up to 67 DP−GFLOPS/W under the same conditions, which is 56% higher than similar vector processors found in literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.

“…They also demonstrated the limitations of auto-vectorization over hand-tuned intrinsic-based vectorization for the applications with irregular memory accesses on a Xeon Phi co-processor. Furthermore, the authors in [23] argued for Cray-style temporal vector processing architectures as an attractive means of exploiting parallelism for the future high performance embedded devices.…”

Section: Related Work (mentioning)

confidence: 99%

Energy Efficiency Effects of Vectorization in Data Reuse Transformations for Many-Core Processors—A Case Study †

Hasib, Natvig, Kjeldsberg, et al. 2017. JLPEA

Abstract: Thread-level and data-level parallel architectures have become the design of choice in many of today's energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel® Core™ i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel® Xeon Phi™ many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction sets. Results using a single-threaded execution on the Haswell and Sandy Bridge systems show that performance and EDP (Energy Delay Product) can be improved through data reuse transformations on the scalar code by a factor of ≈3× and ≈6×, respectively. Compared to scalar code without data reuse optimization, the SSE/AVX2 version achieves ≈10×/17× better performance and ≈92×/307× better EDP, respectively. These results can be improved by 10% to 15% using data reuse techniques. Finally, the most optimized version using data reuse and AVX512 achieves a speedup of ≈35× and an EDP improvement of ≈1192× on the Xeon Phi system. While single-threaded execution serves as a common reference point for all architectures to analyze the effects of data reuse on both scalar and vector codes, scalability with thread count is also discussed in the paper.

ImSPU: Implicit Sharing of Computation Resources Between Vector and Scalar Processing Units

Tan, He, Sun, et al. 2024. Lecture Notes in Computer Science
