VENICE: A compact vector processor for FPGA applications

Severance, Aaron; Lemieux, Guy

doi:10.1109/fpt.2012.6412146

Cited by 37 publications

(23 citation statements)

References 11 publications


“…Apart from those reviewed earlier, none have been optimised for machine learning applications. Similar to previous soft vector processors [23], [24], [25], [26], the KRLS processor architecture offers scalable [23]. Since all vector memory is on-chip, as the number of lanes are increased, the maximum vector memory depth is reduced [24].…”

Section: A System Overview (mentioning)

confidence: 93%

“…With some exceptions, such as [22], most well-known previous soft vector processors [23], [24], [25], [26], have not supported floating-point operations. Apart from those reviewed earlier, none have been optimised for machine learning applications.…”

Section: A System Overview (mentioning)

confidence: 99%

“…A vector scratchpad memory is employed as introduced in VEGAS [23] and VENICE [25], to reduce load/store operations from memory. Furthermore, the microcode memory output directly drives the vector memory to eliminate the need for address registers [26].…”

Section: B Memory Interface (mentioning)

confidence: 99%


A low latency kernel recursive least squares processor using FPGA technology

Pang, Wang, et al. 2013

2013 International Conference on Field-Programmable Technology (FPT)


The kernel recursive least squares (KRLS) algorithm performs non-linear regression in an online manner, with similar computational requirements to linear techniques. In this paper, an implementation of the KRLS algorithm is described that uses pipelining and vectorisation for performance, and microcoding for reusability. The design can be scaled to allow tradeoffs between capacity, performance and area. Compared with a central processing unit (CPU) and digital signal processor (DSP), the processor improves on execution time, latency and energy consumption by factors of 5, 5 and 12 respectively.


“…Examples of such systems include VESPA [1], VEGAS [2], VENICE [3], iDEA [4], [5], Octavo [6], and others [7]- [11]. In general, overlays provide parallelism through "tiling" (duplicating in two dimensions) computing elements such as datapaths and soft processors.…”

Section: Introduction (mentioning)

confidence: 99%

Maximizing speed and density of tiled FPGA overlays via partitioning

LaForest, Steffan 2013

2013 International Conference on Field-Programmable Technology (FPT)


Common practice for large FPGA design projects is to divide sub-projects into separate synthesis partitions to allow incremental recompilation as each sub-project evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components and hence having multi-localities (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than ±10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micro-managing of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51 and 54%) of a mesh of soft processors, across many building block configurations and mesh geometries; (iv) the improvements from partitioning increase as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 up to 437 MHz, its peak performance from 28,968 up to 44,574 MIPS, while increasing its logic area by only 0.85%.
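The closing example in the abstract above can be checked with simple arithmetic. The sketch below assumes, as an illustration only (the abstract does not state it), that peak MIPS is the number of processors times the clock frequency in MHz at one instruction per cycle; the function and variable names are hypothetical.

```python
# Sanity check of the figures quoted in the abstract above.
# Assumption (not stated in the source): each scalar soft processor
# retires one instruction per cycle, so peak MIPS = nodes * MHz.

def peak_mips(nodes: int, freq_mhz: int, instr_per_cycle: int = 1) -> int:
    """Peak throughput of a mesh in millions of instructions per second."""
    return nodes * freq_mhz * instr_per_cycle

NODES = 102          # mesh size quoted in the abstract
F_BEFORE_MHZ = 284   # operating frequency without partitioning
F_AFTER_MHZ = 437    # operating frequency with partitioning

mips_before = peak_mips(NODES, F_BEFORE_MHZ)
mips_after = peak_mips(NODES, F_AFTER_MHZ)

# Relative frequency improvement, in percent.
freq_gain_pct = (F_AFTER_MHZ / F_BEFORE_MHZ - 1) * 100

print(mips_before, mips_after, round(freq_gain_pct))
```

Under that assumption the products reproduce the quoted 28,968 and 44,574 MIPS, and the frequency ratio rounds to the 54% speed improvement reported for the mesh.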


“…Keywords: Field programmable gate arrays; Intellectual property; Single instruction multiple data; System-on-Chip; Intensive signal processing. Abstract: Massively parallel architectures are proposed as a promising solution to speed up data-intensive applications and provide the required computational power. In particular, Single Instruction Multiple Data (SIMD) many-core architectures have been adopted for multimedia and signal processing applications with massive amounts of data parallelism where both performance and flexible programmability are important metrics.…”

mentioning

confidence: 99%

FPGA-based many-core System-on-Chip design

Baklouti, Marquet, Dekeyser, et al. 2015

Microprocessors and Microsystems


Massively parallel architectures are proposed as a promising solution to speed up data-intensive applications and provide the required computational power. In particular, Single Instruction Multiple Data (SIMD) many-core architectures have been adopted for multimedia and signal processing applications with massive amounts of data parallelism where both performance and flexible programmability are important metrics. However, this class of processors has faced many challenges due to its increasing fabrication cost and design complexity. Moreover, the increasing gap between design productivity and chip complexity requires new design methods. The recent evolution of silicon integration technology, on the one hand, and the wide usage of reusable Intellectual Property (IP) cores and FPGAs (Field Programmable Gate Arrays), on the other hand, are attractive solutions to meet these challenges and reduce the time-to-market. The objective of this work is to study the performances of massively parallel SIMD on-chip architecture
