2021
DOI: 10.1109/tc.2020.3027900

Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads

Cited by 41 publications (48 citation statements)
References 16 publications
“…The accelerator is a 4096-core RISC-V platform that has comparable performance to current machine learning accelerators. It is organized in clusters each with 8 individual single-stage RISC-V cores [ZSHB21], each of which is accompanied by a double precision floating point unit capable of two double precision and four single precision flops per cycle. To hide memory latency, all clusters have access to a scratchpad memory and a large L2 data cache.…”
Section: Accelerator (mentioning)
confidence: 99%
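The figures in the statement above imply a cluster count and a peak per-cycle throughput. A minimal arithmetic sketch, using only the numbers quoted (4096 cores, 8 cores per cluster, two double precision or four single precision flops per cycle per core); the constant and function names are illustrative and not taken from the cited work, and no clock frequency is assumed because the statement does not give one:

```python
# Figures taken directly from the quoted citation statement.
TOTAL_CORES = 4096
CORES_PER_CLUSTER = 8
DP_FLOPS_PER_CYCLE_PER_CORE = 2  # double precision
SP_FLOPS_PER_CYCLE_PER_CORE = 4  # single precision


def peak_flops_per_cycle(cores: int, flops_per_core: int) -> int:
    """Upper bound on flops per cycle if every FPU is busy every cycle."""
    return cores * flops_per_core


clusters = TOTAL_CORES // CORES_PER_CLUSTER
dp_peak = peak_flops_per_cycle(TOTAL_CORES, DP_FLOPS_PER_CYCLE_PER_CORE)
sp_peak = peak_flops_per_cycle(TOTAL_CORES, SP_FLOPS_PER_CYCLE_PER_CORE)

print(f"clusters: {clusters}")
print(f"peak DP flops/cycle: {dp_peak}")
print(f"peak SP flops/cycle: {sp_peak}")
```

These are peak values only; sustained throughput depends on how well memory latency is hidden, which the statement attributes to the scratchpad memory and L2 data cache.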
“…A buffer of the same size can be added for output depth slice (1), so that data transfers by the DMA engine run fully in background. 2 In total, roughly 48 KiB are required as buffers.…”
Section: Space Complexity (mentioning)
confidence: 99%
“…The calls to DmaWait() make sure that the data for the current iteration is present in local memory before the computation. 2 The DMA transfer buffer could be shared by the input depth slices and the filter parameters, since the total amount of data in transfer in the on-chip network does not depend on the type of data being transferred. This would save local memory in the cluster, but it requires the RTE to dynamically partition the DMA transfer buffer between different data types and variables, which is not trivial.…”
Section: Space Complexity (mentioning)
confidence: 99%
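The two statements above describe a double-buffering pattern: the DMA engine fills one local buffer in the background while the cores compute on the other, and a wait call guarantees the current data has arrived before it is used. A minimal sketch of that pattern follows, assuming a single worker thread as a stand-in for the DMA engine. The helpers dma_start and dma_wait are hypothetical analogues of the DmaWait() call mentioned in the quote; they are not the cited work's API.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tiles(tiles, compute):
    """Double-buffered loop: prefetch tile i+1 while computing on tile i."""
    results = []
    if not tiles:
        return results

    buffers = [None, None]  # two local buffers, used alternately

    # A single worker thread stands in for the DMA engine (assumption).
    with ThreadPoolExecutor(max_workers=1) as dma:

        def dma_start(slot, tile):
            # Hypothetical helper: copy `tile` into buffer `slot` in the background.
            def copy():
                buffers[slot] = list(tile)
            return dma.submit(copy)

        def dma_wait(transfer):
            # Hypothetical analogue of DmaWait(): block until the copy is done.
            transfer.result()

        pending = dma_start(0, tiles[0])
        for i in range(len(tiles)):
            cur = i % 2
            dma_wait(pending)  # data for iteration i is now in buffers[cur]
            if i + 1 < len(tiles):
                # Prefetch the next tile into the other buffer.
                pending = dma_start(1 - cur, tiles[i + 1])
            results.append(compute(buffers[cur]))

    return results


print(process_tiles([[1, 2], [3, 4], [5]], sum))
```

The sketch keeps a separate buffer pair for one data stream. Sharing a single transfer buffer across data types, as the second quote suggests, would save local memory but would need the dynamic partitioning that the quote describes as non-trivial.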
“…The request for extremely efficient devices and the need for processing FP intensive workloads also led to designs that couple a tiny integer processor with bigger, performant FPUs. This is the case of Snitch [8], a tiny 32-bit integer RISC-V core extended with an FP subsystem containing: a multi-format FPU capable of single-cycle executions, optimized for performance and energy efficiency [9]; an FP register file; a set of registers to break long combinatorial paths commonly found around FPUs; a decoder; and a load-store unit. Together Snitch integer core and its FP subsystem form a Snitch core complex (CC).…”
Section: Introduction and Related Work (mentioning)
confidence: 99%