2014
DOI: 10.1145/2567932

Ultra-low-power adder stage design for exascale floating point units

Abstract: Currently, the most powerful supercomputers can provide tens of petaflops. Future many-core systems are estimated to provide an exaflop. However, the power budget limitation makes these machines still unfeasible and unaffordable. Floating Point Units (FPUs) are critical from both the power consumption and performance points of view of today's microprocessors and supercomputers. Literature offers very different designs. Some of them are focused on increasing performance no matter the penalty, and others on decr…

Cited by 13 publications (5 citation statements)
References 35 publications
“…Table 4 compares the costs among the MAC units of FP32, bfloat16 and the Mitch-w, as synthesized with a 32nm standard library from Synopsys. The Mitch-w6 HDL code is available in [32], the FP32 MAC design is from [33], and we modified the FP32 design to create the bfloat16 MAC. Synopsys Design Compiler automatically synthesized the fixed-point MAC, and Mitch-w6 is followed by an exact fixed-point adder.…”
Section: Comparison Of Costs
Type: mentioning
confidence: 99%
“…These files consist of a .xml file defining the layers, sizes and connections and a .bin file determining the weights of each parameter. When executing the optimization, the user can introduce some other configuration parameters such as floating point precision [42] (FP32, FP16, INT8), channel inversion, layer fusing, etc. However, the method used by the MO to optimize the model is not explained in detail, so developers lose some control over the fine-tuning of the model.…”
Section: Architecture Of the System
Type: mentioning
confidence: 99%
“…The Qualcomm Hexagon DSP processor has been shown to achieve a maximum power efficiency of 58mW/GHz on an architecture with 256KB of RAM, two 32 bit ALUs, two 64 bit vector execution units, and an asynchronous FIFO bus interface [5]. Taking this as a starting point, the logic units could be replaced with single precision FPUs shown to have 1.44mW power consumption without significant additional cost [6]. Combining these low-power techniques, a DICE node would be able to perform single-cycle floating-point operations at 1 GHz while consuming 68mW with a cost of $2.22.…”
Section: Projections
Type: mentioning
confidence: 99%