Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Recent research has shown that a composition of many smaller expert models, each with several orders of magnitude fewer parameters, can match or exceed the capabilities of monolithic LLMs. Composition of Experts (CoE) is a modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), a commercial dataflow accelerator architecture co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2× to 13× on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19×, speeds up model switching time by 15× to 31×, and achieves an overall speedup of 3.7× over a DGX H100 and 6.6× over a DGX A100.
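To make the operational-intensity argument concrete, the sketch below estimates FLOPs per byte of off-chip traffic for a small GEMM followed by an element-wise activation, comparing an unfused execution (the intermediate tensor round-trips through memory) with a fused one (the intermediate stays on chip). This is an illustrative back-of-the-envelope model only, not the paper's methodology; the shapes, fp16 element size, and choice of operators are assumptions made for the example.

```python
# Illustrative sketch (assumed shapes and fp16 precision, not from the paper):
# operational intensity = FLOPs / bytes moved to and from off-chip memory.

BYTES = 2  # fp16 element size

def gemm_flops(m, k, n):
    # Multiply-accumulate count for an (m x k) @ (k x n) GEMM.
    return 2 * m * k * n

def tensor_bytes(*shape):
    size = 1
    for d in shape:
        size *= d
    return size * BYTES

# Small batch, as in a single decoding step of a small expert model.
m, k, n = 32, 4096, 4096

flops = gemm_flops(m, k, n) + m * n  # GEMM plus an element-wise activation

# Unfused: the GEMM writes its output to memory, then the activation
# reads it back and writes its own result.
unfused_bytes = (
    tensor_bytes(m, k) + tensor_bytes(k, n) + tensor_bytes(m, n)  # GEMM in/out
    + 2 * tensor_bytes(m, n)                                      # activation in/out
)

# Fused: the intermediate never leaves the chip; only the original inputs
# and the final output touch off-chip memory.
fused_bytes = tensor_bytes(m, k) + tensor_bytes(k, n) + tensor_bytes(m, n)

print(f"unfused operational intensity: {flops / unfused_bytes:.1f} FLOPs/byte")
print(f"fused   operational intensity: {flops / fused_bytes:.1f} FLOPs/byte")
```

At small batch sizes the extra round trips dominate the byte count, so the unfused version sits lower on the roofline; fusing (or streaming intermediates through on-chip memory) raises operational intensity and hence achievable utilization.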