Proceedings of the 49th Annual International Symposium on Computer Architecture 2022
DOI: 10.1145/3470496.3533727
Software-hardware co-design for fast and scalable training of deep learning recommendation models

Abstract: Deep learning recommendation models (DLRMs) have been used across many business-critical services at Meta and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper, we present Neo, a software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. Neo employs a novel 4D parallelism strategy that combines table-wise, row-wise, column-wise, and data parallelism for training massive embedding operators in DLRMs. In addition, …
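The 4D parallelism strategy named in the abstract can be illustrated with a small placement sketch. Everything below is hypothetical: the `assign_tables` helper, the table sizes, and the greedy heuristic are illustrative assumptions, not Neo's actual placement algorithm. The sketch covers only the choice between table-wise placement and row-wise sharding; Neo's full scheme additionally applies column-wise sharding and data parallelism.

```python
# Illustrative sketch: place embedding tables on devices.
# Hypothetical heuristic, NOT Neo's actual implementation.

def assign_tables(tables, num_devices):
    """Greedy table-wise placement: put each table on the least-loaded
    device. A table as large as the per-device budget is instead split
    row-wise across all devices."""
    loads = [0] * num_devices
    plan = {}
    # Toy per-device budget: the size of the largest table.
    capacity = max(t["rows"] * t["dim"] for t in tables)
    # Place largest tables first so the greedy balance works well.
    for t in sorted(tables, key=lambda t: -t["rows"] * t["dim"]):
        size = t["rows"] * t["dim"]
        if size >= capacity:
            # Too big for one device: shard rows across all devices.
            plan[t["name"]] = ("row-wise", list(range(num_devices)))
        else:
            dev = loads.index(min(loads))  # least-loaded device
            loads[dev] += size
            plan[t["name"]] = ("table-wise", [dev])
    return plan

# Hypothetical workload: three embedding tables, four devices.
tables = [
    {"name": "user_id", "rows": 10**7, "dim": 64},
    {"name": "ad_id",   "rows": 10**5, "dim": 64},
    {"name": "page_id", "rows": 10**4, "dim": 32},
]
plan = assign_tables(tables, num_devices=4)
```

In this toy run the largest table is sharded row-wise while the two small tables land whole on separate devices, which is the intuition behind combining the parallelism dimensions: different tables have very different sizes and access patterns, so no single sharding strategy fits all of them.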

Cited by 61 publications (27 citation statements). References 48 publications.
“…Applicability of single CPU-GPU ScratchPipe. Recent literature from Facebook elaborates on the scale of their deployed RecSys model sizes, exhibiting up to a 100× difference in the number of embedding tables per model [61], [62]. Specifically, RecSys models for content filtering are known to be smaller than those for ranking [25], enabling the overall memory footprint to fit within a single node's CPU memory capacity.…”
Section: G Discussion
confidence: 99%
“…Applicability of single CPU-GPU ScratchPipe. Recent literature from Facebook elaborates on the scale of their deployed RecSys model sizes, exhibiting up to a 100× difference in the number of embedding tables per model [14,39]. Specifically, RecSys models for content filtering are known to be smaller than those for ranking [15], enabling the overall memory footprint to fit within a single node's CPU memory capacity.…”
Section: Discussion
confidence: 99%
“…Our production recommendation models are built on the open-source DLRM architecture [63]. Modern DLRMs are massive, with over 12 trillion parameters to train, requiring ≈ 1 zettaFLOP of total compute [59]. To meet the compute requirements of DLRM training, we built the ZionEX hardware platform.…”
Section: Recommendation Model Background
confidence: 99%
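Using only the two figures quoted above (12 trillion parameters, ≈ 1 zettaFLOP of total training compute), a quick back-of-envelope division gives the training compute per parameter; the derived ratio is arithmetic on the quoted numbers, not a figure from the paper.

```python
# Back-of-envelope check using only the figures quoted above.
params = 12e12        # 12 trillion trainable parameters
total_flops = 1e21    # ~1 zettaFLOP of total training compute
flops_per_param = total_flops / params  # ~8.3e7 FLOPs per parameter
```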
“…A node also contains four CPU sockets, each with a dedicated 100 Gbps NIC connected to our regular datacenter network for data ingestion. We refer readers to [59] for more details on ZionEX.…”
Section: Recommendation Model Background
confidence: 99%