2020
DOI: 10.48550/arxiv.2011.05497
Preprint

Understanding Training Efficiency of Deep Learning Recommendation Models at Scale

Abstract: The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of compute cycles at our large-scale datacenters, the use of GPUs came with various challenges due to having both compute-intensive and memory-intensive components. GPU performance and efficiency of these recommendation models are largely affected by model architecture configurat…
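To make the abstract's distinction concrete, here is a minimal sketch of a DLRM-style recommendation model: the embedding tables over sparse categorical features are the memory-intensive component, while the dense-feature and top MLPs are the compute-intensive component. The class name, layer sizes, and the use of simple concatenation in place of DLRM's pairwise feature interaction are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal DLRM-style sketch (hypothetical sizes, not the paper's configuration).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_embeddings=100_000, emb_dim=16, num_sparse=4, num_dense=13):
        super().__init__()
        # Memory-intensive: one embedding table per sparse (categorical) feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(num_embeddings, emb_dim) for _ in range(num_sparse)]
        )
        # Compute-intensive: bottom MLP over dense features, top MLP over combined features.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim)
        )
        self.top_mlp = nn.Sequential(
            nn.Linear(emb_dim * (num_sparse + 1), 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse):
        # dense: (batch, num_dense) floats; sparse: (batch, num_sparse) category ids.
        dense_emb = self.bottom_mlp(dense)
        sparse_embs = [emb(sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        # Concatenation stands in for DLRM's pairwise feature interaction.
        x = torch.cat([dense_emb] + sparse_embs, dim=1)
        return torch.sigmoid(self.top_mlp(x))

model = TinyDLRM()
preds = model(torch.randn(8, 13), torch.randint(0, 100_000, (8, 4)))
```

In practice the embedding tables can reach hundreds of gigabytes, which is why the paper studies how architecture configuration shifts the balance between the memory-bound and compute-bound parts on GPU trainers.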

Cited by 5 publications (10 citation statements)
References 51 publications
“…DeepRecSys [19] optimized scheduling of inference requests across CPUs and GPUs. Acun et al [8] characterized the implications of recommendation models' architectures on GPU trainers. Storage Formats for ML.…”
Section: Related Work
confidence: 99%
“…We do so for two reasons. First, they enable important components and services across a wide breadth of domains, seeing widespread adoption at Facebook [8, 19-21, 34], Google [12, 15, 23], Microsoft [18], Baidu [50], and many other hyperscale companies [41, 51]. Second, training these models, which often consist of trillions of parameters [32, 37], places enormous demands on the end-to-end training and data ingestion pipeline.…”
Section: Introduction
confidence: 99%
“…4a, large-scale recommendation models are usually trained with multiple trainers (nodes) working on different partitions of the data. Each trainer calculates gradients and synchronizes with a centralized parameter server; the server collects and averages the gradients and sends them back to each trainer for weight updates [39]. For example, Hogwild [40] can be used to parallelize stochastic gradient descent (SGD), leveraging multiple CPUs/GPUs to perform the forward and backward passes.…”
Section: Model Growth
confidence: 99%
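The citation above describes the synchronous parameter-server pattern. The following is a minimal in-process sketch of that pattern, not the distributed implementation used by the citing work: each trainer computes a gradient on its own data shard, the server averages the gradients and applies one update, and the refreshed weights are used by all trainers in the next round. The helper names (trainer_gradient, server_step) and the linear least-squares objective are illustrative assumptions.

```python
# Toy synchronous parameter-server round (single process, illustrative only).
import numpy as np

def trainer_gradient(w, X, y):
    # Gradient of the per-shard mean squared error 0.5 * ||Xw - y||^2 / n.
    n = len(y)
    return X.T @ (X @ w - y) / n

def server_step(w, grads, lr=0.1):
    # Server averages the trainers' gradients and applies one SGD update.
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
shards = []
for _ in range(4):  # four trainers, each holding its own data partition
    X = rng.normal(size=(64, 5))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=64)))

w = np.zeros(5)
for step in range(200):
    grads = [trainer_gradient(w, X, y) for X, y in shards]  # trainers compute in parallel
    w = server_step(w, grads)                               # server broadcasts updated weights
```

Hogwild-style training, also mentioned in the citation, differs in that workers update shared weights asynchronously without this averaging barrier.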
“…Tasks such as click-through rate (CTR) and buy-through rate (BTR) prediction are widely adopted in industrial applications, influencing ad revenues at the level of billions of dollars for search engines such as Google, Bing, and Baidu [78]. Moreover, 80% of movies watched on Netflix [30] and 60% of videos clicked on YouTube [25] are driven by automatic recommendations; over 40% of user engagement on Pinterest is powered by its Related Pins recommendation module [58]; over half of the Instagram community has visited the recommendation-based Instagram Explore to discover new content relevant to their interests [12]; and up to 35% of Amazon's revenue is driven by recommender systems [18, 104]. At Kwai, we also observe that recommendation plays an important role in video sharing: more than 300 million daily active users explore videos selected by recommender systems from billions of candidates.…”
Section: Introduction
confidence: 99%
“…Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters. Now we get back to (12). Summing (12) from 𝑡 = 0 to 𝑡 = 𝑇 − 1, we obtain…”
confidence: 99%