2020
DOI: 10.48550/arxiv.2011.05497
Preprint

Understanding Training Efficiency of Deep Learning Recommendation Models at Scale

Abstract: The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of compute cycles at our large-scale datacenters, the use of GPUs came with various challenges due to having both compute-intensive and memory-intensive components. GPU performance and efficiency of these recommendation models are largely affected by model architecture configurat…
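To make the abstract's distinction concrete, here is a minimal sketch of a DLRM-style recommendation model: the embedding tables over sparse categorical features are the memory-intensive component, while the dense-feature and top MLPs are the compute-intensive component. The class name, layer sizes, and the use of simple concatenation in place of DLRM's pairwise feature interaction are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal DLRM-style sketch (hypothetical sizes, not the paper's configuration).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_embeddings=100_000, emb_dim=16, num_sparse=4, num_dense=13):
        super().__init__()
        # Memory-intensive: one embedding table per sparse (categorical) feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(num_embeddings, emb_dim) for _ in range(num_sparse)]
        )
        # Compute-intensive: bottom MLP over dense features, top MLP over combined features.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim)
        )
        self.top_mlp = nn.Sequential(
            nn.Linear(emb_dim * (num_sparse + 1), 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense, sparse):
        # dense: (batch, num_dense) floats; sparse: (batch, num_sparse) category ids.
        dense_emb = self.bottom_mlp(dense)
        sparse_embs = [emb(sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        # Concatenation stands in for DLRM's pairwise feature interaction.
        x = torch.cat([dense_emb] + sparse_embs, dim=1)
        return torch.sigmoid(self.top_mlp(x))

model = TinyDLRM()
preds = model(torch.randn(8, 13), torch.randint(0, 100_000, (8, 4)))
```

In practice the embedding tables can reach hundreds of gigabytes, which is why the paper studies how architecture configuration shifts the balance between the memory-bound and compute-bound parts on GPU trainers.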

Cited by 5 publications (10 citation statements)
References 51 publications
“…DeepRecSys [19] optimized scheduling of inference requests across CPUs and GPUs. Acun et al [8] characterized the implications of recommendation models' architectures on GPU trainers. Storage Formats for ML.…”
Section: Related Work
confidence: 99%
“…We do so for two reasons. First, they enable important components and services across a wide breadth of domains, seeing widespread adoption at Facebook [8, 19-21, 34], Google [12, 15, 23], Microsoft [18], Baidu [50], and many other hyperscale companies [41, 51]. Second, training these models, which often consist of trillions of parameters [32, 37], places enormous demands on the end-to-end training and data ingestion pipeline.…”
Section: Introduction
confidence: 99%
“…4a, large-scale recommendation models are usually trained with multiple trainers (nodes) working on different partitions of the data. Each trainer calculates gradients and synchronizes with a centralized parameter server; the server collects and averages the gradients and sends them back to each trainer for weight updates [39]. For example, Hogwild [40] can be used to parallelize stochastic gradient descent (SGD), leveraging multiple CPUs/GPUs to perform the forward and backward passes.…”
Section: Model Growth
confidence: 99%
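The citation above describes the synchronous parameter-server pattern. The following is a minimal in-process sketch of that pattern, not the distributed implementation used by the citing work: each trainer computes a gradient on its own data shard, the server averages the gradients and applies one update, and the refreshed weights are used by all trainers in the next round. The helper names (trainer_gradient, server_step) and the linear least-squares objective are illustrative assumptions.

```python
# Toy synchronous parameter-server round (single process, illustrative only).
import numpy as np

def trainer_gradient(w, X, y):
    # Gradient of the per-shard mean squared error 0.5 * ||Xw - y||^2 / n.
    n = len(y)
    return X.T @ (X @ w - y) / n

def server_step(w, grads, lr=0.1):
    # Server averages the trainers' gradients and applies one SGD update.
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
shards = []
for _ in range(4):  # four trainers, each holding its own data partition
    X = rng.normal(size=(64, 5))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=64)))

w = np.zeros(5)
for step in range(200):
    grads = [trainer_gradient(w, X, y) for X, y in shards]  # trainers compute in parallel
    w = server_step(w, grads)                               # server broadcasts updated weights
```

Hogwild-style training, also mentioned in the citation, differs in that workers update shared weights asynchronously without this averaging barrier.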
“…Tasks such as click-through rate (CTR) and buy-through rate (BTR) prediction are widely adopted in industrial applications, influencing ad revenues at the level of billions of dollars for search engines such as Google, Bing, and Baidu [78]. Moreover, 80% of movies watched on Netflix [30] and 60% of videos clicked on YouTube [25] are driven by automatic recommendations; over 40% of user engagement on Pinterest is powered by its Related Pins recommendation module [58]; over half of the Instagram community has visited the recommendation-based Instagram Explore to discover new content relevant to their interests [12]; and up to 35% of Amazon's revenue is driven by recommender systems [18, 104]. At Kwai, we also observe that recommendation plays an important role in video sharing: more than 300 million daily active users explore videos selected by recommender systems from billions of candidates.…”
Section: Introduction
confidence: 99%
“…Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters. Now we get back to (12). Summing (12) from 𝑡 = 0 to 𝑡 = 𝑇 − 1, we obtain…”
confidence: 99%