2022
DOI: 10.48550/arxiv.2204.04903
Preprint
PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems

Abstract: The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Two recent trends stand out: 1) recommender systems must be trained in a timely manner to cope with ever-growing new products and ever-changing user interests from online marketing and social networks; 2) state-of-the-art recommendation models introduce deep neural network (DNN) modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot meet th…

Cited by 1 publication (2 citation statements)
References 21 publications (31 reference statements)
“…Parameter Server [9,20] is one of the most popular architectures for distributed DLRM training, and DMAML [5] customizes the Parameter Server architecture for MAML training in the CPU cluster. However, the two update loops in meta learning double the computation, and the computation-intensive dense layers become more complicated in DLRM [36,38], which makes them time-consuming to compute on CPUs and calls for GPU acceleration. Nevertheless, Parameter Server is mainly used in CPU clusters, and its design underutilizes the capability of the GPU, since the embedding layers held in the servers are I/O- and communication-intensive operators [30].…”
Section: Introduction
confidence: 99%
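A minimal sketch, assuming a PyTorch-style setup with a synthetic model and random data (not the cited systems' code), of the two MAML update loops the quote refers to: an inner adaptation pass on a task's support set and an outer meta-update on its query set, which together roughly double the computation of an ordinary training step.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def forward_with(params, x):
    # Functional forward pass with explicit parameters, so the inner
    # update stays differentiable for the outer (meta) update.
    h = torch.relu(x @ params[0].t() + params[1])
    return h @ params[2].t() + params[3]

for step in range(100):
    meta_opt.zero_grad()
    for _ in range(4):  # tasks per meta-batch (synthetic tasks here)
        x_s, y_s = torch.randn(8, 16), torch.randn(8, 1)  # support set
        x_q, y_q = torch.randn(8, 16), torch.randn(8, 1)  # query set
        params = list(model.parameters())
        # Inner loop: one adaptation step on the task's support set.
        support_loss = nn.functional.mse_loss(forward_with(params, x_s), y_s)
        grads = torch.autograd.grad(support_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop: meta-gradient from the adapted parameters on the query set.
        query_loss = nn.functional.mse_loss(forward_with(adapted, x_q), y_q)
        query_loss.backward()
    meta_opt.step()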
“…Secondly, meta learning requires different data management from traditional deep learning training, and the conventional I/O design bottlenecks the training speed [2,24]. Meta learning requires assembling the batch data at both the task level and the batch level, whereas traditional deep learning only requires the batch level in the training pipeline [1,36,38]. To be more specific, each worker may hold batch data from different tasks, but the samples in a batch should belong to the identical task after shuffling for correctness.…”
Section: Introduction
confidence: 99%
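A small, hypothetical sketch of the task-plus-batch-level assembly described in the quote (names such as task_level_batches are illustrative, not the cited system's API): samples are first grouped by task, shuffled within each task, and only then cut into batches, so every batch a worker receives is homogeneous in its task even though batch order is shuffled across tasks.

import random
from collections import defaultdict

def task_level_batches(samples, batch_size, seed=0):
    """samples: iterable of (task_id, sample) pairs; returns (task_id, batch) pairs."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task_id, sample in samples:          # task-level grouping
        by_task[task_id].append(sample)

    batches = []
    for task_id, task_samples in by_task.items():
        rng.shuffle(task_samples)            # shuffle within a task
        for i in range(0, len(task_samples), batch_size):
            batches.append((task_id, task_samples[i:i + batch_size]))
    rng.shuffle(batches)                     # shuffle batch order across tasks
    return batches

# Usage: every emitted batch contains samples from a single task only.
data = [(t, {"feature": i}) for t in range(3) for i in range(10)]
for task_id, batch in task_level_batches(data, batch_size=4):
    print(task_id, len(batch))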