Abstract: Embedding learning is an important technique in deep recommendation models for mapping categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which becomes the storage and efficiency bottleneck. Distributed training solutions have been adopted to partition the embedding tables across multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge of distributed systems…
“…ScratchPipe (Kwon & Rhu, 2022) and RecShard (Sethi et al., 2022) tackle the problem of embedding access latency in hybrid CPU-GPU training systems. ScratchPipe addresses the problem through a run-ahead GPU-side cache that attempts to have all embedding accesses hit in local GPU HBM, while RecShard uses a mixed-integer linear program and the per-embedding-table distributions to statically place the most frequently accessed rows in GPU HBM. AutoShard (Zha et al., 2022) focuses on the sharding of embedding tables in a multi-GPU-only training system, and uses deep reinforcement learning and a neural-network-based cost model to make its placement decisions.…”
Section: Related Work (mentioning, confidence: 99%)
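The static-placement idea RecShard builds on can be illustrated with a much simpler greedy stand-in. The sketch below is not RecShard's mixed-integer linear program; the row budget and the Zipf-like access counts are hypothetical, and the point is only that a skewed per-table access distribution lets the hottest rows be pinned in GPU HBM while the long tail stays in host memory.

```python
import numpy as np

def greedy_hot_row_placement(access_counts, hbm_row_budget):
    """Pin the most frequently accessed embedding rows in GPU HBM,
    leaving the long tail in host (CPU) memory."""
    order = np.argsort(access_counts)[::-1]   # hottest rows first
    hbm_rows = order[:hbm_row_budget]         # served from GPU HBM
    host_rows = order[hbm_row_budget:]        # served from host memory
    return hbm_rows, host_rows

# Hypothetical skewed access distribution for one embedding table.
counts = np.random.zipf(1.2, size=10_000)
hot, cold = greedy_hot_row_placement(counts, hbm_row_budget=1_000)
```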
“…Furthermore, while the sharding problem has been increasingly explored in recent works due to its importance (Adnan et al., 2021; Lui et al., 2021; Sethi et al., 2022; Zha et al., 2022), they all, to our knowledge, assume that the embedding tables to be sharded are either one-hot, meaning at most one embedding row per table will be accessed per training sample, or sum-pooled, meaning all embedding rows accessed within a table by a training sample will be aggregated via summation before proceeding through the model.…”
Section: Introduction (mentioning, confidence: 99%)
“…Embedding Pooling. A simple and common way to perform pooling is element-wise summation across all gathered embeddings to generate the output pooled embedding, known as sum-pooling (Gupta et al., 2020a; Ke et al., 2020; Kwon & Rhu, 2022; Mudigere et al., 2021; Sethi et al., 2022; Zha et al., 2022; Zhou et al., 2019b; Wilkening et al., 2021). The use of sum-pooling to perform embedding aggregation comes with beneficial properties with respect to the sharding problem.…”
Section: Introduction (mentioning, confidence: 99%)
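For concreteness, the sum-pooling described in the excerpt above can be written in a few lines of PyTorch. The table size, dimension, and row ids below are arbitrary, and `EmbeddingBag` with `mode="sum"` is just one common way to express the lookup-then-sum.

```python
import torch

num_rows, dim = 1000, 16                      # hypothetical table shape
table = torch.nn.EmbeddingBag(num_rows, dim, mode="sum")

row_ids = torch.tensor([3, 42, 7, 999])       # rows accessed by one training sample
offsets = torch.tensor([0])                   # a single sample (one bag)
pooled = table(row_ids, offsets)              # shape: (1, dim)

# Equivalent manual pooling: gather the rows, then element-wise sum.
manual = table.weight[row_ids].sum(dim=0, keepdim=True)
assert torch.allclose(pooled, manual)
```

Because summation is associative and commutative, each shard can reduce the rows it owns locally and communicate only a single partial sum per sample, which is presumably the kind of sharding-friendly property the excerpt alludes to.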
“…To our knowledge, however, prior works that have developed novel solutions to the DLRM sharding problem have done so under the assumption that the gathered embeddings will all be sum-pooled before being consumed by the following model layers (Adnan et al., 2021; Lui et al., 2021; Sethi et al., 2022; Zha et al., 2022). Unfortunately, these benefits do not directly apply to sequence-based pooling methods, where each embedding interacts with every other embedding in the sequence, requiring the full sequence to be present on the requesting GPU to perform the operation.…”
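To see why this breaks the partial-sum trick, consider a self-attention-style pooling over a gathered sequence. The shapes below are arbitrary and the code is only a generic sketch of pairwise interaction, not any specific production model.

```python
import torch

seq_len, dim = 64, 16
seq = torch.randn(seq_len, dim)     # the user's gathered embedding rows

scores = seq @ seq.T / dim ** 0.5   # every row interacts with every other row
weights = torch.softmax(scores, dim=-1)
attended = weights @ seq            # shape: (seq_len, dim)
```

Since every output row depends on all input rows, no shard can pre-reduce its portion; the full sequence has to be materialized on the GPU that consumes it.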
Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs that show great improvements over their prior sum-pooling-based counterparts at capturing users' long-term interests. These improvements come at immense system cost, however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard reduced overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of nearly 6x over the prior state-of-the-art approach.
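The abstract's per-row tiering can be pictured with a toy assignment routine. This is not FlexShard's actual algorithm: the replication budget, GPU count, and access distribution are invented, and the sketch only captures the idea of replicating the hottest rows on every device (so their reads stay local) while sharding the tail.

```python
import numpy as np

def tiered_row_assignment(access_probs, replicate_budget, num_gpus):
    order = np.argsort(access_probs)[::-1]
    replicated = set(order[:replicate_budget].tolist())           # copied to every GPU
    tail = order[replicate_budget:]
    sharded = {int(r): i % num_gpus for i, r in enumerate(tail)}  # round-robin owners
    return replicated, sharded

# Hypothetical skewed per-row access probabilities.
probs = np.random.zipf(1.1, size=5_000).astype(float)
probs /= probs.sum()
replicated, owner_of = tiered_row_assignment(probs, replicate_budget=500, num_gpus=8)
```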
“…Reinforcement Learning. Reinforcement learning has shown strong performance in many reward-driven tasks [38,69,45,34,46,47,19,64,67,23,65,63,61,58,59,62,60,26,10]. It has also been applied to AutoML search [71].…”
Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, the existing systems often have a very small search space for feature preprocessing with the same preprocessing pipeline applied to all the numerical features. This may result in sub-optimal performance since different datasets often have various feature characteristics, and features within a dataset may also have their own preprocessing preferences. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with more features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters such that the search space can be significantly reduced by using the same preprocessing pipeline for the features within a cluster. To this end, we propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines, where the upper-level search optimizes the feature clustering to enable better pipelines built upon the clusters, and the lower-level search optimizes the pipeline given a specific cluster assignment. We instantiate this idea with a deep clustering network that is trained with reinforcement learning at the upper level, and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
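The hierarchical idea in this abstract can be sketched with ordinary scikit-learn pieces: cluster the features, then search one pipeline per cluster instead of per feature. The clustering below is a plain KMeans over per-feature statistics standing in for the paper's RL-trained deep clustering network, and `score_fn`, the candidate pipeline list, and the trial budget are all hypothetical.

```python
import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

CANDIDATES = [StandardScaler, MinMaxScaler, QuantileTransformer]

def cluster_then_search(X, score_fn, n_clusters=3, trials=20):
    """X: 2-D numpy array of numerical features; score_fn scores a transformed X."""
    # Upper level: group features by simple statistics (stand-in for deep clustering).
    stats = np.stack([X.mean(axis=0), X.std(axis=0)], axis=1)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stats)

    # Lower level: random search over one preprocessing pipeline per cluster.
    best_choice, best_score = None, -np.inf
    for _ in range(trials):
        choice = {c: random.choice(CANDIDATES) for c in range(n_clusters)}
        Xt = X.copy()
        for c, Prep in choice.items():
            cols = clusters == c
            if not np.any(cols):
                continue
            Xt[:, cols] = Prep().fit_transform(X[:, cols])
        score = score_fn(Xt)    # e.g. cross-validated downstream model accuracy
        if score > best_score:
            best_choice, best_score = choice, score
    return clusters, best_choice
```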