Abstract: Embedding learning is an important technique in deep recommendation models for mapping categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which becomes the storage and efficiency bottleneck. Distributed training solutions have been adopted to partition the embedding tables across multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge of distributed systems…
“…ScratchPipe (Kwon & Rhu, 2022) and RecShard (Sethi et al., 2022) tackle the problem of embedding access latency in hybrid CPU-GPU training systems. ScratchPipe addresses the problem through a run-ahead GPU-side cache that attempts to have all embedding accesses hit in local GPU HBM, while RecShard uses a mixed-integer linear program and the per-embedding-table distributions to statically place the most frequently accessed rows in GPU HBM. AutoShard (Zha et al., 2022) focuses on the sharding of embedding tables in a multi-GPU-only training system, and uses deep reinforcement learning and a neural-network-based cost model to make its placement decisions.…”
Section: Related Work (mentioning, confidence: 99%)
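The static-placement idea RecShard builds on can be illustrated with a much simpler greedy stand-in. The sketch below is not RecShard's mixed-integer linear program; the row budget and the Zipf-like access counts are hypothetical, and the point is only that a skewed per-table access distribution lets the hottest rows be pinned in GPU HBM while the long tail stays in host memory.

```python
import numpy as np

def greedy_hot_row_placement(access_counts, hbm_row_budget):
    """Pin the most frequently accessed embedding rows in GPU HBM,
    leaving the long tail in host (CPU) memory."""
    order = np.argsort(access_counts)[::-1]   # hottest rows first
    hbm_rows = order[:hbm_row_budget]         # served from GPU HBM
    host_rows = order[hbm_row_budget:]        # served from host memory
    return hbm_rows, host_rows

# Hypothetical skewed access distribution for one embedding table.
counts = np.random.zipf(1.2, size=10_000)
hot, cold = greedy_hot_row_placement(counts, hbm_row_budget=1_000)
```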
“…Furthermore, while the sharding problem has been increasingly explored in recent works due to its importance (Adnan et al., 2021; Lui et al., 2021; Sethi et al., 2022; Zha et al., 2022), they all, to our knowledge, assume that the embedding tables to be sharded are either one-hot, meaning at most one embedding row per table will be accessed per training sample, or sum-pooled, meaning all embedding rows accessed within a table by a training sample will be aggregated via summation before proceeding through the model.…”
Section: Introduction (mentioning, confidence: 99%)
“…Embedding Pooling. A simple and common way to perform pooling is element-wise summation across all gathered embeddings to generate the output pooled embedding, known as sum-pooling (Gupta et al., 2020a; Ke et al., 2020; Kwon & Rhu, 2022; Mudigere et al., 2021; Sethi et al., 2022; Zha et al., 2022; Zhou et al., 2019b; Wilkening et al., 2021). The use of sum-pooling to perform embedding aggregation comes with beneficial properties with respect to the sharding problem.…”
Section: Introduction (mentioning, confidence: 99%)
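For concreteness, the sum-pooling described in the excerpt above can be written in a few lines of PyTorch. The table size, dimension, and row ids below are arbitrary, and `EmbeddingBag` with `mode="sum"` is just one common way to express the lookup-then-sum.

```python
import torch

num_rows, dim = 1000, 16                      # hypothetical table shape
table = torch.nn.EmbeddingBag(num_rows, dim, mode="sum")

row_ids = torch.tensor([3, 42, 7, 999])       # rows accessed by one training sample
offsets = torch.tensor([0])                   # a single sample (one bag)
pooled = table(row_ids, offsets)              # shape: (1, dim)

# Equivalent manual pooling: gather the rows, then element-wise sum.
manual = table.weight[row_ids].sum(dim=0, keepdim=True)
assert torch.allclose(pooled, manual)
```

Because summation is associative and commutative, each shard can reduce the rows it owns locally and communicate only a single partial sum per sample, which is presumably the kind of sharding-friendly property the excerpt alludes to.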
“…To our knowledge, however, prior works that have developed novel solutions to the DLRM sharding problem have done so under the assumption that the gathered embeddings will all be sum-pooled before being consumed by the following model layers (Adnan et al., 2021; Lui et al., 2021; Sethi et al., 2022; Zha et al., 2022). Unfortunately, these benefits do not directly apply to sequence-based pooling methods, where each embedding interacts with every other embedding in the sequence, requiring the full sequence to be present on the requesting GPU to perform the operation.…”
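To see why this breaks the partial-sum trick, consider a self-attention-style pooling over a gathered sequence. The shapes below are arbitrary and the code is only a generic sketch of pairwise interaction, not any specific production model.

```python
import torch

seq_len, dim = 64, 16
seq = torch.randn(seq_len, dim)     # the user's gathered embedding rows

scores = seq @ seq.T / dim ** 0.5   # every row interacts with every other row
weights = torch.softmax(scores, dim=-1)
attended = weights @ seq            # shape: (seq_len, dim)
```

Since every output row depends on all input rows, no shard can pre-reduce its portion; the full sequence has to be materialized on the GPU that consumes it.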
Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs that show great improvements over their prior sum-pooling-based counterparts at capturing users' long-term interests. These improvements come at immense system cost, however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard reduced overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of nearly 6x over the prior state-of-the-art approach.
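The abstract's per-row tiering can be pictured with a toy assignment routine. This is not FlexShard's actual algorithm: the replication budget, GPU count, and access distribution are invented, and the sketch only captures the idea of replicating the hottest rows on every device (so their reads stay local) while sharding the tail.

```python
import numpy as np

def tiered_row_assignment(access_probs, replicate_budget, num_gpus):
    order = np.argsort(access_probs)[::-1]
    replicated = set(order[:replicate_budget].tolist())           # copied to every GPU
    tail = order[replicate_budget:]
    sharded = {int(r): i % num_gpus for i, r in enumerate(tail)}  # round-robin owners
    return replicated, sharded

# Hypothetical skewed per-row access probabilities.
probs = np.random.zipf(1.1, size=5_000).astype(float)
probs /= probs.sum()
replicated, owner_of = tiered_row_assignment(probs, replicate_budget=500, num_gpus=8)
```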
“…Reinforcement Learning. Reinforcement learning has shown strong performance in many reward-driven tasks [38,69,45,34,46,47,19,64,67,23,65,63,61,58,59,62,60,26,10]. It has also been applied to AutoML search [71].…”
Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, the existing systems often have a very small search space for feature preprocessing with the same preprocessing pipeline applied to all the numerical features. This may result in sub-optimal performance since different datasets often have various feature characteristics, and features within a dataset may also have their own preprocessing preferences. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with more features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters such that the search space can be significantly reduced by using the same preprocessing pipeline for the features within a cluster. To this end, we propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines, where the upper-level search optimizes the feature clustering to enable better pipelines built upon the clusters, and the lower-level search optimizes the pipeline given a specific cluster assignment. We instantiate this idea with a deep clustering network that is trained with reinforcement learning at the upper level, and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
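The hierarchical idea in this abstract can be sketched with ordinary scikit-learn pieces: cluster the features, then search one pipeline per cluster instead of per feature. The clustering below is a plain KMeans over per-feature statistics standing in for the paper's RL-trained deep clustering network, and `score_fn`, the candidate pipeline list, and the trial budget are all hypothetical.

```python
import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

CANDIDATES = [StandardScaler, MinMaxScaler, QuantileTransformer]

def cluster_then_search(X, score_fn, n_clusters=3, trials=20):
    """X: 2-D numpy array of numerical features; score_fn scores a transformed X."""
    # Upper level: group features by simple statistics (stand-in for deep clustering).
    stats = np.stack([X.mean(axis=0), X.std(axis=0)], axis=1)
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stats)

    # Lower level: random search over one preprocessing pipeline per cluster.
    best_choice, best_score = None, -np.inf
    for _ in range(trials):
        choice = {c: random.choice(CANDIDATES) for c in range(n_clusters)}
        Xt = X.copy()
        for c, Prep in choice.items():
            cols = clusters == c
            if not np.any(cols):
                continue
            Xt[:, cols] = Prep().fit_transform(X[:, cols])
        score = score_fn(Xt)    # e.g. cross-validated downstream model accuracy
        if score > best_score:
            best_choice, best_score = choice, score
    return clusters, best_choice
```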