2021
DOI: 10.48550/arxiv.2110.08246
Preprint

Tricks for Training Sparse Translation Models

Abstract: Multi-task learning with an unbalanced data distribution skews model learning towards high resource tasks, especially when model capacity is fixed and fully shared across all tasks. Sparse scaling architectures, such as BASE-Layers, provide flexible mechanisms for different tasks to have a variable number of parameters, which can be useful to counterbalance skewed data distributions. We find that sparse architectures for multilingual machine translation can perform poorly out of the box, and propose two s…
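
As a rough illustration of the sparse layers the abstract refers to, the sketch below shows a generic top-1 routed mixture-of-experts block in PyTorch. This is a minimal sketch only: the class `TopOneMoELayer` and its greedy argmax routing are illustrative assumptions, not the BASE-Layers algorithm, which instead solves a balanced assignment problem when dispatching tokens to experts.

```python
# Minimal sketch of a top-1 routed mixture-of-experts layer (illustrative only;
# not the BASE-Layers implementation, which uses a balanced-assignment solver).
import torch
import torch.nn as nn


class TopOneMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # token-to-expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)   # routing distribution
        top_p, top_idx = probs.max(dim=-1)              # greedy top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # weight by the routing probability so the router receives gradients
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```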

Cited by 2 publications (3 citation statements)
References 15 publications
“…Because the self-reinforcing circle no longer exists, we can prove that the router will treat different experts in the same M_k almost equally and dispatch almost the same amount of data to them (see Section E.2 in the Appendix for details). This load imbalance issue can be further avoided by adding a load balancing loss (Eigen et al., 2013; Shazeer et al., 2017; Fedus et al., 2021), or an advanced MoE layer structure such as BASE Layers (Lewis et al., 2021; Dua et al., 2021) and Hash Layers (Roller et al., 2021). Road Map: Here we provide the road map of the proof of Theorem 4.2; the full proof is presented in Appendix E. The training process can be decomposed into several stages.…”
Section: Overview of Key Techniques
Mentioning confidence: 99%
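
The statement above points to a load-balancing auxiliary loss (Shazeer et al., 2017; Fedus et al., 2021) as one remedy for load imbalance. The following is a minimal sketch of the Switch-Transformer-style balance term, shown only to make the idea concrete; the function name and the exact scaling are illustrative assumptions, not taken from the cited proofs.

```python
# Hedged sketch of a Switch-Transformer-style load-balancing auxiliary loss:
# it is minimized when the fraction of tokens routed to each expert matches
# the average routing probability mass, i.e. when load is uniform.
import torch


def load_balancing_loss(router_logits: torch.Tensor, n_experts: int) -> torch.Tensor:
    # router_logits: (n_tokens, n_experts)
    probs = torch.softmax(router_logits, dim=-1)
    top_idx = probs.argmax(dim=-1)                     # greedy top-1 assignment
    # f_e: fraction of tokens dispatched to expert e
    frac_tokens = torch.bincount(top_idx, minlength=n_experts).float()
    frac_tokens = frac_tokens / router_logits.shape[0]
    # P_e: mean routing probability assigned to expert e
    mean_probs = probs.mean(dim=0)
    # Scaling by n_experts gives a loss whose minimum is 1 under
    # perfectly balanced routing.
    return n_experts * torch.sum(frac_tokens * mean_probs)
```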
“…Since ∆_Θ keeps increasing during training, it cannot be bounded if we allow the total number of iterations to go to infinity in Algorithm 1. This is the reason we require early stopping in Theorem 4.2, which we believe can be waived by adding a load balancing loss (Eigen et al., 2013; Shazeer et al., 2017; Fedus et al., 2021), or an advanced MoE layer structure such as BASE Layers (Lewis et al., 2021; Dua et al., 2021) and Hash Layers (Roller et al., 2021).…”
Section: Define ∆
Mentioning confidence: 99%
“…Besides, some work focuses on optimizing training methods for MoE models. Dua et al. (2021) applied a temperature heating mechanism for sparse MoE models on the translation task. Chi et al. (2022) proposed a dimension reduction to estimate the routing scores between tokens and experts on a low-dimensional hyper-sphere.…”
Section: Related Work
Mentioning confidence: 99%
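
The quoted passage notes that Dua et al. (2021), i.e. the paper summarized above, apply a temperature heating mechanism to the router. As a hedged illustration of the general idea only, the sketch below scales the router logits by a temperature that is gradually raised during training; the function names, the linear schedule, and the specific values are assumptions for illustration, not the paper's actual recipe.

```python
# Hedged sketch of temperature-scaled routing: dividing router logits by a
# temperature that is gradually "heated" (increased) flattens the routing
# distribution, pushing expert usage towards uniform.
import torch


def routing_probs(router_logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Higher temperature flattens the distribution over experts;
    # lower temperature sharpens it.
    return torch.softmax(router_logits / temperature, dim=-1)


def temperature_at_step(step: int, heating_steps: int = 10_000,
                        start_temp: float = 1.0, end_temp: float = 4.0) -> float:
    # Assumed linear heating schedule: the temperature is raised over a
    # warm-up window, then held fixed. The schedule shape and the values
    # here are illustrative, not taken from the paper.
    frac = min(step / heating_steps, 1.0)
    return start_temp + frac * (end_temp - start_temp)
```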