2022
DOI: 10.48550/arxiv.2204.12184
Preprint

SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach

Abstract: We present SkillNet-NLG, a sparsely activated approach that handles many natural language generation tasks with one model. Different from traditional dense models that always activate all the parameters, SkillNet-NLG selectively activates relevant parts of the parameters to accomplish a task, where the relevance is controlled by a set of predefined skills. The strength of such model design is that it provides an opportunity to precisely adapt relevant skills to learn new tasks effectively. We evaluate on Chine…
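
To make the abstract's idea concrete, here is a minimal sketch (in PyTorch) of skill-conditioned sparse activation: each predefined skill owns a feed-forward expert, and a task evaluates only the experts for its relevant skills while the rest stay inactive. The class name, the skill names, and the averaging of expert outputs are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SkillSparseFFN(nn.Module):
    """Illustrative sketch: each predefined skill owns one feed-forward
    expert; a task runs only the experts for its relevant skills and
    averages their outputs, leaving all other parameters inactive."""

    def __init__(self, d_model, d_ff, skills):
        super().__init__()
        self.experts = nn.ModuleDict({
            skill: nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
            for skill in skills
        })

    def forward(self, x, active_skills):
        # Only the experts named in `active_skills` are evaluated.
        outputs = [self.experts[s](x) for s in active_skills]
        return torch.stack(outputs, dim=0).mean(dim=0)


# Hypothetical usage: a task predefines which skills are relevant.
layer = SkillSparseFFN(d_model=16, d_ff=32,
                       skills=["open_text", "conversation", "structured_data"])
x = torch.randn(2, 5, 16)
y = layer(x, active_skills=["open_text", "conversation"])
```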

Cited by 1 publication (2 citation statements)
References 10 publications
“…Shazeer et al. [28] used MoE, also applying it within recurrent neural networks, to increase the number of model parameters, and proposed a gating network that selects experts during training depending on the inputs. Since different tasks require different skills, Liao et al. [29] and Zhang et al. [30] replaced the MLP in the Transformer with multiple experts so that different kinds of knowledge can be captured. Owing to the particular structure of graph data, Zhou et al. [31] explored MoE in graph neural networks to address the over-smoothing problem, and their experiments showed that MoE has its greatest potential on very large datasets.…”
Section: Mixture of Experts Network (citation type: mentioning)
confidence: 99%
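
The gating mechanism mentioned in the statement above can be sketched as noisy top-K gating in the spirit of Shazeer et al. [28]: a learned gate scores every expert per input, learned noise is added, and only the top-K experts keep non-zero softmax weights. Module and parameter names below are assumptions for illustration, not code from any of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    """Illustrative noisy top-K gate: score experts, add learned noise,
    keep only the K largest logits, and renormalise with a softmax so
    all other experts receive exactly zero weight."""

    def __init__(self, d_model, num_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        logits = clean + torch.randn_like(clean) * noise_std
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # Everything outside the top-K is masked out before the softmax.
        masked = torch.full_like(logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_vals)
        return torch.softmax(masked, dim=-1)  # sparse mixture weights


gate = NoisyTopKGate(d_model=16, num_experts=8, k=2)
weights = gate(torch.randn(2, 5, 16))  # zeros outside the chosen experts
```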
“…Such a strategy transforms the structure of a multilayer network into a modular construction. Structurally, an MoE has two parts: (1) an expert network, in which each expert is a feed-forward neural network that learns different knowledge [28], [29], [30], [37]; and (2) a gating network, which usually adopts either non-sparse gating [38] or noisy top-K gating [28]. We use non-sparse gating because it is superior [31] when the number of experts is small.…”
Section: Mixture of Experts Layer (citation type: mentioning)
confidence: 99%
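
As an illustration of the two-part structure described in the statement above, the sketch below wires an expert network and a non-sparse (dense softmax) gate together: every expert contributes a weighted share of the output rather than being dropped, matching the choice the citing authors make when the number of experts is small. It is a sketch under assumed names and dimensions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn


class DenseGatedMoE(nn.Module):
    """Illustrative MoE layer: (1) an expert network of small feed-forward
    experts and (2) a non-sparse gate whose softmax weights combine the
    outputs of all experts."""

    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)  # (..., num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_model, num_experts)
        # Weighted sum over experts; no expert is discarded.
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)


moe = DenseGatedMoE(d_model=16, d_ff=32, num_experts=4)
out = moe(torch.randn(2, 5, 16))
```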