2021
DOI: 10.48550/arxiv.2111.09839
Preprint

Training Neural Networks with Fixed Sparse Masks

Abstract: During typical gradient-based training of deep neural networks, all of the model's parameters are updated at each iteration. Recent work has shown that it is possible to update only a small subset of the model's parameters during training, which can alleviate storage and communication requirements. In this paper, we show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations. Our method constructs the mask out of the k parameters with the largest Fisher information.
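The selection criterion described in the abstract can be illustrated with a short sketch: score every parameter by an empirical diagonal Fisher estimate (accumulated squared gradients) and keep the top-k as a fixed binary mask. This is only an illustration of the idea; `model`, `data_loader`, `loss_fn`, and the helper names are assumptions, not code from the paper.

```python
# Hedged sketch: diagonal (empirical) Fisher scores and a top-k binary mask.
import torch

def fisher_scores(model, data_loader, loss_fn, n_batches=32):
    """Accumulate squared gradients as a diagonal Fisher approximation."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    return scores

def topk_mask(scores, keep_fraction=0.002):
    """Keep the `keep_fraction` of parameters with the largest scores."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {n: (s >= threshold).float() for n, s in scores.items()}
```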

Cited by 2 publications (5 citation statements)
References 24 publications
“…FishMask [26] The Fisher is first computed on the training examples and we keep 0.2% or 0.02% of the parameters. Then, these parameters are trained for 1500 steps with a learning rate of 3e-4.…”
Section: Discussion
confidence: 99%
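As a rough companion sketch, a fine-tuning loop under a fixed mask could zero the gradients of unselected parameters before each optimizer step, using the hyperparameters quoted above (1500 steps, learning rate 3e-4). The `masks` dictionary, optimizer choice, and function name are assumptions for illustration, not the cited implementation.

```python
# Hedged sketch: train only the parameters selected by a fixed sparse mask.
import torch

def train_with_mask(model, masks, data_loader, loss_fn,
                    steps=1500, lr=3e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    data_iter = iter(data_loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        # Zero out gradients outside the fixed sparse mask.
        for n, p in model.named_parameters():
            if p.grad is not None:
                p.grad.mul_(masks[n])
        opt.step()
```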
“…Early methods proposed adding adapters [22][23][24], which are small feed-forward networks inserted between the layers in the pre-trained model whose parameters are updated during fine-tuning while the remainder of the pre-trained model is left fixed. Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train [25,26], produce low-rank updates [13], perform optimization in a lower-dimensional subspace [27], add low-rank adapters using hypercomplex multiplication [28], and more. Relatedly, prompt tuning [14] concatenates learned continuous embeddings to the model's input to induce it to perform a task and can be seen as a PEFT method [29].…”
Section: Parameter-efficient Fine-tuning
confidence: 99%
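For context on the adapter approach mentioned in this excerpt, a minimal sketch of an adapter block might look as follows: a small bottleneck feed-forward network with a residual connection, inserted after a frozen sub-layer so that only its parameters are trained. The class name and dimensions are illustrative assumptions, not taken from any specific adapter implementation.

```python
# Illustrative adapter block: down-project, nonlinearity, up-project, residual.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # The residual connection preserves the frozen backbone's output
        # when the adapter's contribution is small.
        return x + self.up(self.act(self.down(x)))
```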
“…We call sparsity the fraction of zeroed weights. This kind of non-adaptive pruning is known to largely hinder learning (Frankle et al. 2021, Sung et al. 2021). In the right panel of figure 2, we report results on sparse binary networks in which we train an MLP with 2 hidden layers of 101 units on the MNIST dataset.…”
Section: Sparse Layers
confidence: 99%
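A minimal sketch of the kind of non-adaptive pruning this excerpt refers to, assuming a fixed random mask applied before training to an MLP with two hidden layers of 101 units (MNIST-sized input); the helper function and its name are hypothetical, not the cited paper's code.

```python
# Hedged sketch: a random mask, fixed before training, zeroes a given
# fraction of each weight matrix (non-adaptive pruning).
import torch
import torch.nn as nn

def random_fixed_masks(model, sparsity=0.9):
    """Return per-weight binary masks zeroing `sparsity` of each matrix."""
    return {n: (torch.rand_like(p) >= sparsity).float()
            for n, p in model.named_parameters() if p.dim() > 1}

mlp = nn.Sequential(
    nn.Linear(784, 101), nn.ReLU(),
    nn.Linear(101, 101), nn.ReLU(),
    nn.Linear(101, 10),
)
masks = random_fixed_masks(mlp, sparsity=0.9)
# Applying the masks to the weights (and to their gradients at every step)
# keeps the pruned connections at zero throughout training.
with torch.no_grad():
    for n, p in mlp.named_parameters():
        if n in masks:
            p.mul_(masks[n])
```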