Selfish Sparse RNN Training

Liu, Shiwei; Mocanu, Decebal Constantin; Pei, Yulong; Pechenizkiy, Mykola

doi:10.48550/arxiv.2101.09048

Cited by 6 publications

(5 citation statements)

References 32 publications

“…[23,24] first introduced the Sparse Evolutionary Training (SET) technique [23], reaching superior performance compared to training with fixed sparse connectivity [72,27]. [28][29][30] leverages "weight reallocation" to improve performance of obtained sparse subnetworks. Furthermore, gradient information from the backward pass is utilized to guide the update of the dynamic sparse connectivity [29,25], which produces substantial performance gains.…”

Section: Related Work (mentioning)

confidence: 99%

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Chen, Gan, Liu et al. 2021

Preprint

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity, by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

“…DST was first proposed in (Mocanu et al, 2018). Following works improve DST by parameter redistribution (Mostafa & Wang, 2019;Liu et al, 2021a) and gradient-based methods (Dettmers & Zettlemoyer, 2019;Evci et al, 2020). A recent work (Liu et al, 2021b) suggested that successful DST needed to explore the training of possible connections sufficiently.…”

Section: Pruning In the Early Training Stage (mentioning)

confidence: 99%

The Elastic Lottery Ticket Hypothesis

Chen, Wang, Gan et al. 2021

Preprint

Lottery Ticket Hypothesis raises keen attention to identifying sparse trainable subnetworks, or winning tickets, at the initialization (or early stage) of training, which can be trained in isolation to achieve similar or even better performance compared to the full models. Despite many efforts being made, the most effective method to identify such winning tickets is still Iterative Magnitude-based Pruning (IMP), which is computationally expensive and has to be run thoroughly for every different network. A natural question that comes in is: can we "transform" the winning ticket found in one network to another with a different architecture, yielding a winning ticket for the latter at the beginning, without re-doing the expensive IMP? Answering this question is not only practically relevant for efficient "once-for-all" winning ticket finding, but also theoretically appealing for uncovering inherently scalable sparse patterns in networks. We conduct extensive experiments on CIFAR-10 and ImageNet, and propose a variety of strategies to tweak the winning tickets found from different networks of the same model family (e.g., ResNets). Based on these results, we articulate the Elastic Lottery Ticket Hypothesis (E-LTH): by mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket could be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family, whose performance is nearly the same competitive as the latter's winning ticket directly found by IMP. We have also thoroughly compared E-LTH with pruning-at-initialization and dynamic sparse training methods, and discuss the generalizability of E-LTH to different model families, layer types, and even across datasets. Our codes are publicly available at GitHub.

“…These methods are all classified as dense-to-sparse training as they start from a dense network. Dynamic Sparse Training (DST) [43,3,47,8,9,35,34,25] is another class of methods that prune models during training. The key factor of DST is that it starts from a random initialized sparse network and optimizes the sparse topology as well as the weights simultaneously during training (sparse-to-sparse training).…”

Section: Related Work (mentioning)

confidence: 99%

“…We consequently propose a parameter-efficient method to regenerate new connections during the gradual pruning process. Different from the existing works for pruning understanding which mainly focus on dense-to-sparse training [41] (training a dense model and prune it to the target sparsity), we also consider sparse-to-sparse training (training a sparse model yet adaptively re-creating the sparsity pattern) which recently has received an upsurge of interest in machine learning [43,3,9,47,8,36,35].…”

Section: Introduction (mentioning)

confidence: 99%

Sparse Training via Boosting Pruning Plasticity with Neuroregeneration

Liu, Chen et al. 2021

Preprint

Self Cite

Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of attention currently on post-training pruning (iterative magnitude pruning), and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter category of methods usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys the training/inference efficiency and the comparable performance, temporarily, has been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. Based on the insights from pruning plasticity, we design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST). Both of them advance state of the art. Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet. We will release all codes.
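
The statements and abstracts on this page repeatedly describe sparse-to-sparse training as pruning some connections and regenerating the same number elsewhere. As a rough illustration only, one such topology update might look like the sketch below. It assumes magnitude-based pruning and uniformly random regrowth with zero-initialized new weights; the function name `prune_and_regrow` and its parameters are hypothetical and are not taken from any of the cited papers or their code, which use their own (for example gradient-based) criteria.

```python
import numpy as np


def prune_and_regrow(weights, mask, fraction, rng):
    """One illustrative topology update on a single layer (hypothetical helper).

    Drops the `fraction` of active connections with the smallest magnitude,
    then activates the same number of previously inactive positions chosen
    uniformly at random, so the number of active connections is unchanged.
    """
    shape = np.shape(weights)
    flat_w = np.array(weights, dtype=float).ravel()  # work on copies
    flat_m = np.array(mask, dtype=bool).ravel()

    active = np.flatnonzero(flat_m)
    inactive = np.flatnonzero(~flat_m)  # regrowth candidates
    k = min(int(fraction * active.size), inactive.size)
    if k == 0:
        return flat_w.reshape(shape), flat_m.reshape(shape)

    # Prune: the k active weights with the smallest absolute value.
    order = np.argsort(np.abs(flat_w[active]))
    pruned = active[order[:k]]
    flat_m[pruned] = False
    flat_w[pruned] = 0.0

    # Regrow: k random positions that were inactive before this update.
    grown = rng.choice(inactive, size=k, replace=False)
    flat_m[grown] = True
    flat_w[grown] = 0.0  # assumption: new connections start at zero

    return flat_w.reshape(shape), flat_m.reshape(shape)
```

In a training loop, a step like this would typically be called every few hundred iterations, with the mask re-applied to the weights after each optimizer step; the schedule and the regrowth criterion are exactly where the cited methods differ.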
