In quest for the thinking machine

Lie, Martin Forsberg

doi:10.22541/au.158145130.06115167

Cited by 1 publication

(4 citation statements)

References 0 publications

“…In Figure 3 we highlight the potential realized gains with unstructured weight sparsity on specialized hardware for deep learning such as the Cerebras CS-2. This figure was regenerated based on the plot in (Lie, 2021).…”

Section: Unstructured Sparsity On Specialized Hardware Accelerators

mentioning

confidence: 99%

“…In this work, we show how we can leverage weight sparsity to reduce training FLOPs, and then recover the lost representational capacity by shifting to dense weight matrices when fine-tuning on downstream tasks. In addition, while specialized software kernels have been developed to achieve inference acceleration with unstructured sparsity NeuralMagic, 2021;Elsen et al, 2019;Ashby et al, 2019;Wang, 2021), recent work has shown that we can realize the gains of unstructured weight sparsity on specialized hardware (e.g., Cerebras CS-2 (Lie, 2022;2021)) when training LLMs. For example, Lie (2021) shows the measured speedup for a matrix multiplication kernel w.r.t to the sparsity level on a single GPT-3 layer (see Appendix C for more details).…”

Section: Introduction

mentioning

confidence: 99%

“…In addition, while specialized software kernels have been developed to achieve inference acceleration with unstructured sparsity NeuralMagic, 2021;Elsen et al, 2019;Ashby et al, 2019;Wang, 2021), recent work has shown that we can realize the gains of unstructured weight sparsity on specialized hardware (e.g., Cerebras CS-2 (Lie, 2022;2021)) when training LLMs. For example, Lie (2021) shows the measured speedup for a matrix multiplication kernel w.r.t to the sparsity level on a single GPT-3 layer (see Appendix C for more details). Therefore, as unstructured sparse training techniques continue to become co-designed with the hardware, we can expect the FLOP reduction to translate into performance and wall-clock speedups.…”

Section: Introduction

mentioning

confidence: 99%

“…Figure 3: Measured speedup versus theoretical speedup at varying sparsity levels for a GPT-3 layer 12k × 12k matrix multiplication (MatMul) (Lie, 2021).…”

mentioning

confidence: 99%

SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

Thangarasa, Gupta, Marshall et al. 2023

Preprint

The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of directly training on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization, etc.). Scaling the model and dataset size has helped improve the performance of LLMs, but unfortunately, this also leads to highly prohibitive computational costs. Pre-training LLMs often requires orders of magnitude more FLOPs than fine-tuning, and the model capacity often remains the same between the two phases. To achieve training efficiency with respect to training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recover the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction to train large GPT models at a fraction of the training FLOPs using weight sparsity while retaining the benefits of pre-trained textual representations for downstream tasks.
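The two phases described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the static random mask, the plain SGD update, the helper names (`make_mask`, `sgd_step`, `ideal_matmul_speedup`), and the 1 / (1 - sparsity) formula for an idealized matrix-multiplication speedup are all assumptions made for illustration only.

```python
import random

def make_mask(n_weights, sparsity, seed=0):
    """Hypothetical static unstructured mask: 1 = trainable, 0 = held at zero."""
    rng = random.Random(seed)
    n_zero = round(n_weights * sparsity)
    mask = [0] * n_zero + [1] * (n_weights - n_zero)
    rng.shuffle(mask)  # unstructured: zeros land at arbitrary positions
    return mask

def sgd_step(weights, grads, lr, mask=None):
    """One plain SGD update. With a mask, masked-out weights stay at zero."""
    if mask is None:
        return [w - lr * g for w, g in zip(weights, grads)]
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

def ideal_matmul_speedup(sparsity):
    """Assumed idealized speedup if every zero weight is skipped."""
    return 1.0 / (1.0 - sparsity)

# Sparse pre-training: only the unmasked 25% of weights are updated.
mask = make_mask(8, 0.75)
weights = [w * m for w, m in zip([0.5] * 8, mask)]
grads = [1.0] * 8  # placeholder gradients, not from a real model
sparse_w = sgd_step(weights, grads, lr=0.1, mask=mask)

# Dense fine-tuning: the mask is dropped, so previously zeroed weights can learn.
dense_w = sgd_step(sparse_w, grads, lr=0.1)

print(sum(1 for w in sparse_w if w != 0), "nonzero weights after the sparse step")
print(sum(1 for w in dense_w if w != 0), "nonzero weights after the dense step")
print(ideal_matmul_speedup(0.75))
```

Under this idealized formula, 75% sparsity corresponds to a 4x reduction for a single sparsified matrix multiplication, which is larger than the 2.5x whole-model figure reported in the abstract. One plausible reading, not confirmed by this excerpt, is that only part of the model's computation is sparsified.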