Energy-efficient and Robust Cumulative Training with Net2Net Transformation

Feng, Aosong; Panda, Priyadarshini

doi:10.1109/ijcnn48605.2020.9207451

Cited by 4 publications

(5 citation statements)

References 14 publications


“…Most research works that investigate efficient training via neural growth focus on how to initialize new neurons/layers (Aosong and Panda 2020; Li et al 2022; Dong et al 2020). An early study on new neuron initialization employs the random initialization (Istrate et al 2018).…”

Section: Related Work, Training Acceleration Via Neural Growth (mentioning)

confidence: 99%

When to Grow? A Fitting Risk-Aware Policy for Layer Growing in Deep Neural Networks

Wu, Wang, Malepathirana et al. 2024, AAAI


Neural growth is the process of growing a small neural network to a large network and has been utilized to accelerate the training of deep neural networks. One crucial aspect of neural growth is determining the optimal growth timing. However, few studies investigate this systematically. Our study reveals that neural growth inherently exhibits a regularization effect, whose intensity is influenced by the chosen policy for growth timing. While this regularization effect may mitigate the overfitting risk of the model, it may lead to a notable accuracy drop when the model underfits. Yet, current approaches have not addressed this issue due to their lack of consideration of the regularization effect from neural growth. Motivated by these findings, we propose an under/over fitting risk-aware growth timing policy, which automatically adjusts the growth timing informed by the level of potential under/overfitting risks to address both risks. Comprehensive experiments conducted using CIFAR-10/100 and ImageNet datasets show that the proposed policy achieves accuracy improvements of up to 1.3% in models prone to underfitting while achieving similar accuracies in models suffering from overfitting compared to the existing methods.


“…However, Net2Net randomly selects neurons to be split, and subsequent work [25] addresses this issue by employing a functional steepest-descent approach to determine the optimal subset of neurons for splitting. The pruning technique [26] has also been employed for reusable neural networks [27]. In addition to [27], another notable study introduces the concept of hierarchical pre-training.…”

Section: Related Work (mentioning)

confidence: 99%

“…The pruning technique [26] has also been employed for reusable neural networks [27]. In addition to [27], another notable study introduces the concept of hierarchical pre-training. This approach effectively reduces both the time required for pre-training and enhances overall performance by leveraging an already pre-trained vision model as an initialization step in the pre-training process.…”

Section: Related Work (mentioning)

confidence: 99%

Leveraging Neighbor Attention Initialization (NAI) for Efficient Training of Pretrained LLMs

Tan, Zhang 2024, Electronics


In the realm of pretrained language models (PLMs), the exponential increase in computational resources and time required for training as model sizes expand presents a significant challenge. This paper proposes an innovative approach named neighbor attention initialization (NAI) to expedite the training process of larger PLMs by leveraging smaller PLMs through parameter initialization. Our methodology hinges on the hypothesis that smaller PLMs, having already learned fundamental language structures and patterns, can provide a robust foundational knowledge base for larger models, which is called function preserving. Specifically, we present a comprehensive framework detailing the process of transferring learned features on transformer-based language models mainly using the neighbor attention head and neighbor layer. We conducted experiments in GPT-2 and demonstrated that our method yields considerable savings in training costs compared to standard approaches, including learning from scratch and bert2BERT, indicating a notable improvement in training efficiency for large PLMs.


“…To handle this problem, some works (Wu et al, , 2020b; Wang et al, 2019b; Wu et al, 2020a) leverage a functional steepest descent idea to decide the optimal subset of neurons to be split. The pruning technique (Han et al, 2015) is also introduced for reusable neural networks (Feng and Panda, 2020). Recently, hierarchical pre-training is proposed by Feng and Panda (2020), which saves training time and improves performance by initializing the pretraining process with an existing pre-trained vision model.…”

Section: Related Work (mentioning)

confidence: 99%

“…The pruning technique (Han et al, 2015) is also introduced for reusable neural networks (Feng and Panda, 2020). Recently, hierarchical pre-training is proposed by Feng and Panda (2020), which saves training time and improves performance by initializing the pretraining process with an existing pre-trained vision model. In this paper, we study the reusable pre-trained language model and propose a new method, bert2BERT to accelerate the pre-training of BERT and GPT.…”

Section: Related Work (mentioning)

confidence: 99%

bert2BERT: Towards Reusable Pretrained Language Models

Chen, Yin, Shang et al. 2021, Preprint


In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources and most of the models are trained from scratch without reusing the existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model (e.g., BERT BASE) to a large model (e.g., BERT LARGE) through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving (Chen et al., 2016) on Transformer-based language model, and further improve it by proposing advanced knowledge for large model's initialization. In addition, a two-stage pre-training method is proposed to further accelerate the training process. We did extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method can save a significant amount of training cost compared with baselines including learning from scratch, StackBERT (Gong et al., 2019) and MSLT (Yang et al., 2020); (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% computational cost of pre-training BERT BASE and GPT BASE by reusing the models of almost their half sizes. The source code will be publicly available upon publication.
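The function-preserving initialization that this abstract builds on (Chen et al., 2016, the Net2Net transformation named in the indexed paper's title) can be illustrated with a minimal sketch. It widens the hidden layer of a two-layer ReLU network by copying randomly chosen units and dividing their outgoing weights by the number of copies, so the widened network computes the same output. The function names `forward` and `net2wider`, the list-of-lists weight layout, and the toy sizes are assumptions made for illustration; this is not code from any of the cited papers and does not cover the Transformer-specific extensions in bert2BERT.

```python
import random


def forward(x, w1, b1, w2, b2):
    """Two-layer MLP: relu(x @ w1 + b1) @ w2 + b2, with w1[input][hidden], w2[hidden][output]."""
    hidden = [
        max(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j], 0.0)
        for j in range(len(b1))
    ]
    return [
        sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]


def net2wider(w1, b1, w2, new_width, rng):
    """Widen the hidden layer to new_width units without changing the network's output.

    Illustrative Net2WiderNet-style sketch: every added unit is a copy of a
    randomly chosen existing unit, and the outgoing weights of each original
    unit are split evenly among its copies.
    """
    old_width = len(b1)
    if new_width < old_width:
        raise ValueError("new_width must be >= the current hidden width")

    # mapping[j] = index of the original unit that new unit j copies.
    mapping = list(range(old_width)) + [
        rng.randrange(old_width) for _ in range(new_width - old_width)
    ]
    # How many copies of each original unit exist after widening.
    counts = [mapping.count(j) for j in range(old_width)]

    w1_new = [[row[j] for j in mapping] for row in w1]          # copy incoming weights
    b1_new = [b1[j] for j in mapping]                           # copy biases
    w2_new = [[w / counts[j] for w in w2[j]] for j in mapping]  # split outgoing weights
    return w1_new, b1_new, w2_new


# Toy usage: widen a 4-3-2 network to 4-5-2 and compare outputs on one input.
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [rng.uniform(-1, 1) for _ in range(3)]
w2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b2 = [rng.uniform(-1, 1) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(4)]

w1_wide, b1_wide, w2_wide = net2wider(w1, b1, w2, new_width=5, rng=rng)
y_small = forward(x, w1, b1, w2, b2)
y_wide = forward(x, w1_wide, b1_wide, w2_wide, b2)
```

In this sketch `y_small` and `y_wide` agree up to floating-point rounding, which is the sense in which the larger model starts from the smaller model's function. The random choice of which units to copy is also the behavior that one of the citation statements above describes later work as replacing with a steepest-descent selection.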
