2021
DOI: 10.48550/arxiv.2112.12731
Preprint

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Abstract: Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Prior work has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 [2] was recently proposed for pre-training large-scale knowledge-enhanced models, and was used to train a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3…

Cited by 13 publications (21 citation statements)
References 65 publications
“…About one year after GPT-3 was announced, a spike in similar model announcements followed. These models were developed by both large and small private organizations from around the world: Jurassic-1-Jumbo [46], AI21 Labs, Israel; Ernie 3.0 Titan [70], Baidu, China; Gopher [56], DeepMind, USA/UK; FLAN [71] & LaMDA [68], Google, USA; Pan Gu [78], Huawei, China; Yuan 1.0 [76], Inspur, China; Megatron Turing NLG [64], Microsoft & NVIDIA, USA; and HyperClova [43], Naver, Korea. This suggests that the economic incentives to build such models, and the prestige incentives to announce them, are quite strong.…”
Section: Large Language Models Are Rapidly Proliferating (mentioning)
confidence: 99%
“…Scaling up the amount of data, compute power, and model parameters of neural networks has recently led to the arrival (and real-world deployment) of capable generative models such as CLIP [55], Ernie 3.0 Titan [70], FLAN [71], Gopher [56], GPT-3 [11], HyperClova [43], Jurassic-1-Jumbo [46], Megatron Turing NLG [64], LaMDA [68], Pan Gu [78], Yuan 1.0 [76], and more. For this class of models, the relationship between scale and model performance is often so predictable that it can be described in a lawful relationship: a scaling law.…”
Section: Introduction (mentioning)
confidence: 99%
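The "scaling law" referred to in the excerpt above is an empirical power-law fit between model scale and loss. A minimal sketch of such a fit is shown below; the constant n_c and exponent alpha are illustrative values in the spirit of Kaplan-style parameter-count laws, not figures taken from this paper or the citing papers.

```python
# Minimal sketch of a parameter-count scaling law of the form L(N) = (N_c / N) ** alpha.
# N_c and alpha are assumed illustrative constants, not values from the cited papers.

def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters under the assumed fit."""
    return (n_c / n_params) ** alpha

if __name__ == "__main__":
    # Predicted loss improves smoothly but slowly as parameter count grows.
    for n in (1e9, 10e9, 260e9):
        print(f"{n:.0e} params -> predicted loss {power_law_loss(n):.3f}")
```

The practical appeal of such fits is that, once estimated on smaller models, they extrapolate well enough to inform decisions about training much larger ones.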
“…In recent years, large-scale neural networks have excelled in many machine learning tasks, such as natural language processing (NLP) [1,2,3] and computer vision (CV) [4]. At the same time, the parameter scale of these models has expanded to hundreds of billions of parameters, such as the GPT-3 model with 175B parameters [2,5], Ernie 3.0 Titan with 260B parameters [6] and Megatron-Turing NLG with 530B parameters [7]. However, these densely activated models require abundant computing resources and massive training time.…”
Section: Introduction (mentioning)
confidence: 99%
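The point about densely activated models needing abundant compute can be made concrete with the common rule of thumb that training FLOPs are roughly 6 × parameters × training tokens. This is an approximation, not a figure reported in the excerpt, and the token counts below are assumed round numbers for illustration only.

```python
# Back-of-the-envelope training compute for densely activated models,
# using the common approximation: total training FLOPs ~= 6 * params * tokens.
# Token counts are assumed round numbers, not values from the cited papers.

def approx_train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

models = [
    ("GPT-3 (175B params)", 175e9, 300e9),                  # ~300B tokens assumed
    ("Megatron-Turing NLG (530B params)", 530e9, 270e9),    # ~270B tokens assumed
]

for name, params, tokens in models:
    print(f"{name}: ~{approx_train_flops(params, tokens):.2e} training FLOPs")
```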
“…The rapid evolution of pre-trained models is toward involving more and higher-quality data, larger numbers of parameters, and stronger computing power [1,2,3,4,5,6,7,8,9,10]. This naturally relies on ever-increasing compute cost and time, e.g.…”
Section: Introduction (mentioning)
confidence: 99%
“…training the BERT-large 345 million parameter model took 6.16 PF-days, while training the GPT-3 175 billion parameter model consumed 3.64E+03 PF-days [3]. On the one hand, larger models tend to obtain better performance, especially on few- and zero-shot learning tasks [6,10], which can greatly empower the AI industry. On the other hand, the increasing demand for compute brings about challenges and triggers more exploration of advanced distributed training techniques as well as the optimization of large-scale resource scheduling and allocation strategies.…”
Section: Introduction (mentioning)
confidence: 99%
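For readers unfamiliar with the PF-days unit in the excerpt above: one petaflop/s-day is 10^15 floating-point operations per second sustained for a full day, i.e. 8.64e19 FLOPs. A quick conversion of the quoted figures:

```python
# Convert the quoted petaflop/s-day (PF-day) figures into total FLOPs.
# 1 PF-day = 1e15 FLOP/s * 86400 s = 8.64e19 FLOPs.
PF_DAY_FLOPS = 1e15 * 86400

quoted = [
    ("BERT-large (345M params)", 6.16),
    ("GPT-3 (175B params)", 3.64e3),
]

for model, pf_days in quoted:
    print(f"{model}: {pf_days:g} PF-days ~= {pf_days * PF_DAY_FLOPS:.2e} FLOPs")
```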