This chapter presents the main architecture types of attention-based language models, which describe the distribution of tokens in texts. Autoencoders such as BERT receive an input text and produce a contextual embedding for each token. Autoregressive language models such as GPT receive a subsequence of tokens as input; they produce a contextual embedding for each token and predict the next token, so that all tokens of a text can be generated successively. Transformer Encoder-Decoders translate an input sequence into another sequence, e.g. for language translation: first an autoencoder generates a contextual embedding for each input token; then these embeddings serve as input to an autoregressive language model, which sequentially generates the tokens of the output sequence. These models are usually pre-trained on a large general training set and often fine-tuned for a specific task; they are therefore collectively called Pre-trained Language Models (PLMs). When the number of parameters of these models becomes large, they can often be instructed by prompts and are then called Foundation Models. Subsequent sections describe the optimization and regularization methods used for training. Finally, we analyze the uncertainty of model predictions and how predictions can be explained.
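The successive token generation performed by autoregressive language models can be sketched as a simple decoding loop. The following is a minimal illustration only: the bigram table stands in for a trained model, and all names (`next_token`, `generate`) are hypothetical, not from the chapter.

```python
# Hypothetical sketch of autoregressive generation: a toy bigram
# table stands in for a trained language model's next-token predictor.

def next_token(tokens):
    # Toy "model": maps the last token to its most likely successor.
    bigram = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "<e>"}
    return bigram.get(tokens[-1], "<e>")

def generate(max_len=10):
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_len):
        tok = next_token(tokens)  # predict next token from the prefix
        if tok == "<e>":          # stop at end-of-sequence marker
            break
        tokens.append(tok)        # the prediction becomes part of the input
    return tokens[1:]

print(generate())  # ['the', 'cat', 'sat']
```

A real model would replace the bigram table with a neural network that conditions on the entire prefix via contextual embeddings, but the loop structure, feeding each predicted token back as input, is the same.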