Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
DOI: 10.18653/v1/2020.emnlp-main.696

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Abstract: Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain […]
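
The central idea named in the abstract, initialising a language model's input embeddings from vectors trained on the in-domain corpus and then freezing them, can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' released code; `in_domain_vectors`, the vocabulary handling, and the random fallback for words without a pre-trained vector are assumptions.

```python
import torch
import torch.nn as nn

def build_frozen_embedding(vocab, in_domain_vectors, dim):
    """Build an input embedding layer initialised from in-domain word vectors
    (e.g. word2vec/GloVe trained on the 10-100M-token domain corpus) and freeze it."""
    weight = torch.zeros(len(vocab), dim)
    for idx, token in enumerate(vocab):
        vec = in_domain_vectors.get(token)  # hypothetical {token: vector} mapping
        if vec is not None:
            weight[idx] = torch.tensor(vec)
        else:
            # Tokens without an in-domain vector get a small random initialisation.
            weight[idx] = torch.randn(dim) * 0.1
    # freeze=True keeps the input embeddings fixed during language model training.
    return nn.Embedding.from_pretrained(weight, freeze=True)
```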

Citations: Cited by 2 publications (2 citation statements)
References: 22 publications
“…For the input embeddings of the newly added tokens, we adopt the approach of using the average embeddings of the subword tokens that make up these new tokens as in (Hewitt, 2021; Welch et al., 2020). This method utilizes the semantic richness of the model's existing subword embeddings to offer a meaningful starting point for the new tokens' representations.…”
Section: Preliminary 2: Subword-based Embeddings Initialization
confidence: 99%
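
The subword-averaging initialisation described in the excerpt above can be sketched as follows, assuming a HuggingFace-style tokenizer/model pair; `init_new_token_embeddings` and its arguments are illustrative names, not an API from the cited papers.

```python
import torch

def init_new_token_embeddings(model, tokenizer, new_tokens):
    # Decompose each new token into subword pieces *before* extending the vocab,
    # so the pieces come from the existing subword inventory.
    piece_ids_per_token = [
        tokenizer.encode(t, add_special_tokens=False) for t in new_tokens
    ]
    old_vocab_size = len(tokenizer)
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for i, piece_ids in enumerate(piece_ids_per_token):
            if piece_ids:
                # New token's input embedding = mean of its subword embeddings.
                emb[old_vocab_size + i] = emb[piece_ids].mean(dim=0)
```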
“…We explored various hyperparameter configurations on our validation set and found the best results using dropout with the same mask for generic and demographic-specific embeddings, untied weights, and fixed input embeddings. Untying and fixing input embeddings is supported by concurrent work (Welch et al, 2020b). Each model is trained for 50 epochs.…”
Section: Language Modeling
confidence: 99%
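
A minimal sketch of the configuration mentioned in the excerpt above, untied input/output weights together with fixed, pre-initialised input embeddings, in a standard LSTM language model. The layer sizes and the `embedding_weights` tensor are placeholders, not hyperparameters from the cited work.

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, embedding_weights, hidden_size=650):
        super().__init__()
        vocab_size, emb_size = embedding_weights.shape
        # Fixed input embeddings: initialised from pre-trained vectors, no gradient updates.
        self.encoder = nn.Embedding.from_pretrained(embedding_weights, freeze=True)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=2, batch_first=True)
        # Untied weights: the output projection is a separate, trainable matrix
        # (no weight sharing with self.encoder).
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.encoder(tokens)
        output, hidden = self.lstm(emb, hidden)
        return self.decoder(output), hidden
```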