AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Zhang, Xinsong; Li, Pengshuai; Li, Hang

doi:10.18653/v1/2021.findings-acl.37

Cited by 21 publications

(8 citation statements)

References 21 publications


“…Others have built hybrid models that use multiple granularities, combining characters with tokens (Luong and Manning, 2016) or different subword vocabularies (Zhang and Li, 2021).…”

Section: Improvements To Subword Tokenization (mentioning)

confidence: 99%

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Clark

Garrette

Turc

et al. 2022

Transactions of the Association for Computational Linguistics


Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.



“…Joint and hybrid tokenization approaches combine coarse and fine-grained representations to incorporate Word-level and subword representations [Hiraoka et al 2021]. Multi-grained tokenization methods are incorporated into the model architecture to capture multi-word representations, such as ice cream, at the expense of increased computational complexity [Zhang et al 2021a]. Enabling a gradient-based learnable representation in the tokenization step of the pipeline is an emerging line of research [Tay et al 2021].…”

Section: Tokenization Algorithms (mentioning)

confidence: 99%

Impact of Tokenization on Language Models: An Analysis for Turkish

Toraman

Yılmaz

Şahinuç

et al. 2022

Preprint


Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.


“…Ma et al (2020) uses convolutional neural networks (Kim, 2014) on characters to calculate word representations. Zhang and Li (2020) propose to add phrases into the vocabulary for Chinese pretrained language models. However, they focus on improving the vocabulary of pretrained representations of a single language, and they require modification to the model pretraining stage.…”

Section: Related Work (mentioning)

confidence: 99%

“…Bostrom and Durrett (2020) empirically compare several popular word segmentation algorithms for pretrained language models of a single language. Several works propose to use different representation granularities, such as phrase-level segmentation (Zhang and Li, 2020) or character-aware representations (Ma et al, 2020) for pretrained language models of a single high-resource language, such as English or Chinese only. However, it is not a foregone conclusion that methods designed and tested on monolingual models will be immediately applicable to multilingual representations.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-view Subword Regularization

Wang

Ruder

Neubig

2021

Preprint


Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations improves the effectiveness of cross-lingual transfer. Second, to take full advantage of different possible input segmentations, we propose Multi-view Subword Regularization (MVR), a method that enforces the consistency between predictions of using inputs tokenized by the standard and probabilistic segmentations. Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.

