Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.37

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Abstract: Pre-trained language models such as BERT have exhibited remarkable performance in many natural language understanding (NLU) tasks. The tokens in the models are usually fine-grained in the sense that for languages like English they are words or subwords and for languages like Chinese they are characters. In English, for example, there are multi-word expressions which form natural lexical units, and thus the use of coarse-grained tokenization also appears to be reasonable. In fact, both fine-grained and coarse-grained…
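As a rough illustration of the contrast the abstract draws, the sketch below tokenizes the same English sentence at two granularities: a fine-grained view with one token per word, and a coarse-grained view that merges multi-word expressions into single tokens. The phrase lexicon and the greedy longest-match rule are illustrative assumptions, not AMBERT's actual vocabularies or segmenter.

```python
# Illustrative sketch only: fine-grained vs. coarse-grained tokenization as
# described in the abstract. The phrase lexicon and the greedy longest-match
# rule are assumptions for demonstration, not AMBERT's actual vocabularies.

COARSE_LEXICON = {("new", "york", "times"), ("ice", "cream")}  # multi-word expressions
MAX_PHRASE_LEN = 3


def fine_grained(words):
    """Fine-grained view: one token per word (characters would play this role for Chinese)."""
    return list(words)


def coarse_grained(words):
    """Coarse-grained view: greedily merge known multi-word expressions into single tokens."""
    tokens, i = [], 0
    while i < len(words):
        for span in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + span]) in COARSE_LEXICON:
                tokens.append("_".join(words[i:i + span]))
                i += span
                break
        else:  # no known phrase starts here; keep the single word
            tokens.append(words[i])
            i += 1
    return tokens


words = "the new york times sells ice cream".split()
print(fine_grained(words))    # ['the', 'new', 'york', 'times', 'sells', 'ice', 'cream']
print(coarse_grained(words))  # ['the', 'new_york_times', 'sells', 'ice_cream']
```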

Cited by 21 publications (8 citation statements)
References 21 publications
“…Others have built hybrid models that use multiple granularities, combining characters with tokens (Luong and Manning, 2016) or different subword vocabularies (Zhang and Li, 2021).…”
Section: Improvements to Subword Tokenization (mentioning)
confidence: 99%
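As a loose illustration of the character-with-token hybrid this excerpt refers to (in the spirit of Luong and Manning, 2016), the sketch below keeps in-vocabulary words as single tokens and backs off to characters for out-of-vocabulary words. The toy vocabulary and the `<c>` marker are assumptions for demonstration, not the cited system's implementation.

```python
# Minimal sketch of a hybrid word/character tokenizer: frequent words stay as
# single tokens; out-of-vocabulary words fall back to character tokens.
# The vocabulary and the <c> marker are illustrative assumptions.

WORD_VOCAB = {"the", "model", "tokenizes", "rare", "words"}


def hybrid_tokenize(sentence):
    tokens = []
    for word in sentence.lower().split():
        if word in WORD_VOCAB:
            tokens.append(word)                       # word-level token
        else:
            tokens.extend(f"<c>{ch}" for ch in word)  # character back-off for OOV words
    return tokens


print(hybrid_tokenize("The model tokenizes zymurgy"))
# ['the', 'model', 'tokenizes', '<c>z', '<c>y', '<c>m', '<c>u', '<c>r', '<c>g', '<c>y']
```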
“…Joint and hybrid tokenization approaches combine coarse and fine-grained representations to incorporate Word-level and subword representations [Hiraoka et al 2021]. Multi-grained tokenization methods are incorporated into the model architecture to capture multi-word representations, such as ice cream, at the expense of increased computational complexity [Zhang et al 2021a]. Enabling a gradient-based learnable representation in the tokenization step of the pipeline is an emerging line of research [Tay et al 2021].…”
Section: Tokenization Algorithms (mentioning)
confidence: 99%
“…Ma et al (2020) uses convolutional neural networks (Kim, 2014) on characters to calculate word representations. Zhang and Li (2020) propose to add phrases into the vocabulary for Chinese pretrained language models. However, they focus on improving the vocabulary of pretrained representations of a single language, and they require modification to the model pretraining stage.…”
Section: Related Work (mentioning)
confidence: 99%
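The excerpt above mentions building word representations from characters with a convolutional network (in the spirit of Kim, 2014). The sketch below is a hedged, minimal version of that idea: embed the characters of a word, apply a 1-D convolution, and max-pool over character positions to obtain a word vector. All dimensions, the vocabulary size, and the kernel width are illustrative assumptions, not values from the cited papers.

```python
# Hedged sketch of a character-level CNN word encoder (illustrative values only).
import torch
import torch.nn as nn


class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=128, char_dim=16, word_dim=64, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=kernel, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) integer character ids for one word each
        x = self.char_emb(char_ids)        # (batch, max_word_len, char_dim)
        x = self.conv(x.transpose(1, 2))   # (batch, word_dim, max_word_len)
        return x.max(dim=2).values         # max-pool over characters -> word vector


encoder = CharCNNWordEncoder()
word = torch.tensor([[ord(c) for c in "tokenization"]])  # toy character ids
print(encoder(word).shape)                               # torch.Size([1, 64])
```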
“…Bostrom and Durrett (2020) empirically compare several popular word segmentation algorithms for pretrained language models of a single language. Several works propose to use different representation granularities, such as phrase-level segmentation (Zhang and Li, 2020) or character-aware representations (Ma et al, 2020) for pretrained language models of a single highresource language, such as English or Chinese only. However, it is not a foregone conclusion that methods designed and tested on monolingual models will be immediately applicable to multilingual representations.…”
Section: Introduction (mentioning)
confidence: 99%