2021
DOI: 10.48550/arxiv.2103.08490
Preprint

Multi-view Subword Regularization

Abstract: Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual rep…
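As a rough illustration of the mechanism the abstract refers to, the sketch below samples alternative subword segmentations from a SentencePiece unigram model, which is how subword regularization (Kudo, 2018) is typically exposed in the sentencepiece Python bindings. The model file name, example text, and hyperparameter values are illustrative assumptions, not taken from the paper.

import sentencepiece as spm

# Load a trained unigram model; "multilingual.model" is a placeholder path.
sp = spm.SentencePieceProcessor(model_file="multilingual.model")

text = "Multilingual pretrained representations"

# Deterministic 1-best segmentation (the usual behaviour at inference time).
print(sp.encode(text, out_type=str))

# Stochastic segmentation: each call samples one of the many plausible
# tokenizations. nbest_size=-1 samples from the full lattice and alpha
# controls how peaked the sampling distribution is.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

Feeding such sampled segmentations to the model during fine-tuning, rather than the single deterministic segmentation, is what the abstract means by applying subword regularization at fine-tuning time.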

Cited by 2 publications (2 citation statements) | References 29 publications
“…Subword regularization (Kudo, 2018) and BPE-dropout (Provilkov et al., 2020) recognize that deterministic segmentation during training limits the ability to leverage morphology and word composition; instead, they sample at random one of the multiple tokenizations of the training input, made possible by the inherent ambiguity of subword vocabularies. Wang et al. (2021) recently expanded on this paradigm to enforce consistency of predictions over different segmentations. Unigram LM (Kudo, 2018), a segmentation technique that builds its vocabulary top-down, was shown to align with morphology better than BPE on modern pre-trained encoders (Bostrom and Durrett, 2020).…”
Section: Improvements to Subword Tokenization
mentioning
confidence: 99%
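The statement above notes that Wang et al. (2021) enforce consistency of predictions over different segmentations. A minimal sketch of what such a consistency objective could look like is given below, assuming a PyTorch classifier that produces logits for a deterministic view and a sampled view of the same input; the loss weighting and the symmetric-KL form are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn.functional as F

def multi_view_loss(logits_det, logits_samp, labels, consistency_weight=1.0):
    # Supervised cross-entropy on both segmentation views.
    ce = F.cross_entropy(logits_det, labels) + F.cross_entropy(logits_samp, labels)
    # Symmetric KL divergence pulls the two predictive distributions together,
    # so the model gives consistent answers regardless of the segmentation.
    log_p = F.log_softmax(logits_det, dim=-1)
    log_q = F.log_softmax(logits_samp, dim=-1)
    kl = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean") \
       + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    return ce + consistency_weight * kl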
“…work with code bases by Wolf et al. (2020) and Wang et al. (2021), multilingual BERT (mBERT, Devlin et al. (2019)), and the data sets' default splits. Most of the corpora we work with were provided by the Universal Dependencies project (UD, Nivre et al. (2016)).…”
mentioning
confidence: 99%