2019
DOI: 10.48550/arxiv.1910.13267
Preprint

BPE-Dropout: Simple and Effective Subword Regularization

Cited by 14 publications (23 citation statements) | References 0 publications
“…This is contrary to the usual tradeoff between the two terms. We speculate that for smaller values of β, the noise from the relaxation causes the optimizer to reduce codebook usage toward the beginning of training, resulting in a worse ELB at convergence. During training, we apply 10% BPE dropout (Provilkov et al., 2019), whose use is common in the neural machine translation literature. Strictly speaking, Equation 1 requires us to sample from the categorical distribution specified by the dVAE encoder logits, rather than taking the argmax. In preliminary experiments on ImageNet, we found that this was a useful regularizer in the overparameterized regime, and allows the transformer to be trained using soft targets for the cross-entropy loss.…”
mentioning
confidence: 99%
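The 10% BPE dropout referenced in this excerpt is the regularization proposed in the cited paper: while a word is being segmented, each eligible merge is skipped with some probability, so the same word can receive different segmentations across training steps. A minimal sketch of that idea in Python, assuming a simplified one-merge-per-step loop with illustrative names, not the authors' reference implementation:

```python
import random

def bpe_dropout_encode(word, merges, dropout=0.1, rng=random):
    # `merges` is an ordered list of (left, right) pairs, most frequent first,
    # as produced by standard BPE learning. With dropout=0.0 this reduces to
    # deterministic BPE; with dropout>0 each call may split the word differently.
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # All adjacent symbol pairs that appear in the merge table.
        pairs = [(ranks[(symbols[i], symbols[i + 1])], i)
                 for i in range(len(symbols) - 1)
                 if (symbols[i], symbols[i + 1]) in ranks]
        if not pairs:
            break
        # BPE-dropout: each candidate merge is skipped with probability `dropout`.
        kept = [p for p in pairs if rng.random() >= dropout]
        if not kept:
            break  # every merge was dropped this round; keep the finer segmentation
        _, i = min(kept)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "l"), ("e", "ll"), ("h", "ell"), ("hell", "o")]
print(bpe_dropout_encode("hello", merges, dropout=0.0))  # ['hello'] (plain BPE)
print(bpe_dropout_encode("hello", merges, dropout=0.5))  # e.g. ['h', 'e', 'll', 'o']
```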
“…For the En ↔ Ru Biomedical translation task, we learn a separate BPE tokenizer solely on our Biomedical Task Data described in Section 2.2. We use BPE-dropout (Provilkov et al., 2019) of 0.1 for both language pairs and tasks. We post-process the En → De model's generated translations to replace quotes with their German equivalents, „ and “.…”
Section: Data Pre-processing and Post-processing
mentioning
confidence: 99%
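A hedged sketch of the quote post-processing step mentioned in this excerpt, assuming the model's straight ASCII quotes alternate between opening and closing positions; the function name and logic are illustrative, not the cited system's actual script:

```python
def germanize_quotes(text: str) -> str:
    # Replace straight ASCII double quotes with German-style quotation marks,
    # treating odd occurrences as opening („) and even occurrences as closing (“).
    out, opening = [], True
    for ch in text:
        if ch == '"':
            out.append('„' if opening else '“')
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)

print(germanize_quotes('Er sagte "hallo" zu mir.'))  # Er sagte „hallo“ zu mir.
```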
“…2. Stochastic algorithms (Kudo, 2018; Provilkov et al., 2019) draw multiple segmentations for source and target sequences, resorting to randomization to improve robustness and generalization of translation models. 3.…”
Section: Introduction
mentioning
confidence: 99%
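Both stochastic schemes cited here, unigram subword sampling (Kudo, 2018) and BPE-dropout (Provilkov et al., 2019), are exposed through the sentencepiece library's sampling interface. A minimal sketch, assuming a model file trained beforehand; the model path, sample sentence, and alpha value are placeholders:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="subword.model")  # hypothetical model path

# With enable_sampling=True a new segmentation is drawn on every call:
# for a unigram model this is Kudo (2018) subword sampling, while for a BPE model
# alpha acts as the BPE-dropout probability.
for _ in range(3):
    print(sp.encode("machine translation is fun", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```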
“…The performance of a standard subword transformer (Vaswani et al., 2017) trained on WMT datasets tokenized using DPE is compared against Byte Pair Encoding (BPE) (Sennrich et al., 2016) and BPE dropout (Provilkov et al., 2019). Empirical results on English ↔ (German, Romanian, Estonian, Finnish, Hungarian) suggest that stochastic subword segmentation is effective for tokenizing source sentences, whereas deterministic DPE is superior for segmenting target sentences.…”
Section: Introduction
mentioning
confidence: 99%