Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.170

BPE-Dropout: Simple and Effective Subword Regularization

Abstract: Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only…
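The idea sketched in the abstract can be illustrated with a toy re-implementation of BPE merge application: deterministic BPE always produces the same split, while randomly dropping candidate merges yields different segmentations of the same word with the same vocabulary. The snippet below is a simplified sketch of that idea, not the paper's exact algorithm; the merge table is a hypothetical example.

```python
import random

# Hypothetical toy merge table, in priority order (rank 0 = highest priority).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_segment(word, merges, dropout=0.0):
    """Greedy BPE merge application. With dropout > 0, each candidate merge is
    skipped with probability `dropout`, so the segmentation becomes stochastic."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Adjacent pairs that are in the merge table and survive the dropout draw.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and random.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_segment("lower", MERGES))               # deterministic: always ['lower']
print(bpe_segment("lower", MERGES, dropout=0.5))  # e.g. ['low', 'er'] or ['lo', 'w', 'e', 'r']
```

With dropout set to 0 the procedure reduces to standard BPE; with a small positive value the same word is exposed to the model under many different segmentations during training.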

Cited by 167 publications (189 citation statements)
References 21 publications
“…In other words, we replaced the unigram language model in OpTok with the SentencePiece tokenizer and used one tokenized sentence as an input to the same architecture. Moreover, many studies have reported that training models with stochastic tokenization leads to better performance on downstream tasks than training a model with deterministic tokenization (Kudo, 2018; Hiraoka et al., 2019; Provilkov et al., 2019). Thus, we trained the encoder and downstream model using the subword regularization provided by SentencePiece.…”
Section: Experimental Settings
confidence: 99%
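The stochastic tokenization this excerpt refers to can be reproduced with SentencePiece's sampling interface. Below is a minimal sketch, assuming a unigram SentencePiece model has already been trained and saved as spm_unigram.model (a placeholder path); enable_sampling draws a segmentation from the n-best list instead of returning the single most likely one.

```python
import sentencepiece as spm

# Placeholder path: assumes a unigram SentencePiece model trained beforehand.
sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

text = "subword regularization samples a different segmentation each time"

# Deterministic (most likely) segmentation.
print(sp.encode(text, out_type=str))

# Stochastic segmentation: nbest_size=-1 samples over all hypotheses,
# alpha is a temperature-like smoothing parameter.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True,
                    nbest_size=-1, alpha=0.1))
```

Smaller alpha values flatten the sampling distribution and produce more diverse segmentations; values close to 1 concentrate the samples on the most likely segmentation.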
“…Thus, as shown in Figure 1(a), we apply an existing tokenizer to the given sentence, and then input the tokenized sentence into a model for a target downstream task. In the conventional approach, we obtain the most plausible tokenized sentence based on the tokenizer; however, some studies have varied the tokenization using sampling during training to enable the downstream model to adapt to various tokenizations (Kudo, 2018; Hiraoka et al., 2019; Provilkov et al., 2019). [Figure 1: Overview of (a) conventional tokenization and (b) optimizing tokenization proposed herein.] We directly optimize the tokenizer to improve the performance of the model for a downstream task using the loss of the target task.…”
Section: Introduction
confidence: 99%
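The sampling-during-training setup described above can be sketched as a data-pipeline step that re-tokenizes the corpus for every epoch. The snippet below is only an illustration of that idea, assuming the same placeholder spm_unigram.model as above and a hypothetical model.train_step interface; it is not the OpTok implementation.

```python
import sentencepiece as spm

# Placeholder model file; assumes a unigram SentencePiece model exists.
sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

def make_epoch_batches(sentences, batch_size=32):
    """Re-tokenize the corpus with sampling so that every epoch sees a
    different segmentation of the same sentences."""
    encoded = [sp.encode(s, out_type=int, enable_sampling=True,
                         nbest_size=-1, alpha=0.1) for s in sentences]
    return [encoded[i:i + batch_size] for i in range(0, len(encoded), batch_size)]

# Hypothetical training loop; `model.train_step` stands in for whatever
# downstream model is being trained.
# for epoch in range(num_epochs):
#     for batch in make_epoch_batches(train_sentences):
#         model.train_step(batch)
```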
“…Kudo (2018) introduced the training method of subword regularization. Most recently, BPE-dropout (Provilkov et al., 2019) was introduced, which modifies the original BPE's encoding process to enable stochastic segmentation. Our work shares with previous works the motivation of exposing diverse subword candidates to NMT models, but differs in that our method uses gradient signals.…”
Section: Related Work
confidence: 99%
“…In this regard, Kudo (2018) proposed subword regularization, a training method that exposes multiple segmentations using a unigram language model. Starting from machine translation, it has been shown that subword regularization can improve the robustness of NLP models in various tasks (Kim, 2019; Provilkov et al., 2019; Drexler and Glass, 2019; Müller et al., 2019).…”
Section: Introduction
confidence: 99%
“…We use the implementation in the YouTokenToMe library. It is fast and offers the BPE-dropout (Provilkov et al., 2019) regularization technique.…”
Section: Text Encoding Considerations
confidence: 99%
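As a concrete illustration of the excerpt above, the sketch below trains a BPE model with YouTokenToMe and encodes a sentence with and without BPE-dropout. The file names train.txt and bpe.model are placeholders, and the vocabulary size is an arbitrary example value.

```python
import youtokentome as yttm

# Placeholder paths: assumes a plain-text training corpus at train.txt and
# writes the learned merges to bpe.model.
yttm.BPE.train(data="train.txt", vocab_size=8000, model="bpe.model")

bpe = yttm.BPE(model="bpe.model")
sentence = "BPE dropout makes segmentation stochastic"

# Deterministic BPE segmentation.
print(bpe.encode([sentence], output_type=yttm.OutputType.SUBWORD))

# BPE-dropout: each merge is skipped with the given probability during encoding,
# so repeated calls can return different segmentations.
print(bpe.encode([sentence], output_type=yttm.OutputType.SUBWORD, dropout_prob=0.1))
```

The dropout_prob argument controls how often merges are skipped at encoding time; a small value such as 0.1 is the typical choice, while dropout_prob=0 recovers standard deterministic BPE.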