Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1141

Investigating the Effectiveness of BPE: The Power of Shorter Sequences

Abstract: Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary size show that, given a fixed vocabulary size budget, the fewer to…
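As context for the abstract's framing of BPE as a dictionary-based compression algorithm, here is a minimal sketch of the greedy merge loop on a toy corpus. The corpus, the merge budget, and helper names such as `learn_bpe` are illustrative assumptions, not artifacts of the paper.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, corpus):
    """Replace each occurrence of `pair` by its concatenation."""
    merged = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Greedily learn up to `num_merges` merges (the 'dictionary')."""
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        corpus = apply_merge(best, corpus)
        merges.append(best)
    return merges, corpus

# Toy word-frequency table; words are pre-split into characters plus an
# end-of-word marker, as in the usual BPE formulation.
toy = {"l o w </w>": 5, "l o w e r </w>": 2,
       "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, segmented = learn_bpe(toy, num_merges=10)
print(merges)      # learned merge operations, most frequent first
print(segmented)   # corpus segmented with the learned vocabulary
```

Each merge adds one entry to the vocabulary and shortens the segmented corpus, which is the trade-off the paper studies under a fixed vocabulary budget.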

Cited by 24 publications (16 citation statements, 2020–2024)
References 30 publications

Citation statements
“…Domingo et al (2018) performed further experiments to investigate the effects of tokenization on neural machine translation, but used a shared BPE vocabulary across all experiments. Gallé (2019) examined algorithms in the BPE family, but did not compare to unigram language modeling.…”
Section: Introduction (mentioning)
confidence: 99%
“…Ding et al (2019) find that smaller numbers of subword merges give better performance for low-resource language pairs, as in a low-resource domain segmentation must be more aggressive for individual subwords to remain frequent. Gallé (2019) and Salesky et al (2020) similarly find that BPE is most effective when giving set coverage with high-frequency subwords while keeping the overall sequence lengths short. We note that this idea of very frequent or rare tokens upsetting model performance becomes important when selecting data for adaptation (section 4).…”
Section: Subword Vocabularies (mentioning)
confidence: 84%
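The trade-off described in the preceding statement (merge budget vs. sequence length vs. subword frequency) can be observed directly. The sketch below uses the Hugging Face `tokenizers` library as a stand-in tokenizer, with a toy corpus and arbitrary vocabulary sizes; none of these choices come from the cited papers.

```python
from collections import Counter
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny illustrative corpus; a real comparison would use the training data
# of a low-resource language pair.
train_lines = ["the lowest newest widest rates",
               "new lower rates were widely noted"] * 50
heldout = "newer rates lowered widely"

def train_bpe(vocab_size):
    """Train a BPE tokenizer with the given vocabulary budget."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(train_lines, trainer)
    return tok

for vocab_size in (60, 200):
    tok = train_bpe(vocab_size)
    # Frequency of each subword over the training data.
    counts = Counter(t for line in train_lines for t in tok.encode(line).tokens)
    enc = tok.encode(heldout)
    print(f"vocab={vocab_size}: held-out length={len(enc.tokens)} tokens, "
          f"rarest training subword seen {min(counts.values())} times")
```

With the smaller budget the held-out sentence is split into more, higher-frequency pieces; with the larger budget sequences are shorter but the rarest subwords occur less often, which is the tension the cited works point to.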
“…As a possible explanation for the good performance of BPE, Gallé (2019) claims that the performance of BPE is related to its compression capacity: with respect to members from the same class of compression algorithm, BPE performs close to the top in data compression benchmarks.
Section: Comparing Morphological Segmentation to BPE and Friends (mentioning)
confidence: 99%
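One crude way to read the "compression capacity" claim above is to treat the learned segmentation as a dictionary coder and measure how many symbols it emits per underlying character. The segmentations below are hand-written toy examples, not results from the paper.

```python
def symbols_per_char(segmented_corpus):
    """Average number of symbols emitted per underlying character
    (lower means stronger compression)."""
    total_symbols = sum(len(word.split()) * freq
                        for word, freq in segmented_corpus.items())
    total_chars = sum(len(word.replace(" ", "")) * freq
                      for word, freq in segmented_corpus.items())
    return total_symbols / total_chars

# Hand-written toy segmentations: character-level vs. after a few merges.
char_level = {"l o w e s t </w>": 4, "n e w e s t </w>": 6}
after_bpe  = {"low est</w>": 4, "new est</w>": 6}
print(symbols_per_char(char_level))  # 0.7 -> each word uses 7 symbols for 10 characters
print(symbols_per_char(after_bpe))   # 0.2 -> each word uses 2 symbols for 10 characters
```

A tokenizer that achieves a lower ratio on held-out text under the same vocabulary budget is, in this view, a better compressor, which is one way to operationalize the coverage notion the statement refers to.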
“…In language modeling, van Merriënboer et al (2017) were the first to apply BPE to language modeling and show that a BPE-based baseline beat all state-of-the-art and even their proposed model on some languages, but the idea didn't really take off until really put to the test by state-of-the-art models like the GPT models (Radford et al, 2018) and BERT (Devlin et al, 2018). See Gallé (2019) for more historical connection and corresponding analyses, e.g., its linear-time implementation by Larsson and Moffat (2000).”
(mentioning)
confidence: 99%