Findings of the Association for Computational Linguistics: ACL 2023
DOI: 10.18653/v1/2023.findings-acl.38

A Formal Perspective on Byte-Pair Encoding

Abstract: Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\mu^\star)}\left(1 - e^{-\sigma(\mu^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\mu^\star)$ is the total ba…
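
To make the greedy procedure mentioned in the abstract concrete, here is a minimal Python sketch of iterative greedy BPE on a single string. It is an illustration under stated assumptions, not the paper's implementation: the names `apply_merge` and `greedy_bpe` are hypothetical, pairs are counted over adjacent positions (so overlapping occurrences such as those in "aaa" are counted more than once), and ties between equally frequent pairs are broken arbitrarily.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair` in `seq`, scanning left to right."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Return (merge sequence, final token sequence) for one input string.

    Assumption: each step picks the most frequent adjacent pair, which is
    the usual greedy heuristic; this sketch makes no optimality claim.
    """
    seq = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs (simplified: overlapping occurrences are counted).
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break  # fewer than two tokens left, nothing to merge
        best, _count = pair_counts.most_common(1)[0]
        merges.append(best)
        seq = apply_merge(seq, best)
    return merges, seq


if __name__ == "__main__":
    merges, tokens = greedy_bpe("abababc", 2)
    print(merges)
    print(tokens)
```

Each merge shortens the token sequence, and the merge budget `num_merges` plays the role of the vocabulary budget; the approximation guarantee quoted in the abstract concerns how close such a greedily chosen merge sequence comes to an optimal one.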

Cited by 9 publications
(1 citation statement)
“…This procedure is repeated until we fill the predefined vocabulary budget. Zouhar et al (2023) show that this greedy approach is approximately optimal when searching for the vocabulary (merge sequence). We are however interested in intentionally suboptimal vocabularies.…”
Section: E1 Byte-pair Encoding (mentioning)
confidence: 99%