Languages Through the Looking Glass of BPE Compression

Gutierrez-Vasques, Ximena; Bentz, Christian; Samardžić, Tanja

doi:10.1162/coli_a_00489

Cited by 4 publications

References 85 publications

A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models

Lindsey, Pershing, Habib et al. 2024. Preprint

Genomic language models have recently emerged as powerful tools to decode and interpret genetic sequences. Existing genomic language models have utilized various tokenization methods including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic models have significant differences from natural language and protein language models because of their low character variability, complex and overlapping features, and inconsistent directionality. These differences make sub-word tokenization in genomic language models significantly different from traditional language models. This study explores the impact of tokenization in attention-based and state-space genomic language models by evaluating their downstream performance on various fine-tuning tasks. We propose new definitions for fertility, the token per word ratio, in the context of genomic language models, and introduce tokenization parity, which measures how consistently a tokenizer parses homologous sequences. We also perform an ablation study on the state-space model, Mamba, to evaluate the impact of character-based tokenization compared to byte-pair encoding. Our results indicate that the choice of tokenizer significantly impacts model performance and that when experiments control for input sequence length, character tokenization is the best choice in state-space models for all evaluated task categories except epigenetic mark prediction.

Still No Evidence for an Effect of the Proportion of Non-Native Speakers on Natural Language Complexity

Koplenig 2024. Entropy

In a recent study, I demonstrated that large numbers of L2 (second language) speakers do not appear to influence the morphological or information-theoretic complexity of natural languages. This paper has three primary aims: First, I address recent criticisms of my analyses, showing that the points raised by my critics were already explicitly considered and analysed in my original work. Furthermore, I show that the proposed alternative analyses fail to withstand detailed examination. Second, I introduce new data on the information-theoretic complexity of natural languages, with the estimates derived from various language models—ranging from simple statistical models to advanced neural networks—based on a database of 40 multilingual text collections that represent a wide range of text types. Third, I re-analyse the information-theoretic and morphological complexity data using novel methods that better account for model uncertainty in parameter estimation, as well as the genealogical relatedness and geographic proximity of languages. In line with my earlier findings, the results show no evidence that large numbers of L2 speakers have an effect on natural language complexity.

A Cross-linguistic Analysis of the Effects of Character-level Information in Neural Models

Kurosawa, Yanaka 2024. Journal of Natural Language Processing
