2021
DOI: 10.48550/arxiv.2112.10508
Preprint
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Sabrina J. Mielke,
Zaid Alyafeai,
Elizabeth Salesky
et al.

Abstract: What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or byte-level process…

Cited by 22 publications (21 citation statements) · References 80 publications
“…a. Word Tokenization: The raw tweets, after preprocessing and cleaning, are broken down into the smallest recognizable words and punctuation marks, known as tokens [38]; the goal is to generate the list of words that is eventually used for word clouds, summarization, and sentiment analysis. The accuracy of this task is often influenced by the training vocabulary, unknown words, and out-of-vocabulary (OOV) words.…”
Section: Natural Language Processing (NLP) and Natural Language Under...
Citation type: mentioning, confidence: 99%
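The word-tokenization step quoted above can be illustrated with a minimal Python sketch. Nothing here is taken from the cited paper: the regex, the example sentence, and the tiny vocabulary are illustrative assumptions, meant only to show how tokens are produced and how out-of-vocabulary (OOV) words arise against a fixed training vocabulary.

```python
import re

def word_tokenize(text):
    # Split lowercased text into word and punctuation tokens
    # (illustrative regex, not the cited paper's exact method).
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical training vocabulary; real systems derive this from a corpus.
vocab = {"the", "movie", "was", "great", "!"}

tokens = word_tokenize("The movie was absolutely great!")
oov = [t for t in tokens if t not in vocab]

print(tokens)  # ['the', 'movie', 'was', 'absolutely', 'great', '!']
print(oov)     # ['absolutely'] -- an out-of-vocabulary (OOV) word
```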
“…An in-depth exploration of this topic is outside the scope of this survey, and we point to Mielke et al.'s work [6] for an excellent historical review of the evolution of tokenisers over recent years. As both tokenisation and classification approaches evolved in parallel, it is more common to associate conventional methods with pre-tokenisers.…”
Section: Tokenisation
Citation type: mentioning, confidence: 99%
“…Therefore, different approaches have been proposed, of which we highlight the most prominent. As a side note, it is fairly common for modern tokenisers to also apply normalisation operations within their procedures [6].…”
Section: Preprocessing For Deep Models
Citation type: mentioning, confidence: 99%
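As a rough illustration of the normalisation operations mentioned above, the following standard-library sketch applies Unicode NFKC normalisation, lowercasing, and whitespace cleanup before any segmentation takes place. The function name and the particular sequence of steps are assumptions for demonstration, not the survey's or the citing paper's recipe.

```python
import unicodedata

def normalize(text):
    # Typical pre-tokenisation normalisation: Unicode compatibility
    # normalisation, lowercasing, and whitespace collapsing.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return " ".join(text.split())

# Full-width characters are folded to ASCII by NFKC, then lowercased.
print(normalize("Ｔｏｋｅｎｉｓａｔｉｏｎ   matters"))  # 'tokenisation matters'
```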
“…Therefore, the positional information for each token can be introduced by concatenating the position encoding vector and the embedding vector. Notably, we adopt naive atom-based tokenization for our task, unlike the popular tokenization strategy [47] used on the translation task in NLP. Our approach has a constant, small-scale vocabulary for all tasks using SMILES.…”
Section: Approach
Citation type: mentioning, confidence: 99%
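The "naive atom-based tokenization" of SMILES strings described above can be sketched with a regular expression in the spirit of the atom-level patterns common in the molecular-transformer literature. The exact pattern and the aspirin example below are illustrative assumptions, not the cited paper's implementation; they show why such a tokeniser yields a small, constant vocabulary.

```python
import re

# Atom-level SMILES pattern: bracketed atoms, two-letter elements, ring-bond
# labels, and single-character symbols (illustrative, not the paper's exact regex).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[BCNOPSFIbcnops]|[0-9]"
    r"|\(|\)|=|#|-|\+|\\|/|\.|~|\*|\$|:)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: the tokens should reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognised SMILES characters"
    return tokens

# Aspirin: every atom, bond, and ring label becomes its own token,
# so the vocabulary stays small and constant across tasks.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```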