2021
DOI: 10.48550/arxiv.2112.10508
Preprint
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Sabrina J. Mielke,
Zaid Alyafeai,
Elizabeth Salesky
et al.

Abstract: What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or byte-level process…

Cited by 22 publications (21 citation statements) · References 80 publications
“…a. Word Tokenization: The raw tweets, after preprocessing and cleaning, are broken down into the smallest recognizable words and punctuation marks, known as tokens [38]; the goal is to generate the list of words that is eventually used for word clouds, summarization, and sentiment analysis. The accuracy of this task is often influenced by the training vocabulary, unknown words, and out-of-vocabulary (OOV) words.…”
Section: Natural Language Processing (NLP) and Natural Language Under...
Citation type: mentioning, confidence: 99%
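The word-tokenization step quoted above can be illustrated with a minimal Python sketch. Nothing here is taken from the cited paper: the regex, the example sentence, and the tiny vocabulary are illustrative assumptions, meant only to show how tokens are produced and how out-of-vocabulary (OOV) words arise against a fixed training vocabulary.

```python
import re

def word_tokenize(text):
    # Split lowercased text into word and punctuation tokens
    # (illustrative regex, not the cited paper's exact method).
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical training vocabulary; real systems derive this from a corpus.
vocab = {"the", "movie", "was", "great", "!"}

tokens = word_tokenize("The movie was absolutely great!")
oov = [t for t in tokens if t not in vocab]

print(tokens)  # ['the', 'movie', 'was', 'absolutely', 'great', '!']
print(oov)     # ['absolutely'] -- an out-of-vocabulary (OOV) word
```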
“…An in-depth exploration of this topic is outside the scope of this survey, and we point to Mielke et al.'s work [6] for an excellent historical review of the evolution of tokenisers over recent years. As both tokenisation and classification approaches evolved in parallel, it is more common to associate conventional methods with pre-tokenisers.…”
Section: Tokenisation
Citation type: mentioning, confidence: 99%
“…Therefore, different approaches have been proposed, of which we highlight the most prominent. As a side note, it is fairly common for modern tokenisers to also apply normalisation operations within their procedures [6].…”
Section: Preprocessing For Deep Models
Citation type: mentioning, confidence: 99%
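As a rough illustration of the normalisation operations mentioned above, the following standard-library sketch applies Unicode NFKC normalisation, lowercasing, and whitespace cleanup before any segmentation takes place. The function name and the particular sequence of steps are assumptions for demonstration, not the survey's or the citing paper's recipe.

```python
import unicodedata

def normalize(text):
    # Typical pre-tokenisation normalisation: Unicode compatibility
    # normalisation, lowercasing, and whitespace collapsing.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return " ".join(text.split())

# Full-width characters are folded to ASCII by NFKC, then lowercased.
print(normalize("Ｔｏｋｅｎｉｓａｔｉｏｎ   matters"))  # 'tokenisation matters'
```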
“…Therefore, the positional information for each token can be introduced by concatenating the position encoding vector and the embedding vector. Notably, we adopt naive atom-based tokenization for our task, unlike the popular tokenization strategy [47] used on the translation task in NLP. Our approach has a constant, small-scale vocabulary for all tasks using SMILES.…”
Section: Approach
Citation type: mentioning, confidence: 99%
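The "naive atom-based tokenization" of SMILES strings described above can be sketched with a regular expression in the spirit of the atom-level patterns common in the molecular-transformer literature. The exact pattern and the aspirin example below are illustrative assumptions, not the cited paper's implementation; they show why such a tokeniser yields a small, constant vocabulary.

```python
import re

# Atom-level SMILES pattern: bracketed atoms, two-letter elements, ring-bond
# labels, and single-character symbols (illustrative, not the paper's exact regex).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|%\d{2}|[BCNOPSFIbcnops]|[0-9]"
    r"|\(|\)|=|#|-|\+|\\|/|\.|~|\*|\$|:)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: the tokens should reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognised SMILES characters"
    return tokens

# Aspirin: every atom, bond, and ring label becomes its own token,
# so the vocabulary stays small and constant across tasks.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```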