2023
DOI: 10.1145/3599234
Tokenization of Tunisian Arabic: A Comparison between Three Machine Learning Models

Abstract: Tokenization is the process of segmenting a piece of text into smaller units called tokens. Since Arabic is an agglutinative language by nature, this treatment is a crucial preprocessing step for many Natural Language Processing (NLP) applications, such as morphological analysis, parsing, machine translation, and information extraction. In this paper, we investigate the word tokenization task together with a rewriting process that restores the orthography of the stem. For this task, we are using Tunisian Arabic (TA) …
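To make the task concrete, the sketch below shows what clitic-aware word tokenization means for Arabic text. This is purely illustrative and is not the paper's method (the paper compares machine learning models); the proclitic list and the greedy stripping rule are simplifying assumptions, and a naive rule-based segmenter like this will over-split real words.

```python
# Illustrative sketch only: a naive rule-based clitic segmenter for Arabic.
# The proclitic inventory below is a hypothetical, incomplete sample.
PROCLITICS = ("و", "ف", "ب", "ل", "ال")  # e.g. conjunction wa-, definite article al-

def segment(word):
    """Greedily strip known proclitics from the front of a word.

    Returns the detached clitics (marked with '+') followed by the stem.
    """
    tokens = []
    changed = True
    while changed:
        changed = False
        for clitic in PROCLITICS:
            # Only strip if something is left over to serve as the stem.
            if word.startswith(clitic) and len(word) > len(clitic):
                tokens.append(clitic + "+")
                word = word[len(clitic):]
                changed = True
                break
    tokens.append(word)
    return tokens

def tokenize(text):
    """Whitespace-split, then segment each word into clitics + stem."""
    out = []
    for word in text.split():
        out.extend(segment(word))
    return out

# "والكتاب" (wa+al+kitab, "and the book") splits into three tokens:
print(tokenize("والكتاب"))  # ['و+', 'ال+', 'كتاب']
```

Machine learning tokenizers, such as those compared in the paper, replace hand-written rules like these with models learned from segmented corpora, which also makes it possible to rewrite the stem's orthography after clitics are removed.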

Cited by 3 publications
References 17 publications