2021
DOI: 10.48550/arxiv.2102.11000
Preprint

An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction

ElMehdi Boujou,
Hamza Chataoui,
Abdellah El Mekki
et al.

Abstract: Natural Language Processing (NLP) is today a very active field of research and innovation. However, many applications need large data sets for supervised learning, suitably labelled for training. This includes applications for the Arabic language and its national dialects. Such open access labelled data sets in Arabic and its dialects are scarce in the data science ecosystem, and this scarcity can be a burden to innovation and research in this field. In this work, we present an open data set of…

Cited by 9 publications (11 citation statements)
References 10 publications
“…Ali et al [11] developed a method to distinguish between Arabic dialects by extracting bottleneck features from the i-vector framework together with phonetic and lexical data from a speech recognition system, which are then used to identify dialects in Arabic broadcast speech. Similarly, Boujou et al [12] introduced an open data set of social media content in Arabic dialects collected from the Twitter social network. The researchers then evaluated the dataset using four different classifiers, namely SGD, LR, NB, and linear SVC, with NB achieving the highest accuracy of 0.79.…”
Section: Related Studies
confidence: 99%
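The four-classifier comparison described above can be sketched with scikit-learn. This is a minimal illustrative setup, not the authors' actual pipeline: the toy texts, dialect labels, and character n-gram features are assumptions made for the example.

```python
# Hypothetical sketch of comparing SGD, LR, NB, and linear SVC on labelled
# dialect text. Texts, labels, and feature settings are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["wach katgoul", "chou fi", "ezayak", "wach rak", "shu hayda", "ezay el7al"]
labels = ["MA", "LB", "EG", "MA", "LB", "EG"]  # toy dialect labels

classifiers = {
    "SGD": SGDClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "NB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Character n-grams are a common choice for dialect identification,
    # since dialects often differ in spelling patterns rather than whole words.
    pipe = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)), clf)
    pipe.fit(texts, labels)
    scores[name] = pipe.score(texts, labels)  # training accuracy on the toy set

print(scores)
```

On a real dataset the models would of course be evaluated on a held-out test split rather than on the training data.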
“…Many attempts have been proposed in the area of automatic dialect identification (ADI). Early approaches were based on dictionaries, rules, and language modeling [5][6][7][8][9][10]; more recently, the field has shifted toward machine learning techniques [11][12][13][14][15][16][17][18][19][20][21][22][23][24], deep learning approaches [25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40], and transfer learning methods [41][42][43][44][45][46][47][48][49]. Many of these investigations utilize prominent and accessible datasets, such as MADAR [49], NADI [50][51]…”
Section: Introduction
confidence: 99%
“…Our observation revealed a scarcity of open-source datasets available for MD. Consequently, in our research, we opted to work with two selected datasets to address this limitation: the Modelling Simulation and Data Analysis (MSDA) dataset (Boujou et al, 2021) and the Moroccan Arabic Corpus (MAC) dataset (Garouani and Kharroubi, 2021). We also combined the two to obtain a balanced dataset, the MSDA-MAC dataset:…”
Section: Dataset
confidence: 99%
“…Noise removal is the process of removing characters, numbers, and pieces of text that may interfere with the analysis (Boujou et al, 2021). It is one of the most essential steps in text pre-processing.…”
Section: Data Preparation
confidence: 99%
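A noise-removal step of the kind described above can be sketched with plain regular expressions. The exact character classes removed here (URLs, mentions, hashtags, digits, punctuation) are an assumption for illustration, not the cited authors' precise rules.

```python
import re

# Hypothetical noise-removal helper: strips common social-media noise
# that may interfere with downstream analysis. The filtering choices
# below are illustrative assumptions.
def remove_noise(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    text = re.sub(r"\d+", " ", text)           # numbers
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(remove_noise("Check http://t.co/x @user #tag 123 !!"))  # → "Check"
```

Because `\w` is Unicode-aware in Python 3, Arabic letters survive this filtering while Latin punctuation and digits are removed.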
“…Another important step of the text pre-processing is removing the stop-words from the text (Boujou et al, 2021). Stop-words appear too frequently in any type of text.…”
Section: Data Preparation
confidence: 99%
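The stop-word removal step mentioned above amounts to filtering tokens against a predefined list. The tiny stop-word set below is an illustrative sample only; a real pipeline would use a full Arabic (and dialect-aware) stop-word list.

```python
# Minimal stop-word filtering sketch. STOP_WORDS here is a small
# illustrative sample, not a complete Arabic stop-word list.
STOP_WORDS = {"في", "من", "على", "و", "the", "a", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["هذا", "في", "النص", "من", "تويتر"]))
# → ["هذا", "النص", "تويتر"]
```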