Proceedings of the Third Arabic Natural Language Processing Workshop 2017
DOI: 10.18653/v1/w17-1317

Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties

Abstract: The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often challenging, as it requires a significant investment of time and money. In this paper, we devise a recipe for building large-scale speech corpora by harnessing Web resources, namely YouTube, other social media, online radio, and TV. We illustrate our methodology by building KALAM'DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. Th…

Cited by 16 publications (14 citation statements)
References 11 publications

“…Bougrine et al [9] introduced a preliminary version of KALAM'DZ; the corpus is limited to Web-based corpus of 8 Algerian dialects crawled from some Algerian TV and YouTube channels. The corpus encompasses eight major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented to utterances of at least 6 sec.…”
Section: Related Work (mentioning)
confidence: 99%
“…With the recent development of data-driven and deep learning-oriented approaches, one of the most important aspects to consider is to have access to a substantial volume of representative data. Indeed, the notion of "More data is better data" was born with the success of automatic recognition [9] where important amounts of training data are required [18]. The performance of systems depends mainly on their training corpus characteristics, which makes them an integral part of recognition systems [27].…”
Section: Introduction (mentioning)
confidence: 99%
“…The system achieved a total accuracy of 62.75% compared to 60.2% that was achieved by a similar system in [6]. For the Algerian Arabic dialect, a deep neural network based approach was introduced in [7] to evaluate a web based corpus for the dialects of Algeria KALAM'DZ [8]. The results showed that the DNN based approach and the support vector based approach performed similarly.…”
Section: Related Work (mentioning)
confidence: 99%
“…Compared with Twitter and Facebook, YouTube has been less examined by researchers; thus, previous research has not developed a best-practice scraping procedure for YouTube. The programme we used, youtube-dl (see https://rg3.github.io/youtube-dl/), has been used by a number of other studies (Botta et al 2016; Bougrine et al 2017; Tomàs-Buliart et al 2010; Schwemmer and Ziewiecki 2018). After entering a predetermined list of channel names and fields of information (e.g.…”
Section: For the Future (mentioning)
confidence: 99%