Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.107

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

Abstract: We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and webcrawled data using intensive data-mining techniques. Preliminary ex…

citations
Cited by 26 publications
(36 citation statements)
references
References 17 publications
“…The development of such models is a matter of high importance for the inclusion of communities, the preservation of endangered languages and more generally to support the rise of tailored NLP ecosystems for such languages (Schmidt and Wiegand, 2017; Stecklow, 2018; Seddah et al, 2020). In that regard, the advent of the Universal Dependencies project (Nivre et al, 2016) and the WikiAnn dataset (Pan et al, 2017) have greatly increased the number of covered languages by providing annotated datasets for more than 90 languages for dependency parsing and 282 languages for NER.…”
Section: Background and Motivation (mentioning)
confidence: 99%
“…OSCAR is a corpus extracted from a Common Crawl Web snapshot. It provides a significant amount of data for all the unseen languages we work with, except for Buryat, Meadow Mari, Erzya and Livvi for which we use Wikipedia dumps and for Narabizi, Naija and Faroese, for which we use data collected by Seddah et al (2020), Caron et al (2019) and Biemann et al (2007) respectively.…”
Section: Raw Data (mentioning)
confidence: 99%
“…We obtain the data for the part-of-speech tagging task from the Universal Dependencies project (Zeman et al, 2020), which currently gathers data annotated with universal POS tags for more than 90 languages, although there are differences in size and domain. For Algerian we use the annotations from Seddah et al (2020). We found no training sets available for Thai and Cantonese, hence we use them for testing only.…”
Section: Part-of-speech (mentioning)
confidence: 99%
“…There is also the NArabizi Treebank (Seddah et al, 2020) which includes partial morphological annotation where the total number of unique annotations is 46 in contrast to the SAGT Treebank which has 795 unique morphological annotations. Hence, we did not use this treebank in our study.…”
mentioning
confidence: 99%