Historically unwritten Arabic dialects are increasingly appearing online in social media texts and are often intermixed with other languages, including Modern Standard Arabic, English, and French. The next-generation analyst will need new capabilities to quickly distinguish among the languages appearing in a given text and to identify informative patterns of language switching that occur within a user's social network, patterns that may correspond to socio-cultural aspects such as participants' perceived and projected group identity. This paper presents work to (i) collect texts written in Moroccan Darija, a low-resource Arabic dialect from North Africa, and (ii) build an annotation tool that (iii) supports development of automatic language and dialect identification and (iv) provides social and information network visualizations of languages identified in tweet conversations.
Many languages, including Modern Standard Arabic (MSA), insert resumptive pronouns in relative clauses, whereas many others, such as English, do not, using empty categories instead. This discrepancy is a source of difficulty when translating between these languages because there are words in one language that correspond to empty categories in the other, and these words must be either inserted or deleted, depending on translation direction. In this paper, we first examine challenges presented by resumptive pronouns in MSA-English translations and review resumptive pronoun translations generated by a popular online MSA-English MT engine. We then present what is, to the best of our knowledge, the first system for automatic identification of resumptive pronouns. The system achieves 91.9 F1 and 77.8 F1 on Arabic Treebank data when using gold standard parses and automatic parses, respectively.