Modern Standard Arabic (MSA) is the formal language of most Arabic-speaking countries. Arabic dialects (AD), the language of daily life, differ from MSA, especially in social media communication. However, most Arabic social media texts mix forms and show many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for translating social media texts. More precisely, it focuses on the Tunisian Dialect of Arabic (TAD), with an application to the automatic machine translation of social media text into MSA and any other target language. Linguistic tools, such as a bilingual TAD-MSA lexicon and a set of grammatical mapping rules, are collaboratively constructed and exploited, together with a language model, to produce MSA sentences from Tunisian dialectal sentences. This work is a first step towards collaboratively constructed semantic and lexical resources for Arabic social media within the ASMAT (Arabic Social Media Analysis Tools) project.
Modern Standard Arabic, as well as the Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics could have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Third, without diacritics a word could have many possible parts of speech (POS) instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models, namely a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system gave the best automatic diacritization results, with a Word Error Rate (WER) of 21.44% for CRF against 34.6% for SMT.
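The word-level error measure reported above can be illustrated with a minimal sketch. It assumes the definition commonly used in diacritization evaluation: the reference and the system output contain the same words in the same order, and a word counts as an error if its diacritized form differs in any way from the reference. The abstract does not spell out its exact WER definition, so this is an assumption, and the function name `word_error_rate` and the romanized sample tokens are purely illustrative and not taken from the paper's data.

```python
def word_error_rate(reference_words, predicted_words):
    """Fraction of words whose predicted diacritized form differs from the reference.

    Assumes both sequences are aligned one-to-one, which holds when
    diacritization only adds marks and never inserts or deletes words.
    """
    if len(reference_words) != len(predicted_words):
        raise ValueError("reference and prediction must have the same number of words")
    if not reference_words:
        raise ValueError("reference must not be empty")
    errors = sum(1 for ref, pred in zip(reference_words, predicted_words) if ref != pred)
    return errors / len(reference_words)


# Illustrative romanized tokens (hypothetical, not from the paper's corpus):
# the first word receives the wrong vowel pattern, the other three are correct.
reference = ["kataba", "al-waladu", "risalatan", "tawilatan"]
predicted = ["kutiba", "al-waladu", "risalatan", "tawilatan"]

wer = word_error_rate(reference, predicted)
print(f"WER = {wer:.2%}")
```

Under this definition a single wrong diacritic makes the whole word count as an error, which is why word-level rates are typically higher than character-level rates for the same output.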
One of the existing difficulties in natural language processing applications is the lack of appropriate tools for the recognition, translation, and/or transliteration of named entities (NEs), specifically for less-resourced languages. In this paper, we propose a new method to automatically label multilingual parallel data for the Arabic-French language pair with named entity tags and to build lexicons of those named entities with their transliteration and/or translation in the target language. For this purpose, we bring in a third, well-resourced language, English, to serve as a pivot in building an Arabic-French NE translation lexicon. Evaluations on the Arabic-French language pair, using English as the pivot in the transitive model, showed the effectiveness of the proposed method for mining Arabic-French named entities and their translations. Moreover, integrating this component into a statistical machine translation system outperformed the baseline system.
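The pivot idea can be sketched as a simple join of two bilingual lexicons on their shared English side. This is only a toy illustration of transitive lexicon construction, not the paper's method: the function name `build_pivot_lexicon`, the case-insensitive matching on the pivot, and the sample entries are all assumptions made for the example.

```python
from collections import defaultdict


def build_pivot_lexicon(ar_en_pairs, en_fr_pairs):
    """Join Arabic-English and English-French NE pairs on the English pivot.

    Returns a dict mapping each Arabic entity to a sorted list of French
    candidates. Arabic entries whose English pivot has no French entry
    are dropped.
    """
    # Index French candidates by lowercased English pivot (matching is an assumption).
    en_to_fr = defaultdict(set)
    for english, french in en_fr_pairs:
        en_to_fr[english.lower()].add(french)

    ar_to_fr = defaultdict(set)
    for arabic, english in ar_en_pairs:
        for french in en_to_fr.get(english.lower(), ()):
            ar_to_fr[arabic].add(french)

    return {arabic: sorted(candidates) for arabic, candidates in ar_to_fr.items()}


# Toy entries chosen for illustration only (not from the paper's data).
ar_en = [("لندن", "London"), ("القاهرة", "Cairo"), ("الخرطوم", "Khartoum")]
en_fr = [("London", "Londres"), ("Cairo", "Le Caire")]

lexicon = build_pivot_lexicon(ar_en, en_fr)
for arabic, french_candidates in lexicon.items():
    print(arabic, "->", french_candidates)
```

A plain join like this propagates any ambiguity or error in either source lexicon, so a real system would need some way to score or filter the candidates it produces.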