A number of Arabic NLP (or Arabic NLP-related) workshops and conferences have taken place in the last few years, both in the Arab World and in association with international conferences. The Arabic NLP workshop at EACL 2017 follows in the footsteps of these previous efforts to provide a forum for researchers to share and discuss their ongoing work. This particular workshop is the third in a series, following the First Arabic NLP Workshop held at EMNLP 2014 in Doha, Qatar, and the Second Arabic NLP Workshop held at ACL 2015 in Beijing, China.

We received 47 submissions and selected 22 (a 47% acceptance rate) for presentation in the workshop. Papers were reviewed by three reviewers on average. The number of submissions is more than twice that of the previous workshop in Beijing, which had a higher acceptance rate (65%). Ten papers will be presented orally and twelve as part of a poster session; the presentation mode is independent of the ranking of the papers. The papers cover a diverse set of topics, from Maltese and Arabic dialect processing to models of semantic similarity and credibility analysis, advances in Arabic treebanking, and error annotation for dyslexic texts. The quantity and quality of the contributions to the workshop are strong indicators that there is a continued need for this kind of dedicated Arabic NLP workshop.

We would like to acknowledge all the hard work of the submitting authors and thank the reviewers for the valuable feedback they provided. We hope these proceedings will serve as a valuable reference for researchers and practitioners in the field of Arabic NLP and NLP in general.

Nizar Habash, General Chair, on behalf of the organizers of the workshop.
Abstract

This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated on social media by bilingual/multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods such as HMM and n-gram classification tagging. We also experiment with a lexicon-based method. Combining all the methods in a fall-back mechanism and introducing some linguistic rules to deal with unseen tokens and ambiguous words gives an overall accuracy of 93.14%. Finally, we introduce rules for language identification from sequences of recognised words.
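The abstract describes a word-level fall-back strategy: consult a lexicon first, then back off to a statistical model, with rules for non-linguistic or ambiguous tokens. The sketch below is not the authors' system; it is a minimal illustration of such a fall-back pipeline, assuming a character n-gram scorer as the statistical back-off. The class NgramScorer, the function tag_tokens, and the tiny lexicons and training words are all hypothetical placeholders.

```python
# Minimal sketch of a fall-back word-level language identifier (illustrative
# only, not the system from the paper): rule -> lexicon lookup -> character
# n-gram back-off. All lexicons, labels, and training words are toy examples.
from collections import Counter
import math


def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]


class NgramScorer:
    """Per-language character n-gram model with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # language -> Counter of n-grams
        self.totals = {}   # language -> total n-gram count

    def train(self, language, words):
        counter = self.counts.setdefault(language, Counter())
        for word in words:
            counter.update(char_ngrams(word, self.n))
        self.totals[language] = sum(counter.values())

    def score(self, language, word):
        counter = self.counts[language]
        total = self.totals[language]
        vocab = len(counter) + 1
        return sum(
            math.log((counter[gram] + 1) / (total + vocab))
            for gram in char_ngrams(word, self.n)
        )

    def best(self, word):
        return max(self.counts, key=lambda lang: self.score(lang, word))


def tag_tokens(tokens, lexicons, scorer):
    """Fall-back tagging: simple rule, then unambiguous lexicon hit,
    then n-gram back-off for unseen or ambiguous words."""
    tags = []
    for token in tokens:
        if not any(c.isalpha() for c in token):
            tags.append("other")              # digits, punctuation, emoticons
            continue
        hits = [lang for lang, lex in lexicons.items() if token.lower() in lex]
        if len(hits) == 1:
            tags.append(hits[0])              # unambiguous lexicon match
        else:
            tags.append(scorer.best(token))   # unseen or ambiguous: back off
    return tags


if __name__ == "__main__":
    # Toy lexicons and training data; a real system would use large
    # monolingual word lists and corpora for each language involved.
    lexicons = {"fr": {"bonjour", "merci"}, "arq": {"wesh", "rak"}}
    scorer = NgramScorer(n=3)
    scorer.train("fr", ["bonjour", "merci", "comment", "va"])
    scorer.train("arq", ["wesh", "rak", "labas", "hamdoulah"])
    print(tag_tokens("wesh rak bonjour 123".split(), lexicons, scorer))
```

In this toy run, "wesh" and "rak" are tagged by the lexicon, "123" by the non-alphabetic rule, and any word absent from both lexicons would fall through to the character n-gram scorer, mirroring the fall-back idea (though not the exact components or rules) described in the abstract.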