We present MICHAEL, a lightweight method developed for the MADAR shared task on travel-domain Dialect Identification (DID). It uses character-level features and performs classification without any pre-processing. Character N-grams extracted from the original sentences are used to train a Multinomial Naive Bayes classifier. MICHAEL achieved an official score (accuracy) of 53.25% with 1 ≤ N ≤ 3, but reached a much better result of 62.17% with character 4-grams.
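The pipeline described above (raw sentences, character N-grams, Multinomial Naive Bayes) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `CharNgramNB`, the Laplace smoothing value `alpha=1.0`, and the two-class toy strings are all assumptions made for illustration, and the toy data is synthetic rather than drawn from the MADAR corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the raw text for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class CharNgramNB:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        self.n_min, self.n_max = n_min, n_max
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            # No pre-processing: n-grams come from the sentence as given.
            grams = char_ngrams(sentence, self.n_min, self.n_max)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {label: sum(self.ngram_counts[label].values())
                       for label in self.class_counts}
        return self

    def predict(self, sentence):
        # N-grams never seen in training are ignored.
        grams = [g for g in char_ngrams(sentence, self.n_min, self.n_max)
                 if g in self.vocab]
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for g in grams:
                score += math.log(
                    (self.ngram_counts[label][g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic toy data standing in for dialect-labelled sentences.
train_sentences = ["aaa aab aba", "abab aaba", "zzy zyz yzz", "zyzy yzzy"]
train_labels = ["A", "A", "B", "B"]
model = CharNgramNB(n_min=1, n_max=3).fit(train_sentences, train_labels)
print(model.predict("aab ab"))
print(model.predict("zzy y"))
```

Changing `n_min` and `n_max` selects the N-gram range, for example `CharNgramNB(n_min=4, n_max=4)` for 4-grams only; whether the reported 4-gram result used 4-grams alone or a range up to 4 is not stated in the text above.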
In this paper, we present a method for the grammatical and semantic disambiguation of the particle (token) "ḥattā" in Arabic. The method is based on a thorough analysis of the context in which the token occurs. Our goal is to extract as much linguistic information about this token as possible from a corpus, in order to model it as a grammar or a set of rules. To do this, we first built a corpus that contains the different contexts of the token "ḥattā". Second, from this corpus, we identified the linguistic criteria that allow the token to be identified correctly. Finally, we codified this information as linguistic rules so that the token can be detected easily by machine.