The aim of this paper is to present the first Amazighe POS tagger. Very few linguistic resources have been developed so far for Amazighe and we believe that the development of a POS tagger tool is the first step needed for automatic text processing. The used data have been manually collected and annotated. We have used state-of-art supervised machine learning approaches to build our POS-tagging models. The obtained accuracy achieved 92.58% and we have used the 10-fold technique to further validate our results.
Amazigh is used by tens of millions of people mainly for oral communication. However, and like all the newly investigated languages in natural language processing, it is resource-scarce. The main aim of this paper is to present our POS taggers results based on two state of the art sequence labeling techniques, namely Conditional Random Fields and Support Vector Machines, by making use of a small manually annotated corpus of only 20k tokens. Since creating labeled data is very timeconsuming task while obtaining unlabeled data is less so, we have decided to gather a set of unlabeled data of Amazigh language that we have preprocessed and tokenized. The paper is also meant to address using semi-supervised techniques to improve POS tagging accuracy. An adapted self training algorithm, combining confidence measure with a function of Out Of Vocabulary words to select data for self training, has been used. Using this language independent method, we have managed to obtain encouraging results.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.