Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at developing technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.

The aim of this research is to develop technologies for automatic processing of Ainu, a language isolate native to the northern parts of Japan that is currently recognized as nearly extinct (e.g., by Lewis et al. [13]). In particular, we aimed at improving the part-of-speech tagger for the Ainu language (POST-AL), a tool for computer-supported linguistic analysis of Ainu, initially developed by Ptaszynski and Momouchi [14].

The task of developing NLP tools for Ainu poses several challenges. Firstly, large-scale digital language resources required for many NLP tasks (such as annotated corpora) are not available for the Ainu language. In this paper we describe our attempt to solve this problem by merging two different digitized dictionaries into one data set. Secondly, there exists no single standard for transcription and word segmentation of the Ainu language, especially in texts collected in earlier years. To address that problem, POST-AL has been equipped with functions for transcription normalization and word segmentation. In this paper we describe the proposed methodology in detail, including recent improvements. Another functionality of POST-AL is part-of-speech (POS) tagging.
To improve tagging accuracy, we developed a hybrid method of POS disambiguation, combining lexical n-grams and term frequency. The results of the evaluation experiments presented in this paper show that authors of different dictionaries and text annotations differ in how they classify certain forms by part of speech, which creates yet another challenge to be tackled in the future.

The remainder of this paper is organized as follows. In Section 2 we briefly describe the characteristics and the current status of the Ainu language. In Section 3 we provide an overview of some of the previous studies on the Ainu language, including the few existing research projects in the field of natural language processing. Section 4 presents our algorithms for normalization, word segmentation, and part-of-speech tagging. In Sections 5 and 6 we introduce the training data (dictionaries) and test data used in this research. Section 7 summarizes the evaluation methods we applied. In Section 8 we present the results of the evaluation experiments. Finally, Section 9 contains conclusions and some ideas for future improvements.
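
As a rough illustration of the hybrid disambiguation idea mentioned above (lexical n-grams combined with term frequency), the following minimal Python sketch may be helpful. It is not the POST-AL implementation: the lexicon layout, the function names, the `max_n` parameter, and the rule of trying the longest n-gram first and falling back to term frequency are all assumptions made for illustration, and the tokens are placeholders rather than real Ainu data.

```python
from collections import Counter

# Hypothetical toy lexicon (placeholder tokens, not real Ainu data):
# each ambiguous form maps candidate POS tags to tokenized usage examples.
LEXICON = {
    "x": {
        "noun": [["a", "x", "b"], ["c", "x"]],
        "verb": [["d", "x", "e"]],
    },
}


def ngrams_containing(tokens, index, n):
    """Yield every n-gram of `tokens` that covers position `index`."""
    first = max(0, index - n + 1)
    last = min(index, len(tokens) - n)
    for start in range(first, last + 1):
        yield tuple(tokens[start:start + n])


def example_ngrams(examples, n):
    """Collect all n-grams found in a list of tokenized usage examples."""
    found = set()
    for example in examples:
        for i in range(len(example) - n + 1):
            found.add(tuple(example[i:i + n]))
    return found


def term_frequency(form, examples):
    """Count how often `form` occurs in the usage examples of one tag."""
    return sum(example.count(form) for example in examples)


def disambiguate(tokens, index, lexicon=LEXICON, max_n=3):
    """Pick a POS tag for tokens[index]; return None for unknown forms."""
    form = tokens[index]
    candidates = lexicon.get(form)
    if not candidates:
        return None
    # Step 1 (assumed): prefer the tag whose examples share the longest
    # n-gram with the input context, from max_n down to bigrams.
    for n in range(max_n, 1, -1):
        votes = Counter()
        for tag, examples in candidates.items():
            known = example_ngrams(examples, n)
            votes[tag] = sum(
                1 for gram in ngrams_containing(tokens, index, n) if gram in known
            )
        ranked = votes.most_common(2)
        top_is_unique = len(ranked) == 1 or ranked[0][1] > ranked[1][1]
        if ranked[0][1] > 0 and top_is_unique:
            return ranked[0][0]
    # Step 2 (assumed): no decisive n-gram match, so fall back to the tag
    # under which the form appears most often in the examples.
    return max(candidates, key=lambda tag: term_frequency(form, candidates[tag]))
```

In this sketch, a context that matches a stored usage example decides the tag, while an unseen context is resolved by frequency alone; how the actual method weighs the two sources of evidence is described in Section 4.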