Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora

Gerstenberger, Ciprian; Partanen, Niko; Rießler, Michael; Wilbur, Joshua Karl

doi:10.18653/v1/w17-0604

Cited by 6 publications

(5 citation statements)

References 4 publications


“…Meanwhile, increasingly more focus is dedicated on NLP research and bringing modern technologies to endangered languages. For example, mobile applications have been developed for data collection (Bird et al, 2014; and are actively used in documentation projects ; automatic speech recognition models have been created to aid with automatic phonetic or orthographic transcriptions focusing in indigenous Australian or tonal languages from China and the Americas (Michaud et al, 2018); machine translation for under-represented languages have been presented as new corpora have been collected (Abbott and Martinus, 2018; Abate et al, 2018); cross-lingual transfer has been successfully applied for tagging, morphological analysis and inflection (McCarthy et al, 2019; Anastasopoulos and Neubig, 2019); multitask and active learning are being used for learning from continuous annotations on multiple tasks (Gerstenberger et al, 2017; Anastasopoulos et al, 2018; Chaudhary et al, 2019); approaches dedicated to indigenous polysynthetic languages have been developed (Kann et al, 2018); and computational methods have been used to study or discover typological features from large collections of text (Asgari and Schütze, 2017; Malaviya et al, 2017).…”

Section: Description (mentioning)

confidence: 99%

Endangered Languages meet Modern NLP

Anastasopoulos, Cox, Neubig, et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts



“…Bird and Chiang [7] discussed the potential role of machine translation in language documentation. Blokland et al [8,9] and Gerstenberger et al [10,11] proposed the application of proven natural language processing approaches as a method to facilitate language documentation efforts, in particular to automate the process of corpus annotation and to support the integration of legacy linguistic materials in contemporary documentation projects. The "Digital Language Survival Kit" [12], published as a part of the Digital Language Diversity Project, lists some of the basic resources and technologies (such as spell checkers, part-of-speech taggers, and speech synthesis and recognition tools) necessary to improve the digital vitality of minority languages.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Basic Natural Language Processing Tools for the Ainu Language

et al. 2019


Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.

The aim of this research is to develop technologies for automatic processing of Ainu, a language isolate that is native to northern parts of Japan and is currently recognized as nearly extinct (e.g., by Lewis et al. [13]). In particular, we aimed at improving the part-of-speech tagger for the Ainu language (POST-AL), a tool for computer-supported linguistic analysis of the Ainu language, initially developed by Ptaszynski and Momouchi [14].

The task of developing NLP tools for Ainu poses several challenges. Firstly, large-scale digital language resources required for many NLP tasks (such as annotated corpora) are not available for the Ainu language. In this paper we describe our attempt to solve this problem by merging two different digitized dictionaries into one data set. Secondly, there exists no single standard for transcription and word segmentation of the Ainu language, especially in texts collected in earlier years. To address that problem, POST-AL has been equipped with the functions of transcription normalization and word segmentation. In this paper we describe in detail the proposed methodology, including recent improvements. Another functionality of POST-AL is part-of-speech (POS) tagging. To improve tagging accuracy, we developed a hybrid method of POS disambiguation, combining lexical n-grams and term frequency. The results of evaluation experiments presented in this paper show that there are differences in part-of-speech classification of certain forms between authors of different dictionaries and text annotations, which creates yet another challenge, to be tackled in the future.

The remainder of this paper is organized as follows. In Section 2 we briefly describe the characteristics and the current status of the Ainu language. In Section 3 we provide an overview of some of the previous studies on the Ainu language, including the few existing research projects in the field of natural language processing. Section 4 presents our algorithms for normalization, word segmentation and part-of-speech tagging. In Sections 5 and 6 we introduce the training data (dictionaries) and test data used in this research. Section 7 summarizes the evaluation methods we applied. In Section 8 we present the results of the evaluation experiments. Finally, Section 9 contains conclusions and some ideas for future improvements.


“…Using the combination of the electronic lexicography resources and language technology tools mentioned above, corpus creation for this syntax project is completed as automatically as possible. This idea is presented in detail in Gerstenberger et al (2017), but a brief overview is provided here. The corpus consists of Pite Saami texts (in both spoken and written mode) transcribed in current orthographic standard and collected in the ELAN format³ and following the common structure stipulated by projects carried out by the Freiburg Research Group in Saami Studies.⁴ A python script runs each token through the FST processor, and then automatically creates annotations for lemma, morphological categories and part of speech based on this.…”
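The token-by-token workflow described in this statement could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used in the cited project: the `TOY_ANALYSES` dictionary is a hypothetical stand-in for the FST processor, the plus-separated analysis format is assumed, the word forms are invented placeholders rather than Pite Saami data, and the XML is a simplified stand-in for the much richer ELAN file structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an FST lookup: surface form -> analysis string.
# Forms and tags are invented placeholders, not real Pite Saami data.
TOY_ANALYSES = {
    "foo": "foo+N+Sg+Nom",
    "foos": "foo+N+Pl+Nom",
    "bars": "bar+V+Ind+Prs+Sg3",
}

# Simplified transcript; a real ELAN (.eaf) file has tiers and time slots.
TRANSCRIPT = """<doc>
  <utterance id="u1">Foos bars</utterance>
  <utterance id="u2">foo baz</utterance>
</doc>"""


def analyse(token):
    """Look up one token and split the analysis into lemma, POS and tags."""
    analysis = TOY_ANALYSES.get(token.lower())
    if analysis is None:
        # Unknown tokens are kept so a human annotator can fix them later.
        return {"lemma": token, "pos": "UNKNOWN", "morph": ""}
    lemma, pos, *tags = analysis.split("+")
    return {"lemma": lemma, "pos": pos, "morph": ".".join(tags)}


def annotate(xml_text):
    """Create one annotation row per whitespace-separated token."""
    root = ET.fromstring(xml_text)
    rows = []
    for utt in root.iter("utterance"):
        for token in (utt.text or "").split():
            rows.append({"utterance": utt.get("id"), "token": token, **analyse(token)})
    return rows


if __name__ == "__main__":
    for row in annotate(TRANSCRIPT):
        print(row)
```

In this sketch, unknown tokens are flagged rather than dropped, on the assumption that automatically created annotations would still be reviewed by hand.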

Section: Introduction (mentioning)

confidence: 99%

Extracting inflectional class assignment in Pite Saami: Nouns, verbs and those pesky adjectives

Wilbur

2018

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Self Cite


The main goal of this paper is to describe to what extent the three main open word classes in Pite Saami (nouns, verbs and adjectives) can be automatically assigned to inflectional classes in language technology, specifically for a Finite State Transducer. For each of these word classes, the relevant structural features necessary for determining inflectional class membership are described. In this, a clear difference between the behavior of nouns and verbs, on the one hand, and that of adjectives, on the other hand, is ascertained. While morphophonology, as seen in the paradigmatic behavior of all three word classes, is complex and features a number of types of stem alternations, nouns and verbs are predictable, while adjectives are not. With this in mind, a basic algorithm for extracting inflectional class assignment for nouns and verbs is presented for use in a LEXC framework. In contrast to this, adjectives must be assigned to inflectional classes manually. The main TWOLC rules used to trigger morphophonological alternations are also outlined. The Pite Saami lexicographic database that forms the backbone for the LEXC stem files is managed using FileMaker Pro database software, and the workflow used to extract and update LEXC files from that database is described, focussing on the differences between nouns and verbs, and adjectives. In this, light is shed on how, on the one hand, nominal and verbal inflectional patterns are highly complex yet reliably systematic, while adjective morphophonology is complex and unpredictable.

Summary (translated from Estonian): The main goal of this article is to describe to what extent the three main open word classes (nouns, verbs and adjectives) in Pite Saami can be inflected automatically using a language-technology FST. The article describes the structural properties needed to determine the inflectional class of each word class and shows that adjectives clearly differ from nouns and verbs. While the paradigmatic behavior of all three word classes is characterized by a complex morphophonology involving many types of stem alternation, the inflection of nouns and verbs can be predicted, but that of adjectives cannot. Accordingly, the article describes […] The main TWOLC rules used to trigger morphophonological alternation are also presented. The LEXC stem files are based on a Pite Saami lexicographic database managed with FileMaker software; the article describes the workflow for extracting LEXC files from this database and updating them, focusing on the differences between nouns and verbs on the one hand and adjectives on the other. It is shown that the inflectional patterns of nouns and verbs are complex but highly systematic, whereas adjective morphophonology is complicated and hard to predict.

