Proceedings of the Third Workshop on Computational Linguistics For Uralic Languages 2017
DOI: 10.18653/v1/w17-0604

Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora

Abstract: The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out in Freiburg and in collaboration with Hamburg, Syktyvkar, Tromsø and Uppsala. Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases f…

Cited by 6 publications (5 citation statements)
References 4 publications
“…Meanwhile, increasingly more focus is dedicated on NLP research and bringing modern technologies to endangered languages. For example, mobile applications have been developed for data collection (Bird et al, 2014; and are actively used in documentation projects ; automatic speech recognition models have been created to aid with automatic phonetic or orthographic transcriptions focusing in indigenous Australian or tonal languages from China and the Americas (Michaud et al, 2018); machine translation for under-represented languages have been presented as new corpora have been collected (Abbott and Martinus, 2018; Abate et al, 2018); cross-lingual transfer has been successfully applied for tagging, morphological analysis and inflection (McCarthy et al, 2019; Anastasopoulos and Neubig, 2019); multitask and active learning are being used for learning from continuous annotations on multiple tasks (Gerstenberger et al, 2017; Anastasopoulos et al, 2018; Chaudhary et al, 2019); approaches dedicated to indigenous polysynthetic languages have been developed Kann et al, 2018); and computational methods have been used to study or discover typological features from large collections of text (Asgari and Schütze, 2017; Malaviya et al, 2017).…”
Section: Description
Citation type: mentioning
confidence: 99%
“…Bird and Chiang [7] discussed the potential role of machine translation in language documentation. Blokland et al [8,9] and Gerstenberger et al [10,11] proposed the application of proven natural language processing approaches as a method to facilitate language documentation efforts, in particular to automate the process of corpus annotation and to support the integration of legacy linguistic materials in contemporary documentation projects. The "Digital Language Survival Kit" [12], published as a part of the Digital Language Diversity Project, lists some of the basic resources and technologies (such as spell checkers, part-of-speech taggers, and speech synthesis and recognition tools) necessary to improve the digital vitality of minority languages.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Using the combination of the electronic lexicography resources and language technology tools mentioned above, corpus creation for this syntax project is completed as automatically as possible. This idea is presented in detail in Gerstenberger et al (2017), but a brief overview is provided here. The corpus consists of Pite Saami texts (in both spoken and written mode) transcribed in current orthographic standard and collected in the ELAN format and following the common structure stipulated by projects carried out by the Freiburg Research Group in Saami Studies. A Python script runs each token through the FST processor, and then automatically creates annotations for lemma, morphological categories and part of speech based on this.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
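
The annotation step described in the last statement (each token is run through an FST processor, and lemma, part-of-speech and morphological annotations are derived from the result) can be illustrated with a minimal Python sketch. This is not the script used by the cited projects. The analysis format `lemma+POS+Feature+...`, the `TOY_ANALYSES` table, the English word forms in it, and the names `toy_lookup`, `parse_analysis` and `annotate` are all assumptions made for illustration; a real setup would call an actual FST analyser in place of the dictionary lookup and write the results to ELAN tiers.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an FST lookup. The entries and the
# "lemma+POS+Feature+..." tag format are illustrative assumptions.
TOY_ANALYSES = {
    "dogs": ["dog+N+Pl"],
    "runs": ["run+V+Prs+Sg3", "run+N+Pl"],  # ambiguous form: two analyses
}


@dataclass
class Annotation:
    token: str
    lemma: str
    pos: str
    morph: str


def toy_lookup(token: str) -> List[str]:
    """Return all analysis strings for a token (empty list if unknown)."""
    return TOY_ANALYSES.get(token, [])


def parse_analysis(token: str, analysis: str) -> Annotation:
    """Split one analysis string into lemma, POS and morphological tags."""
    parts = analysis.split("+")
    lemma = parts[0]
    pos = parts[1] if len(parts) > 1 else ""
    morph = ".".join(parts[2:])
    return Annotation(token, lemma, pos, morph)


def annotate(tokens: List[str],
             lookup: Callable[[str], List[str]]) -> List[List[Annotation]]:
    """Produce one list of candidate annotations per token.

    Unknown tokens receive a single empty annotation so that the output
    stays aligned with the input token sequence.
    """
    result = []
    for token in tokens:
        analyses = lookup(token)
        if analyses:
            result.append([parse_analysis(token, a) for a in analyses])
        else:
            result.append([Annotation(token, "", "", "")])
    return result


if __name__ == "__main__":
    for candidates in annotate(["dogs", "runs", "xyz"], toy_lookup):
        for ann in candidates:
            print(ann)
```

Keeping every candidate analysis for an ambiguous token, rather than picking one, leaves disambiguation to a later step; whether the cited projects handle ambiguity this way is not stated in the text above.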