This paper presents a new approach to two challenging NLP tasks in Classical Tibetan: word segmentation and Part-of-Speech (POS) tagging. We demonstrate how both problems can be approached in the same way, by generating a memory-based tagger that assigns 1) segmentation tags and 2) POS tags to a test corpus consisting of unsegmented lines of Tibetan characters. We propose a three-stage workflow and evaluate the results of both the segmentation and the POS tagging tasks. We argue that the Memory-Based Tagger (MBT) and the proposed workflow not only provide an adequate solution to these NLP challenges but are also highly efficient tools for building larger annotated corpora of Tibetan.
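The idea of treating segmentation as a tagging problem can be illustrated with a minimal sketch. The abstract does not specify the tag set, so the "B" (syllable begins a word) and "I" (syllable continues a word) labels, the function names, and the transliterated example syllables below are all assumptions made for illustration, not the paper's actual scheme.

```python
from typing import List, Tuple

# Hypothetical tag set (an assumption, not taken from the paper):
# "B" = syllable begins a word, "I" = syllable continues the current word.

def words_to_tags(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Turn a segmented line (words as lists of syllables) into
    (syllable, segmentation tag) pairs, i.e. tagger training data."""
    pairs = []
    for word in words:
        for i, syllable in enumerate(word):
            pairs.append((syllable, "B" if i == 0 else "I"))
    return pairs

def tags_to_words(pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild words from tagged syllables, as one would after a tagger
    has labelled an unsegmented line."""
    words: List[List[str]] = []
    for syllable, tag in pairs:
        if tag == "B" or not words:
            words.append([syllable])      # start a new word
        else:
            words[-1].append(syllable)    # extend the current word
    return words

# Illustrative transliterated syllables, grouped into two words.
example = [["bkra", "shis"], ["bde", "legs"]]
tagged = words_to_tags(example)
restored = tags_to_words(tagged)
```

Once a line is encoded this way, any sequence tagger can be trained to predict the segmentation tags, and the same machinery can then be reused to predict POS tags over the resulting words.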
This paper presents a full procedure for the development of a segmented, POS-tagged and chunk-parsed corpus of Old Tibetan. As an extremely low-resource language, Old Tibetan poses non-trivial problems at every step towards the development of a searchable treebank. We demonstrate, however, that a carefully developed, semi-supervised method of optimising and extending existing tools for Classical Tibetan, as well as creating specific ones for Old Tibetan, can address these issues. We thus also present the very first Tibetan Treebank in a variety of formats to facilitate research in the fields of NLP, historical linguistics and Tibetan Studies.
This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.
This paper investigates adjectival agreement in a group of Middle Welsh native prose texts and a sample of translations from around the end of the Middle Welsh period and the beginning of the Early Modern period. It presents a new methodology, employing tagged historical corpora allowing for consistent linguistic comparison. The adjectival agreement case study tests a hypothesis regarding position and function of adjectives in Middle Welsh, as well as specific semantic groups of adjectives, such as colours or related modifiers. The systematic analysis using an annotated corpus reveals that there are interesting differences between native and translated texts, as well as between individual texts. However, zooming in on our adjectival agreement case study, we conclude that these differences do not correspond to many of our hypotheses or assumptions about how certain texts group together. In particular, no clear split into native and translated texts emerged between the texts in our corpus. This paper thus shows interesting results for both (historical) linguists, especially those working on agreement, and scholars of medieval Celtic philology and translation texts.