Meta-dating the PArsed Corpus of Tibetan (PACTib)

Meelen, Marieke; Roux, Élie

doi:10.18653/v1/2020.tlt-1.3

Cited by 2 publications

(3 citation statements)

References 3 publications


“…The concrete outputs, both the open-source software and the new version of ACTib, will thus be of great use to linguists and scholars of Tibetan studies. These resources are now of such a high level of accuracy that it is worthwhile to extend them with relevant metadata and phrase-structure to create a historical treebank (see Meelen and Roux 2020). In future work, we aim to create an even better version of the ACTib by improving the neural-network model for POS tagging in particular by optimising the hyperparameters and feature engineering and implementing a small number of highly complex rule-based corrections that were beyond the scope of the present article.…”

Section: Discussion (mentioning)

confidence: 99%

Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

Meelen, Roux, Hill. 2021. ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Self Cite


This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline of 91.99% is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.


“…On the Tibetan side, tokenisation was converted to a syllable-tagging and recombination task with the ACTib scripts 6 developed by Meelen et al (2021). As for sentence segmentation, we could use the technique developed by Meelen and Roux (2020) and optimised by Faggionato, Hill, and Meelen (2022) to create sentence boundaries in Tibetan, which is good, but not 100% accurate. Existing automatic aligners rely on sentence boundaries, so accuracy is of crucial importance.…”

Section: Tokenisation (mentioning)

confidence: 99%

“…For Tibetan, we used the sūtra translations in the Kangyur (the electronic Derge version of the eKangyur collection), as well as electronic versions of commentarial and other texts in the entire eTengyur to create a corpus that is large enough to create word embeddings. The eKangyur consists of around 27 m tokens and the eTengyur consists of around 58m tokens (see Meelen & Roux, 2020); these together represent 31k unique tokens.…”

Section: Developing Embeddings (mentioning)

confidence: 99%

Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan

Felbur, Meelen, Vierthaler. 2022. Journal of Open Humanities Data

Self Cite


In this paper we present the first-ever procedure for identifying highly similar sequences of text in Chinese and Tibetan translations of Buddhist sūtra literature. We initially propose this procedure as an aid to scholars engaged in the philological study of Buddhist documents. We create a cross-lingual embedding space by taking the cosine similarity of average sequence vectors in order to produce unsupervised similar cross-linguistic parallel alignments at word, sentence, and even paragraph level. Initial results show that our method lays a solid foundation for the future development of a fully-fledged Information Retrieval tool for these (and potentially other) low-resource historical languages.
