Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories 2020
DOI: 10.18653/v1/2020.tlt-1.3

Meta-dating the PArsed Corpus of Tibetan (PACTib)

Abstract: This paper presents PACTib, the PArsed Corpus of Tibetan. This new resource is unique in bringing together a large number of Tibetan texts (>5000) from the 11th century until the present day. The texts in this diachronic corpus are provided with metadata containing information on dates and patron-/authorship and linguistic annotation in the form of tokenisation, sentence segmentation, part-of-speech tags and syntactic phrase structure. With over 166 million tokens across 11 centuries and a variety of genres, P…

Cited by 2 publications (3 citation statements)
References 3 publications
“…The concrete outputs, both the open-source software and the new version of ACTib, will thus be of great use to linguists and scholars of Tibetan studies. These resources are now of such a high level of accuracy that it is worthwhile to extend them with relevant metadata and phrase-structure to create a historical treebank (see Meelen and Roux 2020). In future work, we aim to create an even better version of the ACTib by improving the neural-network model for POS tagging in particular by optimising the hyperparameters and feature engineering and implementing a small number of highly complex rule-based corrections that were beyond the scope of the present article.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…On the Tibetan side, tokenisation was converted to a syllable-tagging and recombination task with the ACTib scripts developed by Meelen et al (2021). As for sentence segmentation, we could use the technique developed by Meelen and Roux (2020) and optimised by Faggionato, Hill, and Meelen (2022) to create sentence boundaries in Tibetan, which is good, but not 100% accurate. Existing automatic aligners rely on sentence boundaries, so accuracy is of crucial importance.…”
Section: Tokenisation
Citation type: mentioning
confidence: 99%
“…For Tibetan, we used the sūtra translations in the Kangyur (the electronic Derge version of the eKangyur collection), as well as electronic versions of commentarial and other texts in the entire eTengyur to create a corpus that is large enough to create word embeddings. The eKangyur consists of around 27m tokens and the eTengyur consists of around 58m tokens (see Meelen & Roux, 2020); these together represent 31k unique tokens.…”
Section: Developing Embeddings
Citation type: mentioning
confidence: 99%