Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Li, Fenfang; Lv, Hui; Duola; Yong, Binbin; Zhou, Qingguo

doi:10.1145/3527663

Cited by 2 publications

(1 citation statement)

References 12 publications

“…One example of this is the Tibetan punctuation mark "།" (ཤད shad), which can be used after a word, phrase, or sentence. As a result, it can be unclear whether a shad is meant to indicate the end of a sentence or not (Li et al, 2022).…”

Section: Sentence Boundaries and Punctuation (mentioning)

confidence: 99%

Thai sentence segmentation using large language models

Panitsrisit

Thai sentence segmentation has been a topic of interest in the Thai NLP community. However, little of the literature has explored the use of transformer-based large language models for the task. We conduct three experiments on the LST20 corpus: (1) fine-tuning WangchanBERTa, a large language model pre-trained on Thai, across different classification tasks, (2) joint learning for clause and sentence segmentation, and (3) cross-lingual transfer using the multilingual model XLM-RoBERTa. Our findings show that WangchanBERTa outperforms other models in Thai sentence segmentation, and fine-tuning it with token and contextual information further improves its performance. However, cross-lingual transfer from English and Chinese to Thai is not effective for this task.

Improved Tibetan Word Vectors Models Based on Position Information Fusion

Lv, Yang, et al. 2024

ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Tibetan language processing is crucial for preserving its rich cultural heritage and reducing communication barriers between different languages. However, as a low-resource language, the development of Tibetan natural language processing has lagged behind. To address the unique and complex structural information of Tibetan, this paper improves the embedding model based on fundamental Tibetan Component-and-Character-and-Word-based Embedding (TCCWE) to enhance the effectiveness of word vector representation. We incorporate position information into the training of Tibetan word vectors, developing models based on components, characters, and their integration. Furthermore, to evaluate the effectiveness of these word vectors, we propose an intrinsic evaluation set, wordsimT, based on K-means clustering. Experimental results demonstrate that the character-based positional vector integration model achieves a Spearman's rank correlation coefficient of 79.99% on the wordsimT benchmark, outperforming the baseline TCCWE model by 1.51%. Additionally, we validate the proposed models in downstream text classification tasks. These findings underscore the importance of incorporating positional information in Tibetan word vectors.
