The Research on Tibetan Text Classification Based on N-Gram Model

Deng, Zhou; He, Wen Huang; Wu, Tao

doi:10.4028/www.scientific.net/amm.543-547.1896

Cited by 1 publication

(1 citation statement)

References 6 publications

“…Given that different tokenization methods can affect both the training efficiency and the accuracy of the model, the following analysis provides a brief overview of the effects of these different tokenization methods. [N-gram] [10] [11] involves grouping N characters together and splitting a sentence into segments of N characters each. It is primarily used for calculating the probability of a sentence.…”

Section: Analysis Of Different Tokenizers (mentioning)

confidence: 99%
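
The N-gram description quoted above can be illustrated with a minimal sketch. It assumes the common sliding-window reading of "grouping N characters together" (overlapping segments, step 1) and an unsmoothed maximum-likelihood bigram estimate for the sentence probability. Neither choice is confirmed by the cited or citing paper, and the function names `char_ngrams` and `bigram_sentence_probability` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text (sliding window, step 1).

    "Character" here means a Python str element, i.e. a Unicode code point.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_sentence_probability(sentence, corpus):
    """Estimate the probability of a sentence with a character bigram model.

    Assumption: unsmoothed maximum-likelihood estimates, and the probability
    is conditional on the first character (no start-of-sentence term).
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for line in corpus:
        # Count each character that has a successor as a context.
        context_counts.update(line[:-1])
        bigram_counts.update(char_ngrams(line, 2))

    prob = 1.0
    for bigram in char_ngrams(sentence, 2):
        context = bigram[0]
        if context_counts[context] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[bigram] / context_counts[context]
    return prob
```

With the toy corpus `["abab", "abc"]`, the estimate for `"abc"` is P(b|a) · P(c|b) = 3/3 · 1/2. A practical model would add smoothing so that unseen N-grams do not force the whole product to zero.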

Machine translation of classical Chinese based on unigram segmentation transformer framework

Ju, Xin, 2024, ACE

Manual translation of ancient Chinese books is difficult and inefficient. Machine translation, an important field of natural language processing (NLP), is expected to solve this problem. With the rapid development of NLP technology, prior work on machine translation mainly follows the Transformer pipeline, whose self-attention mechanism extracts high-quality feature representations. The success of the Transformer has inspired the direction of our work on ancient text translation. In this paper, we select Unigram word segmentation after exploring and comparing tokenization methods, and propose a solution for translating ancient literary texts. Specifically, we evaluate with BLEU and achieve scores of 43.4 and 40.03 for short and long sentences, respectively. Compared with the results of Baidu Translation, our BLEU scores are higher by 8.12 and 5.18. Additionally, our translations are more faithful to the original text than those of Baidu Translation, demonstrating the potential and advantage of the model in bridging the gap between ancient and modern Chinese.
