Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for downstream tasks corpora building. This study implemented the Tibetan SBD at the syllable level to avoid word segmentation (WS) errors affecting the accuracy of SBD. Specifically, the attention mechanism is introduced based on a recurrent neural network (RNN) to study Tibetan SBD. The primary objective is to determine, using a trained model, whether the shad contained in Tibetan text is the ending of the sentence. Implement experiments on syllable embedding and component embedding to measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40 %, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method can achieve better results than the established rule-based and statistical methods without considering various syntactic and part-of-speech (POS) tagging rules. German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus are validated to prove the models’ reliability and generalizability. The results demonstrate that this method is efficient not only for low-resource languages but also for high-resource languages. More importantly, we can formally apply the experimental results of this study to the research of downstream tasks, such as machine translation and automatic summarization.
Sentence Boundary Disambiguation (SBD) is a preprocessing step for natural language processing.Segmenting text into sentences is essential for Deep Learning (DL) and pretraining language models. Tibetan punctuation marks may involve ambiguity about the sentences' beginnings and endings. Hence, the ambiguous punctuation marks must be distinguished, and the sentence structure must be correctly encoded in language models. This study proposed a component-level Tibetan SBD approach based on the DL model. The models can reduce the error amplification caused by word segmentation and part-of-speech tagging. Although most SBD methods have only considered text on the left side of punctuation marks, this study considers the text on both sides. In this study, 465 669 Tibetan sentences are adopted, and a Bidirectional Long Short-Term Memory (Bi-LSTM) model is used to perform SBD. The experimental results show that the F1-score of the Bi-LSTM model reached 96%, the most efficient among the six models. Experiments are performed on low-resource languages such as Turkish and Romanian, and high-resource languages such as English and German, to verify the models' generalization.
Tibetan word segmentation and POS tagging are the primary tasks of Tibetan natural language processing. Most of existing methods of Tibetan word segmentation and POS tagging are based on rules and statistics, which need manual construction of features. In addition, the joint mode has shown stronger capabilities for word segmentation and POS tagging, and have received great interests. In this paper, we propose Bi-LSTM+IDCNN+CRF structures, a simple yet effective end-to-end neural network model, for joint Tibetan word segmentation and POS tagging. We conduct step-by-step and joint experiments on the Tibetan datasets. The results demonstrate that the performance of the Bi-LSTM+IDCNN+CRF model is the best regardless of the step-by-step or joint mode. We obtain state-of-the-art performance in the joint tagging mode. The F1 score of the word segmentation task reached 92.31%, and the F1 score of the POS tagging task reached 81.26%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.