Chinese-Uyghur Bilingual Lexicon Induction Based on Morpheme Sequence and Weak Supervision

Aysa, Anwar; Ablimit, Mijit; Yilahun, Hankiz; Hamdulla, Askar

doi:10.1109/prml56267.2022.9882227

Cited by 1 publication

(1 citation statement)

References 0 publications


“…By integrating cross-lingual representations with pre-trained word embeddings in a fully unsupervised initialization on UBLI, the proposed method outperforms existing state-of-the-art methods on low-resource language pairs. Addressing the poor alignment of Chinese-Uyghur cross-language word embeddings due to significant morphological differences, Aysa et al. [13] proposed a multilingual morphological analyzer based on a morpheme sequence combined with neural network cross-language word embedding vector mapping, and used it for Chinese-Uyghur bilingual dictionary extraction. They used robust morpheme segmentation and stemming of bilingual text data to obtain excellent and meaningful word semantic features.…”

Section: Bilingual Lexicon Induction (mentioning)

confidence: 99%

Neural Network-Based Bilingual Lexicon Induction for Indonesian Ethnic Languages

2023


Indonesia has a variety of ethnic languages, most of which belong to the same language family: the Austronesian languages. Because of this shared family, words in Indonesian ethnic languages are very similar. However, previous research suggests that these ethnic languages are endangered. To help prevent their loss, we propose creating bilingual dictionaries between ethnic languages, using a neural network approach to extract transformation rules, with character-level embedding and a Bi-LSTM in a sequence-to-sequence model. The model has an encoder and a decoder. The encoder reads the input sequence character by character and condenses it into a context summary. The decoder produces the output sequence one character per timestep, with each output character conditioned on the previous one. The first experiment focuses on the Indonesian and Minangkabau languages with 10,277 word pairs. To evaluate the model's performance, five-fold cross-validation was used. The character-level seq2seq method (Bi-LSTM as the encoder and LSTM as the decoder), with an average precision of 83.92%, outperformed the SentencePiece byte pair encoding (vocab size of 33), with an average precision of 79.56%. Furthermore, to evaluate how well the neural network model finds transformation patterns, a rule-based approach was used as the baseline. The neural network approach obtained 542 more correct translations than the baseline. We applied the best setting (character-level embedding with Bi-LSTM as the encoder and LSTM as the decoder) to four other Indonesian ethnic languages: Malay, Palembang, Javanese, and Sundanese, whose input dictionaries are half the size. The average precision scores for these languages are 65.08%, 62.52%, 59.69%, and 58.46%, respectively. This shows that the neural network approach identifies transformation patterns from Indonesian to closely related languages (such as Malay and Palembang) better than to more distantly related languages (such as Javanese and Sundanese).
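The evaluation protocol in the abstract (five-fold cross-validation, scored by average precision over held-out word pairs) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `k_fold_splits`, `train_final_char_rule`, `precision`, and `cross_validate` are hypothetical helpers, the ten word pairs are toy examples rather than the paper's data, and the one-rule "model" is a trivial stand-in for the seq2seq and rule-based systems the abstract describes.

```python
import random
from collections import Counter, defaultdict


def k_fold_splits(pairs, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled pairs into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def train_final_char_rule(train_pairs):
    """Toy stand-in model: learn the most frequent final-character rewrite."""
    counts = defaultdict(Counter)
    for src, tgt in train_pairs:
        counts[src[-1]][tgt[-1]] += 1

    def predict(word):
        seen = counts.get(word[-1])
        if not seen:
            return word  # no rule learned for this ending: copy the word
        return word[:-1] + seen.most_common(1)[0][0]

    return predict


def precision(predict, test_pairs):
    """Fraction of held-out source words translated exactly right."""
    correct = sum(predict(src) == tgt for src, tgt in test_pairs)
    return correct / len(test_pairs)


def cross_validate(pairs, k=5):
    """Average the per-fold precision over all k folds."""
    scores = [
        precision(train_final_char_rule(train), test)
        for train, test in k_fold_splits(pairs, k)
    ]
    return sum(scores) / len(scores)


# Toy Indonesian -> Minangkabau pairs, chosen for illustration only.
TOY_PAIRS = [
    ("dua", "duo"), ("tiga", "tigo"), ("lima", "limo"), ("mata", "mato"),
    ("kuda", "kudo"), ("apa", "apo"), ("siapa", "siapo"), ("kita", "kito"),
    ("nama", "namo"), ("kota", "koto"),
]

average_precision = cross_validate(TOY_PAIRS)
print(f"average precision over 5 folds: {average_precision:.2%}")
```

Because every fold is scored only on pairs the model did not see during training, the averaged figure reflects how well a learned transformation pattern generalizes, which is the sense in which the abstract compares the neural and rule-based approaches.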

