A bilingual corpus is vital for natural language processing tasks, especially machine translation. The larger and higher-quality the corpus, the better the resulting machine translation. There are two common approaches to building a bilingual corpus. The first is to build one automatically from resources available on the internet, typically bilingual websites. The second is to construct one manually. Automated construction methods are used more frequently because they are less expensive and because a growing number of bilingual websites can be exploited. In this paper, we apply automated collection methods to a bilingual website to create a Chinese-Vietnamese bilingual corpus. Specifically, we collect the data from the website of a multilingual dictionary (https://glosbe.com). The Chinese-Vietnamese corpus collected from this website contains more than 400k sentence pairs. We chose 100,000 sentence pairs from this corpus for machine translation experiments. From the corpus, we built five datasets of 20k, 40k, 60k, 80k, and 100k sentence pairs, respectively. We also built five additional datasets by applying word segmentation to the sentences of the original datasets. The experimental results showed that: (1) the quality of the corpus is relatively good, with a highest BLEU score of 19.8, although some issues remain to be addressed in future work; (2) the larger the corpus, the higher the machine translation quality; and (3) the untokenized datasets train better translation models than the tokenized datasets.
INDEX TERMS construction of a bilingual corpus; Chinese-Vietnamese machine translation; dictionary websites; Glosbe.
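The step from a pool of sentence pairs to five datasets of increasing size can be sketched as follows. This is a minimal illustration only: the abstract does not say how the pairs were selected or whether the smaller datasets are contained in the larger ones, so the shuffle, the nesting, and the names `build_subsets`, `pairs`, and `seed` are all assumptions made for the example.

```python
import random

def build_subsets(pairs, sizes, seed=0):
    """Return {size: list of pairs}; each subset is a prefix of one shuffled copy.

    Assumption: subsets are nested (the 20k set is contained in the 40k set,
    and so on). The source abstract does not state this.
    """
    if max(sizes) > len(pairs):
        raise ValueError("largest subset exceeds corpus size")
    shuffled = list(pairs)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)       # fixed seed for reproducibility
    return {n: shuffled[:n] for n in sorted(sizes)}

# Toy stand-in for (Chinese, Vietnamese) sentence pairs.
toy_pairs = [(f"zh-{i}", f"vi-{i}") for i in range(10)]
subsets = build_subsets(toy_pairs, sizes=[2, 4, 6, 8, 10])
```

With the real data, `sizes` would be `[20_000, 40_000, 60_000, 80_000, 100_000]`. Nesting the subsets is one way to make a size comparison reflect corpus size rather than a change of sample.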
Recommender systems face the challenge of providing accurate recommendations that meet the diverse preferences of users. The main information sources for these systems are the utility matrix and textual sources, such as item descriptions, users' reviews, and users' profiles. Incorporating diverse sources of information is a reasonable approach to improving recommendation accuracy. However, most studies rely primarily on the utility matrix, and when they do use textual sources they do not integrate them with the utility matrix. The reason is the risk that combined information introduces noise and reduces the effectiveness of the good sources. To overcome this challenge, we propose a novel method based on the Transformer model, a deep learning model that efficiently integrates textual and utility matrix information. The study suggests feature extraction techniques suited to each information source and an effective method for integrating them in the Transformer model. The experimental results indicate that the proposed model significantly improves recommendation accuracy over the baseline model (MLP), reducing Mean Absolute Error (MAE) by 10.79% to 31.03% on the Amazon sub-datasets. Compared to SVD, which is known as one of the most efficient models for recommender systems, the proposed model reduces MAE by 34.82% to 56.17% on the Amazon sub-datasets. Our proposed model also outperforms the graph-based model, with an increase of up to 108% in Precision, a decrease of up to 65.37% in MAE, and a decrease of up to 59.24% in RMSE. Additionally, experimental results on the Movielens and Amazon datasets demonstrate that our proposed model, which combines information from the utility matrix and textual sources, yields better results than using the utility matrix alone.
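The percentage figures above are relative changes in error metrics. As a minimal sketch of how such numbers are typically computed, the following uses the standard definitions of MAE and RMSE. The function names and the toy ratings are hypothetical, and the abstract does not state the exact evaluation protocol used.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Toy ratings (illustrative values only, not taken from the paper).
ratings = [5.0, 3.0, 4.0, 1.0]
baseline_pred = [4.0, 4.0, 3.0, 2.0]   # hypothetical baseline predictions
proposed_pred = [4.5, 3.5, 4.0, 1.0]   # hypothetical proposed-model predictions

drop = relative_reduction(mae(ratings, baseline_pred), mae(ratings, proposed_pred))
```

Because lower is better for MAE and RMSE, a positive `relative_reduction` means the proposed model improves on the baseline. For Precision, where higher is better, the analogous quantity is a relative increase.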