In this study, we identify Russian “centers of excellence” and explore patterns of their collaboration with each other and with foreign partners. Highly cited papers serve as a proxy for “excellence,” and coauthored papers as a measure of collaborative effort. We find that research institutes (of the Russian Academy of Sciences as well as others) remain the key players, despite recent government initiatives to stimulate university science. The contribution of the commercial sector to high‐impact research is negligible. More than 90% of Russian highly cited papers involve international collaboration, and Russian institutions often do not play a dominant role in them. Partnership with U.S., German, U.K., and French scientists markedly increases the probability that a Russian paper becomes highly cited. Patterns of national (“intranational”) collaboration in world‐class research differ significantly across organization types; the strongest ties are between three nuclear/particle physics centers. Finally, we draw a coauthorship map to visualize collaboration between Russian centers of excellence.
This paper investigates how different features influence the translation quality of a Russian-English neural machine translation system. All the trained translation models are based on the OpenNMT-py system and share the state-of-the-art Transformer architecture. The majority of the models use the Yandex English-Russian parallel corpus as training data. The BLEU score on the test data of the WMT18 news translation task is used as the main measure of performance. In total, five different features are tested: tokenization, lowercasing, the use of BPE (byte-pair encoding), the source of the BPE vocabulary, and the training corpus. The study shows that tokenization and BPE give a considerable advantage, while lowercasing has a negligible impact. As for the BPE vocabulary source, learning BPE on larger monolingual corpora such as News Crawl, as opposed to the training corpus itself, may provide a further advantage. The thematic correspondence of the training and test data proved crucial: the relatively high scores of most models may be attributed to the fact that both the Yandex parallel corpus and the WMT18 test set consist largely of news texts. At the same time, the models trained on the OpenSubtitles parallel corpus score substantially lower on the WMT18 test set, but comparably to the other models on a held-out subset of the OpenSubtitles corpus not used in training. Expert evaluation of the two highest-scoring models showed that neither outperforms current Google Translate. The paper also provides an error classification, the most common errors being mistranslation of proper names and polysemous words.
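To make the BPE feature concrete: byte-pair encoding learns a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training corpus. The following is a minimal, self-contained sketch of that merge-learning loop (an illustration of the general algorithm, not the tooling used in the paper, which would typically be subword-nmt or a similar library; the corpus and merge count are hypothetical):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus vocabulary,
    # weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    old, new = f"{a} {b}", a + b
    return {word.replace(old, new): freq for word, freq in vocab.items()}

def learn_bpe(corpus_words, num_merges):
    # Each word starts as space-separated characters plus an end-of-word marker.
    vocab = Counter(" ".join(list(w) + ["</w>"]) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

Applying the learned merge list to new text then segments rare words into known subword units, which is why BPE helps on morphologically rich languages such as Russian.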