Recent neural machine translation (NMT) systems have been greatly improved by encoder-decoder models with attention mechanisms and sub-word units. However, important differences between languages with logographic and alphabetic writing systems have long been overlooked. This study focuses on these differences and uses a simple approach to improve the performance of NMT systems by utilizing decomposed sub-character level information for logographic languages. Our results indicate that our approach not only improves the translation capabilities of NMT systems between Chinese and English, but also further improves NMT systems between Chinese and Japanese, because it utilizes the shared information carried by similar sub-character units.
1. Taking the ASPEC corpus as an example, the average word lengths are roughly 1.5 characters (Chinese, tokenized by the Jieba tokenizer), 1.7 characters (Japanese, tokenized by the MeCab tokenizer), and 5.7 characters (English, tokenized by the Moses tokenizer). Therefore, when a sub-word model of similar vocabulary size is applied directly, English sub-words usually contain several letters, which is more effective in facilitating NMT, whereas Chinese and Japanese sub-words are largely just characters.
2. We facilitate the encoding or decoding process by using sub-character sequences on either the source or target side of the NMT system. This improves translation performance; if sub-character information is shared between the encoder and decoder, it further benefits the NMT system.
3. Specifically, Chinese ideograph data and Japanese stroke data are the best choices for the respective NMT tasks.
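As a rough sketch of the decomposition idea (not the paper's actual pipeline), the snippet below replaces each logographic character with a sub-character sequence from a lookup table, the kind of preprocessing step that would precede ordinary sub-word segmentation; the tiny IDEOGRAPH_MAP table and the space-delimited output are assumptions made for this illustration.

```python
# Minimal sketch: decompose logographic characters into sub-character units
# before sub-word (e.g., BPE) segmentation. The mapping table is a small,
# hypothetical stand-in for a full ideograph/stroke decomposition dictionary.

IDEOGRAPH_MAP = {
    "明": ["日", "月"],  # "bright" = sun + moon
    "好": ["女", "子"],  # "good"   = woman + child
}

def decompose(sentence, keep_unknown=True):
    """Replace each character by its sub-character sequence, if known."""
    units = []
    for ch in sentence:
        parts = IDEOGRAPH_MAP.get(ch)
        if parts:
            units.extend(parts)
        elif keep_unknown:
            units.append(ch)  # characters without an entry are kept as-is
    return " ".join(units)

print(decompose("明日好"))  # -> "日 月 日 女 子"
```

In practice the decomposed corpus would then be fed to a standard sub-word learner, so that units shared between Chinese and Japanese can end up in a joint vocabulary.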
Logographic and alphabetic languages (e.g., Chinese vs. English) use linguistically distinct writing systems. Languages that share a writing system usually share more information, which can be exploited in natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, thereby making better use of the information these characters share when training NMT systems, in both the encoding and decoding processes. Experiments show that the proposed method robustly improves NMT performance for both the “logographic” language pair (JA–ZH) and the “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also indicate that linguistic features can, in principle, be manipulated to obtain higher shared-token rates and further improve the performance of natural language processing systems.
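To make the notion of a shared-token rate concrete, here is a minimal sketch that measures vocabulary overlap between two tokenized corpora; the intersection-over-union definition is an assumption for illustration and may not match the exact measure used in the article.

```python
# Minimal sketch of a "shared token rate" between two vocabularies, taken here
# as |V_src ∩ V_tgt| / |V_src ∪ V_tgt| (an illustrative assumption).

def vocab(tokenized_sentences):
    return {tok for sent in tokenized_sentences for tok in sent}

def shared_token_rate(src_sents, tgt_sents):
    v_src, v_tgt = vocab(src_sents), vocab(tgt_sents)
    return len(v_src & v_tgt) / len(v_src | v_tgt)

# Toy example: decomposed Chinese and Japanese sides share many units
zh = [["日", "月", "女", "子"]]
ja = [["日", "月", "子"]]
print(shared_token_rate(zh, ja))  # 0.75
```

The intuition is that decomposing characters into sub-character units raises this overlap for JA–ZH, giving the encoder and decoder more material to share.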
Some English adjectives accept both synthetic and analytic comparative and superlative forms (e.g. thicker vs more thick, happiest vs most happy). More than 20 variables have been claimed to affect this choice (see Leech & Culpeper 1997; Lindquist 2000; Mondorf 2003, 2009). However, many studies consider one variable at a time without systematically controlling for other variables (i.e. they take a monofactorial approach), and very little research has been done on superlatives. Following Hilpert's (2008) multifactorial study, we investigate the simultaneous contribution of 17 variables to the comparative and superlative alternation and measure the strength of each predictor. On the whole, phonological predictors are much more important than syntactic and frequency-related predictors: the number of syllables and a final segment in <-y> consistently outrank the other predictors in both models. Important differences have also been identified. Many syntactic variables, such as predicative position and the presence of complements, are weak or non-significant in the comparative model but have stronger effects in the superlative model. Further, higher adjective frequency leads to a preference for the synthetic -er variant in comparatives but for the analytic most variant in superlatives. The study shows that generalizations about comparatives do not straightforwardly carry over to superlatives.
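As an illustration of what a multifactorial analysis looks like in code (the study itself is corpus-based and is not reproduced here), the sketch below fits a logistic regression that predicts the synthetic vs. analytic variant from several predictors at once; the predictor names and toy data points are invented for the example.

```python
# Minimal sketch (not the study's actual model): a multifactorial logistic
# regression predicting synthetic (-er) vs analytic (more) comparatives from
# several predictors simultaneously, rather than one variable at a time.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [n_syllables, ends_in_y (0/1), predicative (0/1), log_frequency]
X = np.array([
    [1, 0, 0, 4.2],   # e.g. "thick"       -> thicker
    [2, 1, 0, 3.9],   # e.g. "happy"       -> happier
    [3, 0, 1, 2.1],   # e.g. "careful"     -> more careful
    [4, 0, 1, 1.5],   # e.g. "interesting" -> more interesting
])
y = np.array([1, 1, 0, 0])  # 1 = synthetic (-er), 0 = analytic (more)

model = LogisticRegression().fit(X, y)
print(dict(zip(["syllables", "final_y", "predicative", "log_freq"],
               model.coef_[0])))  # coefficient per predictor
```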
In most machine learning tasks, we evaluate a model M on a given data population S by measuring a population-level metric F(S; M). Examples of such an evaluation metric F include precision/recall for (binary) recognition, the F1 score for multi-class classification, and the BLEU metric for language generation. On the other hand, the model M is trained by optimizing a sample-level loss G(S_t; M) at each learning step t, where S_t is a subset of S (a.k.a. the mini-batch). Popular choices of G include cross-entropy loss, the Dice loss, and sentence-level BLEU scores. A fundamental assumption behind this paradigm is that the mean value of the sample-level loss G, averaged over all possible samples, effectively represents the population-level metric F of the task, i.e., that E[G(S_t; M)] ≈ F(S; M). In this paper, we systematically investigate the above assumption in several NLP tasks. We show, both theoretically and experimentally, that some popular designs of the sample-level loss G may be inconsistent with the true population-level metric F of the task, so that models trained to optimize the former can be substantially sub-optimal with respect to the latter, a phenomenon we call Simpson's bias, due to its deep connections with the classic paradox known as Simpson's reversal paradox in statistics and social sciences.
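The following toy sketch (not the paper's experiments) shows the kind of mismatch at issue: averaging a sample-level F1 over mini-batches gives a different value from the population-level F1 computed over the pooled counts, so optimizing the former need not optimize the latter.

```python
# Toy illustration: the mean of per-batch F1 scores can differ from the F1
# computed on the whole population, so a model tuned for the former may be
# sub-optimal for the latter.

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two mini-batches with (true positives, false positives, false negatives)
batches = [(9, 1, 0),   # batch 1: near-perfect
           (1, 0, 9)]   # batch 2: many misses

per_batch = [f1(*b) for b in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

tp = sum(b[0] for b in batches)
fp = sum(b[1] for b in batches)
fn = sum(b[2] for b in batches)
population = f1(tp, fp, fn)

print(mean_of_batches)  # ≈ 0.56
print(population)       # ≈ 0.67 -- the two aggregates disagree
```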