This paper reports the latest performance of components and features of a project named Corpus-Centered Computation (C'3), which targets a translation technology suitable for spoken language translation. C3 places corpora at the center of the technology. Translation knowledge is extracted from corpora by both EBMT and SMT methods, translation quality is gauged by referring to corpora, the best translation among multiple-engine outputs is selected based on corpora and the corpora themselves are paraphrased or filtered by automated processes.
We propose a method to split and translate input sentences for speech translation in order to overcome the long sentence problem. This approach is based on three criteria used to judge the goodness of translation results. The criteria utilize the output of an MT system only and assumes neither a particular language nor a particular MT approach. In an experiment with an EBMT system, in which prior methods cannot work or work badly, the proposed split-and-translate method achieves much better results in translation quality.
An Example-Based Machine Translation (EBMT) system, whose translation example unit is a sentence, can produce an accurate and natural translation if translation examples similar enough to an input sentence are retrieved. Such a system, however, suffers from the problem of narrow coverage. To reduce the problem, a large-scale parallel corpus is required and, therefore, an efficient method is needed to retrieve translation examples from a large-scale corpus. The authors propose an efficient retrieval method for a sentence-wise EBMT using edit-distance. The proposed retrieval method efficiently retrieves the most similar sentences using the measure of edit-distance without omissions. The proposed method employs search-space division, word graphs, and an A* search algorithm. The performance of the EBMT was evaluated through Japanese-to-English translation experiments using a bilingual corpus comprising hundreds of thousands of sentences from a travel conversation domain. The EBMT system achieved a high-quality translation ability by using a large corpus and also achieved efficient processing by using the proposed retrieval method.
In order to boost the translation quality of EBMT based on a small-sized bilingual corpus, we use an out-of-domain bilingual corpus and, in addition, the language model of an indomain monolingual corpus. We conducted experiments with an EBMT system. The two evaluation measures of the BLEU score and the NIST score demonstrated the effect of using an out-of-domain bilingual corpus and the possibility of using the language model.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.