Abstract. The language model is an important component of any speech recognition system. In this paper, we present a lexical enrichment methodology for corpora focused on the construction of statistical language models. This methodology considers, on one hand, the identification of the set of poorly represented words in a given training corpus and, on the other hand, the enrichment of the given corpus by the repetitive inclusion of selected text fragments containing these words. The first part of the paper des…
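The abstract describes two steps: finding poorly represented words in a training corpus, then repeatedly adding text fragments that contain them. As a minimal sketch of that general idea — the function names, frequency threshold, and repetition count below are illustrative assumptions, not the paper's actual method:

```python
from collections import Counter

def poorly_represented_words(corpus_tokens, min_count=5):
    """Return words whose frequency in the corpus falls below min_count
    (the threshold is an assumption for illustration)."""
    counts = Counter(corpus_tokens)
    return {w for w, c in counts.items() if c < min_count}

def enrich_corpus(corpus_sentences, candidate_fragments, min_count=5, repetitions=3):
    """Append fragments that contain poorly represented words, repeating
    each one to raise those words' counts in the training data."""
    tokens = [w for s in corpus_sentences for w in s.split()]
    rare = poorly_represented_words(tokens, min_count)
    enriched = list(corpus_sentences)
    for frag in candidate_fragments:
        # Only include fragments that actually cover a rare word.
        if rare.intersection(frag.split()):
            enriched.extend([frag] * repetitions)
    return enriched
```

The repetition is what shifts the n-gram counts: adding a fragment once barely moves a rare word's probability estimate, while controlled repetition gives its n-grams usable mass.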
“…This combination is based on the pertinence of the translations to the target document collection. This pertinence, as in the previous method, expresses how a given translation fits in… [Footnote 3: The n-gram model was constructed using the method described in [15].]”
Section: Methods 2: "Combining Passages From Several Translations"
Abstract. One major problem of state-of-the-art Cross Language Question Answering systems is the translation of user questions. This paper proposes combining the potential of multiple translation machines in order to improve the final answering precision. In particular, it presents three different methods for this purpose. The first one focuses on selecting the most fluent translation from a given set; the second one combines the passages recovered by several question translations; finally, the third one constructs a new question reformulation by merging word sequences from different translations. Experimental results demonstrate that the proposed approaches reduce the error rates relative to a monolingual question answering exercise.
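The first method in this abstract selects the most fluent translation from a candidate set, and the quoted context mentions an n-gram model (built per [15]). As a hedged sketch of that selection idea only — an add-one-smoothed bigram model, which is an assumption here, not the cited paper's implementation:

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Train a simple add-one-smoothed bigram model from whitespace-tokenized
    sentences; returns a log-probability scoring function."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def log_prob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(toks, toks[1:])
        )
    return log_prob

def most_fluent(translations, log_prob):
    """Pick the candidate the language model scores highest, normalized by
    length so that shorter outputs are not unfairly favored."""
    return max(translations, key=lambda t: log_prob(t) / (len(t.split()) + 1))
```

The intuition: a translation whose word sequences are frequent in target-language text gets a higher model score, so fluency selection reduces to ranking candidates by (length-normalized) log-probability.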
“…In particular, in Mexico there have been some interesting efforts related to the use of the web for the automatic construction of domain-specific ontologies [16], training sets for text classification tasks [6,7], and language models for speech recognition [28]. The following sections give a brief overview of these works.…”
Section: Extracting Information From the Web
“…The construction of this corpus is not a simple task since written texts do not represent adequately many phenomena of spontaneous speech. In order to alleviate this problem, [28] proposes the use of web documents as data source. This proposal was based on the fact that many people around the world contribute to create the web, and therefore, that most of its documents comprise informal contents and include many everyday as well as non-grammatical expressions used in spoken language.…”
Section: Tuning Task-specific Language Models Through Web Data
“…In particular, the method presented in [28] faces the problem of enlarging a given small task-specific corpus (called reference corpus). It considers the following main steps:…”
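The quoted list of steps is truncated above. Purely as an illustration of the general idea — enlarging a small task-specific reference corpus with web text that resembles it — and not the actual procedure of [28], one could filter candidate web snippets by lexical similarity to the reference corpus (the cosine measure and threshold below are assumptions):

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words frequency vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm = (math.sqrt(sum(c * c for c in counts_a.values()))
            * math.sqrt(sum(c * c for c in counts_b.values())))
    return dot / norm if norm else 0.0

def select_web_snippets(reference_corpus, web_snippets, threshold=0.2):
    """Keep only web snippets lexically close to the reference corpus,
    then return the enlarged corpus."""
    ref_counts = Counter(w for s in reference_corpus for w in s.split())
    kept = [s for s in web_snippets
            if cosine_similarity(ref_counts, Counter(s.split())) >= threshold]
    return reference_corpus + kept
```

The filtering step matters because raw web text is noisy: without a similarity gate, off-domain snippets would dilute the task-specific n-gram statistics the enlargement is meant to strengthen.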
Section: Tuning Task-specific Language Models Through Web Data
“…In addition to Keller and Lapata (this issue) and references therein, Volk (2001) gathers lexical statistics for resolving prepositional phrase attachments, and Villasenor-Pineda et al. (2003) "balance" their corpus using Web documents.…”