This article describes a linguistically informed method for integrating phrasal verbs into statistical machine translation (SMT) systems. In a case study involving English to Bulgarian SMT, we show that our method not only improves translation quality but also outperforms similar methods previously applied to the same task. We attribute this to the fact that, in contrast to previous work on the subject, we employ detailed linguistic information. We found that features describing phrasal verbs as idiomatic or compositional contribute most to the improved translation quality achieved by our method.
We examine the use of word embeddings for machine translation (MT) of phrasal verbs (PVs), a linguistic phenomenon with challenging semantics. Using word embeddings, we augment the translation model with two features: one modelling the distributional semantic properties of the source and target phrases, and another modelling the degree of compositionality of PVs. We also obtain paraphrases to increase the amount of relevant training data. Our method leads to improved translation quality for PVs in a case study with an English to Bulgarian MT system.
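The compositionality feature described in this abstract can be illustrated with a common embedding-based measure: compare the observed vector of the whole phrasal verb with the vector composed from its parts. The sketch below uses tiny hand-made 3-dimensional vectors (illustrative values only, not from any trained model) and averaging as the composition function; the actual feature in the paper may differ.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def compositionality(pv_vec, word_vecs):
    # Compose the component vectors by averaging, then compare the
    # result with the observed vector of the whole phrase.
    composed = [sum(dims) / len(word_vecs) for dims in zip(*word_vecs)]
    return cosine(pv_vec, composed)

# Toy embeddings (assumed values for illustration).
emb = {
    "give":    [0.9, 0.1, 0.2],
    "up":      [0.1, 0.8, 0.3],
    "give up": [0.2, 0.3, 0.9],  # idiomatic: far from its parts
    "look":    [0.7, 0.2, 0.4],
    "look up": [0.6, 0.3, 0.5],  # fairly compositional
}

idiomatic = compositionality(emb["give up"], [emb["give"], emb["up"]])
literal   = compositionality(emb["look up"], [emb["look"], emb["up"]])
```

A low score suggests an idiomatic PV whose meaning cannot be recovered from its parts; a high score suggests a compositional one, which is exactly the distinction the translation model can exploit.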
We present a novel approach for creating sense-annotated corpora automatically. Our approach employs shallow syntactico-semantic patterns derived from linked lexical resources to automatically identify instances of word senses in text corpora. We evaluate our labelling method intrinsically on SemCor and extrinsically by using automatically labelled corpus text to train a classifier for verb sense disambiguation. Testing this classifier on verbs from the English MASC corpus and on verbs from the Senseval-3 all-words disambiguation task shows that it matches the performance of a classifier trained on manually annotated data.
In this paper we illustrate and underline the importance of making detailed linguistic information a central part of the automatic acquisition of large-scale lexicons, as a means of enhancing robustness while ensuring the maintainability and re-usability of deep lexicalised grammars. Using the error mining techniques proposed by van Noord (2004), we show convincingly that low lexical coverage is the main obstacle both to porting deep lexicalised grammars to domains other than those they were originally developed in and to the robustness of systems using such grammars. To this end, we develop linguistically driven methods that use detailed morphosyntactic information to automatically enhance the performance of deep lexicalised grammars while maintaining their usually already high linguistic quality.
Recent studies have shown that syntax-based semantic space models outperform models in which the context is represented as a bag of words in several semantic analysis tasks. This has generally been attributed to the fact that syntax-based models employ corpora that are syntactically annotated by a parser and a computational grammar. However, if the corpora contain words unknown to the parser and the grammar, a syntax-based model may lose its advantage, since the syntactic properties of such words are unavailable. Bag-of-words models, on the other hand, do not face this issue, since they operate on raw, non-annotated corpora and are thus more robust. In this paper, we compare the performance of syntax-based and bag-of-words models on the task of learning the semantics of unknown words. In our experiments, unknown words are those not known to the Alpino parser and grammar of Dutch. In our study, the semantics of an unknown word is defined by finding its most similar word in CORNETTO, a Dutch lexico-semantic hierarchy. We show that for unknown words the syntax-based model performs worse than the bag-of-words approach. Furthermore, we show that if we first learn the syntactic properties of unknown words with an appropriate lexical acquisition method, the syntax-based model does in fact outperform the bag-of-words approach. We conclude that, for words unknown to a given grammar, a bag-of-words model is more robust than a syntax-based model, but that the combination of lexical acquisition and syntax-based semantic models is best suited for learning the semantics of unknown words.
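The bag-of-words approach this abstract contrasts with syntax-based models can be sketched in a few lines: build co-occurrence count vectors from raw text within a fixed context window, then find the nearest neighbour of a target word under cosine similarity. The corpus, the window size, and the made-up unknown word "zorg" below are all illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter

def bow_vectors(sentences, window=2):
    # Bag-of-words co-occurrence counts within a symmetric window.
    # No parser or grammar is needed: the input is raw text.
    vecs = {}
    for sent in sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(target, vecs):
    # Nearest neighbour of `target` under cosine similarity.
    return max((w for w in vecs if w != target),
               key=lambda w: cosine(vecs[target], vecs[w]))

corpus = [
    "the cat chased the mouse",
    "the dog chased the mouse",
    "the zorg chased the mouse",  # "zorg" stands in for an unknown word
]
vecs = bow_vectors(corpus)
neighbour = most_similar("zorg", vecs)
```

Because the model only needs surface co-occurrence counts, it handles "zorg" as easily as any known word, which is the robustness property the paper highlights; a syntax-based model would first need the word's syntactic properties from a parser or a lexical acquisition step.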