Abstract: This paper shows that the automatic labeling of discourse connectives with the relations they signal, prior to machine translation (MT), can be used by phrase-based statistical MT systems to improve their translations. This improvement is demonstrated here when translating from English to four target languages (French, German, Italian and Arabic) using several test sets from recent MT evaluation campaigns. Using automatically labeled data for training, tuning and testing MT systems is beneficial on condition that labels are sufficiently accurate, typically above 70%. To reach such an accuracy, a large array of features for discourse connective labeling (morphosyntactic, semantic and discursive) are extracted using state-of-the-art tools and exploited in factored MT models. The translation of connectives is improved significantly, between 0.7% and 10% as measured with the dedicated ACT metric. The improvements depend mainly on the level of ambiguity of the connectives in the test sets.
The various meanings of discourse connectives like "while" and "however" are difficult
to identify and annotate, even for trained human annotators. This problem is all the
more important because connectives are salient textual markers of cohesion and need to be
correctly interpreted for many NLP applications. In this paper, we suggest an
alternative route to reach a reliable annotation of connectives, by making use of the
information provided by their translation in large parallel corpora. This method thus
replaces the difficult explicit reasoning involved in traditional sense annotation with an
empirical clustering of the senses emerging from the translations. We argue that this
method has the advantage of providing more reliable reference data than traditional
sense annotation. In addition, its simplicity allows for the rapid construction of large
annotated datasets.