In this paper we outline a number of problems that arise during contrastive human-coded corpus annotation of certain semantic and discourse categories within the framework of the CONTRANOT project, which aims to create and validate contrastive functional descriptions through corpus analysis and annotation. Human-coded corpus annotation is a preliminary step in training computer algorithms that automate the annotation of large corpora, but it can also serve as a mechanism for empirically testing aspects of linguistic theories, such as theory formation and theory redefinition, and for enriching theories with quantitative information. The work reported here focuses on the annotation of Thematisation, on the one hand, and Modality, on the other, to illustrate the challenges researchers face when developing well-designed and reliable annotation procedures for complex linguistic phenomena in a contrastive manner. We describe the annotation tasks and procedures developed so far, which include the design of annotation schemas on the basis of available linguistic theories and the testing of their reliability through agreement studies. We also evaluate and discuss the results of the annotations in terms of their relevance for the theoretical characterisation of the investigated phenomena. We expect that our work will have an impact in the area of contrastive textual analysis, and that it will pave the way for the development of automated annotation systems for computational applications.
This chapter reports on the contrastive analysis of interpersonal discourse markers (IDMs) in a sample of English and Spanish newspaper texts in three genres: news reports, editorials and letters to the editor. The sample was divided into a training dataset of eighteen comparable (English-Spanish) texts and a larger dataset of 220 texts, comprising 60 news reports, 60 editorials and 100 letters to the editor. Following the methodology of Hovy & Lavid (2010), we present a preliminary annotation scheme validated by an inter-annotator agreement study. We then present the results of annotating the larger dataset, which reveal genre-related and language-specific variation in the distribution of IDMs in these newspaper genres. We discuss the results and offer some possible explanations for them.
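As an illustration of what an inter-annotator agreement study computes, the sketch below implements Cohen's kappa for two annotators. The abstract does not say which agreement coefficient was used, so the choice of kappa, the function name `cohen_kappa` and the toy labels are all assumptions made for this example only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper: the source does not specify its agreement measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two annotators'
    # marginal proportions, summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration (not data from the study).
annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "x", "y", "x"]
kappa = cohen_kappa(annotator_1, annotator_2)  # observed 0.75, chance 0.5
```

With these toy labels the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. Chance-corrected coefficients of this kind are commonly preferred over raw percentage agreement because they discount matches that skewed category frequencies would produce anyway.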
This chapter summarises and discusses recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization, as carried out within the framework of the MULTINOT project. The annotation of these complex features in bilingual parallel texts poses important challenges for the researcher at the different stages of the corpus development, from the preprocessing phases to the manual annotation phase, but, at the same time, it allows the investigation of complex linguistic research questions which could not be addressed on the basis of raw corpora or even with the help of an automatic part-of-speech tagging system.
The study and annotation of discourse markers (DMs) in the context of translation is a much-needed and challenging task, not only for descriptive translation studies but also for Natural Language Processing (NLP) applications. Their various meanings are difficult to identify and annotate, even for trained human experts. In this chapter, a methodology for the analysis and annotation of DMs is proposed, using three highly frequent English DMs (in fact, actually and really) and their translations into Spanish as a case study. The methodology consists of an initial corpus analysis phase followed by a corpus annotation phase. The corpus analysis provides qualitative and quantitative information on the meanings of these DMs by looking at their translations in large parallel corpora. The corpus annotation phase specifies the annotation procedure, which can be generalized to other DMs and other language pairs and can form the basis for large-scale cross-linguistic annotation of DMs.
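The quantitative side of the corpus analysis phase can be pictured as counting how often a DM is rendered by each candidate equivalent in sentence-aligned text. The sketch below is only a rough illustration under stated assumptions: the function name `translation_counts`, the candidate list and the sentence pairs are invented for this example, and simple substring matching stands in for whatever alignment and inspection procedure the chapter actually uses.

```python
import re
from collections import Counter

NO_MATCH = "(no listed equivalent)"


def translation_counts(aligned_pairs, marker, candidates):
    """Count candidate target-language renderings of `marker`.

    aligned_pairs: iterable of (source_sentence, target_sentence) tuples.
    candidates: target-language expressions to look for (hypothetical list).
    """
    marker_re = re.compile(r"\b" + re.escape(marker) + r"\b", re.IGNORECASE)
    # Try longer candidates first so a longer expression wins over a shorter
    # one that it happens to contain.
    ordered = sorted(candidates, key=len, reverse=True)

    counts = Counter()
    for source, target in aligned_pairs:
        if not marker_re.search(source):
            continue  # the marker does not occur in this source sentence
        target_lower = target.lower()
        for candidate in ordered:
            if candidate.lower() in target_lower:
                counts[candidate] += 1
                break
        else:
            # Marker omitted or rendered by something outside the list.
            counts[NO_MATCH] += 1
    return counts


# Invented toy sentence pairs, not taken from any corpus.
pairs = [
    ("In fact, the plan failed.", "De hecho, el plan fracasó."),
    ("It was, in fact, quite simple.", "En realidad, era bastante sencillo."),
    ("The plan failed.", "El plan fracasó."),
    ("In fact, nobody came.", "No vino nadie."),
]
counts = translation_counts(pairs, "in fact", ["de hecho", "en realidad"])
```

Counts of this kind give only the distributional picture. Deciding which meaning of the DM each rendering reflects remains a qualitative judgement, and that judgement is what the annotation phase is meant to make explicit.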