This chapter addresses the requirements and linguistic foundations of automatic relational discourse analysis of complex text types such as scientific journal articles. It is argued that besides lexical and grammatical discourse markers, which have traditionally been employed in discourse parsing, cues derived from the logical and generic document structure and the thematic structure of a text must be taken into account. An approach to modelling such types of linguistic information in terms of XML-based multi-layer annotations, and to a text-technological representation of additional knowledge sources, is presented. By means of quantitative and qualitative corpus analyses, cues and constraints for automatic discourse analysis can be derived. Furthermore, the proposed representations are used as the input sources for discourse parsing. A short overview of the projected parsing architecture is given.
Keywords Discourse parsing • Discourse relations • Document structure • Text technology • Linguistic annotations • XML
Introduction

In the past, several approaches to automatic discourse analysis have been developed as applications of relational discourse theories which describe the semantics of discourse. These approaches are often based on the analysis of discourse connectives as well as morphological and syntactic features. Such surface-oriented strategies are adequate and have yielded good results when applied to the analysis of simple text types like newspaper articles, which are characterised by a limited size and a relatively simple document and syntactic structure. When dealing with more complex text types, however, an analysis of lexis and grammar is not sufficient. Sources of knowledge about discourse and document semantics have to be considered as well.

H. Lüngen, Justus-Liebig-Universität Gießen, Gießen, Germany, e-mail: luengen@uni-giessen.de
Published in: Witt, Andreas/Metzing, Dieter (eds.): Linguistic Modeling of Information and Markup Languages. Contributions to Language Technology. Dordrecht: Springer, 2010. pp. 97-123. (Text, Speech and Language Technology 41)

This chapter deals with the linguistic foundations of discourse analysis for a complex text type, using the example of scientific journal articles. Its focus is on the contribution of logical document structure, generic document structure and thematic structure to discourse parsing. The modelling and representation of linguistic structures and knowledge sources based on text-technological (XML-based) formalisms and methods is addressed. 
The representations are used in investigating correlations and interactions between different types of linguistic information and serve as an input to a discourse parsing system.

In the project SemDok, which is part of the Research Group Text-technological modelling of information funded by the German Research Foundation (DFG) and scheduled to run in its second phase for three years (2005-2008), a discourse parser for the complex text type "scientific research article" is being developed. Scientific ...