This paper deals with the annotation of "aboutness topic" (also known as "sentence topic") in naturally occurring data. We report on two annotation experiments in which relatively poor inter-rater agreement was attained for the annotation of topics, even though the coders followed the same annotation instructions within each experiment. After presenting some theoretical background on the notion of topic in linguistics, we present the first experiment. Tokens that proved particularly difficult to assess in that experiment are identified, systematized, and discussed in some detail. In sum, the cases most likely to lead to non-matching annotations are those that either require a decision between "thetic" and "topic-comment" or involve an overlap between focus and topic. To increase inter-rater agreement, we modified the annotation guidelines, trying to eliminate some of the confounds from the first experiment. We then trained other annotators to use the modified guidelines and set them an annotation task. Again, the degree of inter-rater agreement was slightly disappointing. We discuss what we believe to be the problem cases in this task and give some guidance for future modification of the guidelines. The findings raise a number of issues that may contribute to the discussion in theoretical linguistics, and they may also alert other researchers planning a similar enterprise to some pitfalls they may encounter.
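The abstract does not say which agreement measure was used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two coders. The function name `cohens_kappa`, the label set, and the toy annotations are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders over the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: share of tokens on which both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each coder labelled at random with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both coders used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations (NOT data from the paper).
coder_a = ["topic", "thetic", "topic", "topic", "thetic", "topic"]
coder_b = ["topic", "topic", "topic", "thetic", "thetic", "topic"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

In this toy case the coders agree on four of six tokens, yet kappa is low because much of that agreement would be expected by chance given how often both coders choose "topic".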
In web corpus construction, crawling is a necessary step and probably the most costly one, because it consumes expensive bandwidth, and excess crawling additionally increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web corpora or web indexes (for example, pages with little text or virtually no text at all). An optimized crawler for web corpus construction would ideally avoid crawling such content in the first place, saving bandwidth, storage, and post-processing costs. In this paper, we show in three experiments that two simple scores are suitable to improve the ratio between corpus size and crawling effort for web corpus construction. The first score is related to the overall text quality of the page containing the link; the second is related to the likelihood that the local block enclosing a link is boilerplate.
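The abstract does not specify how the two scores are computed or combined. The following is a minimal sketch of how such scores could drive a crawl frontier, assuming both scores are already available as numbers in [0, 1]. The class `Frontier`, the parameters `page_quality` and `boilerplate_prob`, the product used to combine them, and the example URLs are hypothetical choices for illustration, not the authors' method.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs; links with a higher combined score are crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, url, page_quality, boilerplate_prob):
        # Skip URLs that were already queued, to avoid re-crawling duplicates.
        if url in self._seen:
            return
        self._seen.add(url)
        # Assumed combination: favour links found on good pages outside boilerplate blocks.
        priority = page_quality * (1.0 - boilerplate_prob)
        # heapq is a min-heap, so the priority is negated to pop the best link first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Hypothetical links with made-up scores.
frontier = Frontier()
frontier.add("http://example.org/thin", page_quality=0.1, boilerplate_prob=0.1)
frontier.add("http://example.org/footer-link", page_quality=0.9, boilerplate_prob=0.8)
frontier.add("http://example.org/article", page_quality=0.9, boilerplate_prob=0.1)
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

A product is only one possible combination; a weighted sum or a threshold on each score separately would fit the same queue without further changes.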
The present paper is a survey of the information-structural properties of multiple fronting constructions in German. Based on a collection of naturally occurring examples (for the most part extracted from the corpora hosted at the Institut für Deutsche Sprache in Mannheim), the prefield material is characterised with respect to givenness, topic and focus status, and the findings are discussed in the light of various proposals from the literature. The final section suggests that a number of other factors, not (or only indirectly) related to information structure, probably play a role in accounting for the phenomenon. In addition, for a limited subset of German (a number of selected structures), quantitative evidence is presented for the first time that illustrates the relation between multiple fronting and the similar, but more "canonical", fronting of a (possibly partial) verb phrase.
In the present review paper by members of the collaborative research center “Register: Language Users' Knowledge of Situational-Functional Variation” (CRC 1412), we assess the pervasiveness of register phenomena across different time periods, languages, modalities, and cultures. We define “register” as recurring variation in language use depending on the function of language and on the social situation. Informed by rich data, we aim to better understand and model the knowledge involved in situation- and function-based use of language register. In order to achieve this goal, we are using complementary methods and measures. In the review, we start by clarifying the concept of “register”, by reviewing the state of the art, and by setting out our methods and modeling goals. Against this background, we discuss three key challenges, two at the methodological level and one at the theoretical level: (1) To better uncover registers in text and spoken corpora, we propose changes to established analytical approaches. (2) To tease apart between-subject variability from the linguistic variability at issue (intra-individual situation-based register variability), we use within-subject designs and the modeling of individuals' social, language, and educational background. (3) We highlight a gap in cognitive modeling, viz. modeling the mental representations of register (processing), and present our first attempts at filling this gap. We argue that the targeted use of multiple complementary methods and measures supports investigating the pervasiveness of register phenomena and yields comprehensive insights into the cross-methodological robustness of register-related language variability. These comprehensive insights in turn provide a solid foundation for associated cognitive modeling.