This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1-score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
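A minimal sketch of how a token-level F1 score like the one reported above can be computed, treating tokens as character-offset spans in the raw text; this alignment scheme is an assumption for illustration, not the shared task's official scorer:

```python
# Minimal sketch: token-level precision/recall/F1 for a tokenizer,
# comparing predicted and gold tokens as character offset spans.
# Illustrative only -- not the shared task's official evaluation script.

def spans(text, tokens):
    """Map a token sequence back onto character offsets in the raw text."""
    out, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        out.append((start, start + len(tok)))
        pos = start + len(tok)
    return out

def token_f1(text, gold_tokens, pred_tokens):
    gold = set(spans(text, gold_tokens))
    pred = set(spans(text, pred_tokens))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# CMC-style example: the emoticon should be split off as its own token.
text = "Das ist super:-)"
print(token_f1(text, ["Das", "ist", "super", ":-)"],
               ["Das", "ist", "super:-)"]))  # ~0.571
```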
This paper presents an annotation approach for examining uncertainty in British and German newspaper articles on the coronavirus pandemic. We develop a tagset in an interdisciplinary team from corpus linguistics and sociology. After working out a gold standard on a pilot corpus, we apply the annotation to the entire corpus, drawing on an “annotation-by-query” approach in CQPWeb based on uncertainty constructions extracted from the gold standard data. The annotated data are then evaluated and sociologically contextualised. On this basis, we study the development of uncertainty markers over the period under study and compare media discourses in Germany and the UK. Our findings reflect the different courses of the pandemic in Germany and the UK as well as the different political responses, media traditions and cultural concerns: while markers of fear are more prominent in British discourse, we see a steadily increasing level of disagreement in German discourse. Other forms of uncertainty, such as ‘possibility’ or ‘probability’, are similarly frequent in both discourses.
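The "annotation-by-query" idea can be sketched in a few lines: constructions extracted from the gold standard are turned into queries and matched against the full corpus. The toy Python version below uses regular expressions and invented placeholder marker lists; the actual study runs CQP queries in CQPWeb, and its real tagset and constructions differ:

```python
# Toy sketch of "annotation-by-query": constructions extracted from
# gold-standard data become queries run against the whole corpus.
# Marker lists are invented placeholders, not the study's tagset,
# and the real workflow uses CQP queries in CQPWeb, not Python regexes.
import re

QUERIES = {
    "possibility": [r"\bmight\b", r"\bperhaps\b", r"\bpossibl\w+\b"],
    "probability": [r"\blikely\b", r"\bprobabl\w+\b"],
    "fear":        [r"\bfear\w*\b", r"\bafraid\b", r"\bworr\w+\b"],
}

def annotate(sentence):
    """Return the set of uncertainty categories whose queries match."""
    hits = set()
    for category, patterns in QUERIES.items():
        if any(re.search(p, sentence, re.IGNORECASE) for p in patterns):
            hits.add(category)
    return hits

print(annotate("The new variant might possibly evade immunity."))
# {'possibility'}
```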
In this paper, we present our systems submitted to SemEval-2021 Task 1 on lexical complexity prediction (Shardlow et al., 2021a). The aim of this shared task was to create systems able to predict the lexical complexity of word tokens and bigram multiword expressions within a given sentence context, expressed as a continuous value indicating how difficult the respective expression is to understand. Our approach relies on gradient boosted regression tree ensembles fitted on a heterogeneous feature set combining linguistic features, static and contextualized word embeddings, psycholinguistic norm lexica, WordNet, word- and character-bigram frequencies, and inclusion in word lists, yielding a model able to assign a word or multiword expression a context-dependent complexity score. We show that contextualised string embeddings (Akbik et al., 2018) are particularly helpful for predicting lexical complexity.
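As a rough illustration of the model family, the sketch below fits a gradient boosted regression tree ensemble on a toy feature matrix with scikit-learn; the three features and the synthetic target are simplified stand-ins for the heterogeneous feature set described above, not the actual system:

```python
# Minimal sketch: gradient boosted regression trees predicting a
# continuous complexity score. Features are simplified stand-ins
# (length, frequency, embedding norm) for the paper's feature set,
# and the data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: [token length, log corpus frequency, mean embedding norm]
X = rng.normal(size=(500, 3))
# Synthetic target: longer, rarer words get higher complexity scores.
y = 0.1 + 0.3 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.05, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, max_depth=3,
                                  learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```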
Knowledge about Theme-Rheme organization supports the interpretation of a text in terms of its thematic progression and provides a window into the topicality of a text as well as its text type (genre). This is potentially relevant for NLP tasks such as information extraction and text classification. To explore this potential, large corpora annotated for Theme-Rheme organization are needed. We report on a rule-based system for the automatic identification of Theme, to be employed for corpus annotation. The rules are manually derived from a set of sentences parsed syntactically with the Stanford parser and analyzed in terms of Theme on the basis of Systemic Functional Grammar (SFG). We describe the development of the rule set and the automatic procedure of Theme identification, and assess the validity of the approach by applying it to authentic text data.
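The flavour of such a rule-based procedure can be illustrated with a toy sketch over a constituency parse. The single rule shown (take clause-initial constituents up to and including the first NP, PP or ADVP, i.e. the first experiential element) is a simplified stand-in for the manually derived rule set, not the paper's actual implementation:

```python
# Toy sketch of rule-based Theme identification on a constituency parse.
# One simplified rule stands in for the paper's full rule set: collect
# clause-initial constituents up to and including the first NP/PP/ADVP
# (a rough proxy for the first experiential element in SFG terms).
from nltk import Tree

def theme(parse):
    """Return the tokens of the (topical) Theme of a declarative clause."""
    tree = Tree.fromstring(parse)
    themes = []
    for child in tree:  # children of the top-level S node
        themes.append(child)
        if child.label() in {"NP", "PP", "ADVP"}:
            break  # first experiential element reached
    return [leaf for t in themes for leaf in t.leaves()]

parse = ("(S (PP (IN On) (NP (NNP Monday))) "
         "(NP (DT the) (NN committee)) (VP (VBD met)))")
print(theme(parse))  # ['On', 'Monday'] -- a circumstantial Theme
```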