We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset is based on a corpus of 43.2 billion tokens and is divided into 16 parts corresponding to 16 corpus folds. We describe how the dataset was created and how it is structured. By evaluating the distribution across the 16 folds, we show that for many use cases it is possible to work with only a subset of the folds (e.g., to save computational resources). In a case study, we investigate the growth of the vocabulary (as well as the number of hapax legomena) as more and more folds are included in the analysis. We cross-combine this with several cleaning stages of the dataset. We also offer guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
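The vocabulary-growth analysis described above can be sketched as follows. This is a minimal illustration, not the authors' actual analysis code: the toy `folds` list stands in for the 16 DeReKoGram corpus folds (which are distributed as frequency tables, not raw tokens), and the function name `vocab_growth` is hypothetical.

```python
from collections import Counter

def vocab_growth(folds):
    """For each cumulative set of folds, report vocabulary size and
    the number of hapax legomena (lemmas occurring exactly once)."""
    cumulative = Counter()
    results = []
    for fold in folds:
        cumulative.update(fold)
        vocab_size = len(cumulative)
        hapaxes = sum(1 for c in cumulative.values() if c == 1)
        results.append((vocab_size, hapaxes))
    return results

# Toy stand-in data: three tiny "folds" of lemmatized tokens.
folds = [
    ["haus", "baum", "haus", "auto"],
    ["baum", "hund", "katze", "haus"],
    ["auto", "vogel", "haus", "baum"],
]

for i, (vocab, hapax) in enumerate(vocab_growth(folds), start=1):
    print(f"after fold {i}: vocabulary={vocab}, hapax legomena={hapax}")
```

With real data, the same cumulative pass over per-fold frequency tables shows how quickly the vocabulary (and the hapax count) saturates as folds are added, which is what justifies working with a subset of folds.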