2013
DOI: 10.1093/llc/fqt038

DeRiK: A German reference corpus of computer-mediated communication

Cited by 11 publications (11 citation statements)
References 1 publication
“…In particular, a <post> element to model the basic building block of CMC communication, along with several attributes such as @replyTo, and @indentLevel, was introduced by the TEI CMC SIG. The French and German projects that contributed to the TEI CMC SIG and in which CMC-specific TEI customisations were developed, were concerned with building CMC corpora of multiple genres (Beißwenger et al 2012, Lüngen et al 2016.…”
Section: Tei - Text Encoding Initiative
Citation type: mentioning
confidence: 99%
“…According to an informal and otherwise unpublished description on the ClueWeb09 web page, it was crawled with Nutch and "best-first search, using the OPIC metric", which is biased toward documents which are relevant to search engine applications. Finally, there are completely different kinds of web corpus projects, which use stratified non-random sampling as mentioned in Section 2.1, such as the German DeRiK corpus project of computer-mediated communication [16]. Such corpora are designed with a specific purpose in mind, and they are orders of magnitude smaller than the large crawled web corpora we discuss in this paper.…”
Section: Strategies Used By Existing Web Corpus Projects
Citation type: mentioning
confidence: 99%
“…In the first pipeline step, those different input formats are read and split into parts that can be processed in parallel. For ARC/WARC, this is done by reading input archives using the open-source library JWAT and generating a split for each archive record. Although web archives can contain text documents in virtually any format such as PDF, word processor, presentations, or even images, only HTML documents are used for processing.…”
Section: Webcorpus Project
Citation type: mentioning
confidence: 99%
“…European Computer-Mediated Communication (CMC) and Mediated Digital Discourse (MDD) corpora initiatives are becoming more visible: Belgian sms4science, Vos Pouces, (Fairon et al, 2006; Cougnon, 2015; Cougnon and Fairon, 2014; Cougnon et al, 2017); Dutch SoNaR, (Oostdijk et al, 2008); French CoMeRe, (Chanier et al, 2014); German DeRik, (Beißwenger et al, 2013); Swiss What's up Switzerland?, (Ueberwasser and Stark, 2017; Frey et al, 2016). These data types are often difficult to process, standardize, analyze, owing to their complex nature, including 'noisy' content (Frey et al, 2019; Poudat et al, 2020).…”
Section: Introduction
Citation type: mentioning
confidence: 99%