Proceedings of the 10th Web as Corpus Workshop 2016
DOI: 10.18653/v1/w16-2612

On Bias-free Crawling and Representative Web Corpora

Abstract: In this paper, I present a specialized open-source crawler that can be used to obtain bias-reduced samples from the web. First, I briefly discuss the relevance of bias-reduced web corpus sampling for corpus linguistics. Then, I summarize theoretical results that show how commonly used crawling methods obtain highly biased samples from the web. The theoretical part of the paper is followed by a description of my feature-complete and stable ClaraX crawler, which performs so-called Random Walks, a form of crawling that…
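The Random Walk idea named in the abstract can be illustrated with a short, generic sketch. The Python below is not the ClaraX implementation: the seed URL, walk length, and restart policy are illustrative assumptions, and a genuinely bias-correcting walk would need reweighting or rejection steps on top of the plain uniform walk shown here.

import random
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base_url, value)
                    if url.startswith(("http://", "https://")):
                        self.links.append(url)

def random_walk(seed_url, steps=1000):
    """Follow one uniformly random outlink per page, yielding each
    visited URL; dead ends and fetch errors restart from the seed."""
    url = seed_url
    for _ in range(steps):
        yield url
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            url = seed_url  # restart the walk on fetch failure
            continue
        collector = LinkCollector(url)
        collector.feed(html)
        # A plain uniform walk like this still oversamples well-linked
        # pages; reducing exactly that kind of bias is the paper's topic.
        url = random.choice(collector.links) if collector.links else seed_url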

Cited by 15 publications (14 citation statements)
References 17 publications
“…For both snowclones, we queried three different corpora: the Corpus of Historical American English (COHA, Davies 2010), the Corpus of Contemporary American English (COCA, Davies 2008), and the web corpus ENCOW16A (Schäfer 2015). While the data from the first two corpora, as well as the ENCOW data for [the mother of all X], were taken into account exhaustively, we worked with a sample of 5,000 instances of the ENCOW data for [X BE the new Y].…”
Section: Methods (mentioning)
confidence: 99%
“…One LSTM network was trained on a set of sentences extracted from the NLCOW2014 corpus, which comprises individual sentences of Dutch texts collected from the World Wide Web (Schäfer, 2015). Only the first slice, with approximately 37 million sentences, was used in the current research.…”
Section: The Neural Network (mentioning)
confidence: 99%
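The citing study above only mentions that an LSTM network was trained on corpus sentences. For orientation, here is a minimal sentence-level LSTM language model in PyTorch; it is not the cited study's architecture, and the vocabulary size, layer dimensions, and toy batch are illustrative assumptions.

import torch
import torch.nn as nn

class SentenceLSTM(nn.Module):
    """Next-word prediction over indexed corpus sentences."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)  # per-position next-word logits

model = SentenceLSTM()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch standing in for indexed sentences: predict each next token.
batch = torch.randint(0, 10000, (32, 20))
optimizer.zero_grad()
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, 10000), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()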
“…The different Web corpora are used in both language research and NLP (Natural Language Processing) (Kilgarriff & Grefenstette 2003). For instance, COW (COrpora from the Web) is the result of a project whose goal is to determine the value of linguistic material collected from the Internet (Schäfer 2016) for fundamental linguistic research (Schäfer & Bildhauer 2012). OSCAR (Open Superlarge Crawled ALMAnaCH Corpus), on the other hand, is a huge multilingual corpus obtained by language identification (Scheible et al., 2020; Suarez et al., 2020) and filtering of Common Crawl data without any metadata, and is intended to be used in the training of different language models for NLP (Suarez et al., 2019).…”
Section: Introduction (mentioning)
confidence: 99%
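The OSCAR construction described above, language identification plus filtering over Common Crawl text, can be sketched as a simple line filter. The sketch below assumes fastText's publicly available lid.176.bin language-identification model; the file names, target language, and confidence threshold are illustrative, and this is not the actual OSCAR pipeline code.

import fasttext

# Pretrained language-ID model from https://fasttext.cc (downloaded locally).
model = fasttext.load_model("lid.176.bin")

def is_target_language(line, target="__label__nl", min_conf=0.8):
    """True if `line` is classified as the target language confidently."""
    labels, probs = model.predict(line.rstrip("\n"))
    return labels[0] == target and probs[0] >= min_conf

# Keep only confidently Dutch lines from a raw crawl dump (hypothetical files).
with open("crawl_lines.txt", encoding="utf-8") as src, \
     open("filtered_nl.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip() and is_target_language(line):
            dst.write(line)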