2019
DOI: 10.1007/978-3-030-15712-8_49

Wikipedia Text Reuse: Within and Without

Abstract: We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl). To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline. We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million …

Cited by 8 publications (10 citation statements); references 19 publications.

Citation statements (ordered by relevance):
“…Tool-based editing to, for example, fix common misspellings, could skew grammatical error correction corpora built from Wikipedia (e.g., Lichtarge et al. (2019)). Tools exist to identify these patterns (e.g., Alshomary et al. (2019)), and a high-quality language model might seek to deduplicate this content or downweight it in training.…”
Section: Bots and Filters (mentioning)
confidence: 99%
“…However, for the most part, Wikipedia and other low-quality digital repositories do not provide accurate data [18]. For this reason, the need to verify online content before using it in academic tasks is important.…”
Section: Academic Dishonesty (mentioning)
confidence: 99%
“…The similarity between two passages t, t′ is then approximated by the extent of overlap between their hash sets, |h(t) ∩ h(t′)|. Thus, a document d′ is considered a candidate source of text reuse for a document d if at least one pair of their passage-level hash sets intersects [20].…”
Section: Source Retrieval (mentioning)
confidence: 99%
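As a minimal illustration of the hash-set overlap described in the statement above (not the authors' actual implementation), the following Python sketch hashes fixed-length word shingles of each passage and flags a candidate pair when the hash sets intersect. The shingle length k and all function names are assumptions introduced here for clarity.

# Sketch of hash-based candidate detection for text reuse.
# Assumption: passages are hashed via their word 3-grams (shingles);
# the real system's hashing scheme may differ. Python's built-in
# hash() is salted per process, so a production system would use a
# stable hash (e.g., from hashlib) instead.

def shingles(text: str, k: int = 3) -> list[str]:
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def h(passage: str) -> set[int]:
    # Hash set h(t) of a passage t: one integer hash per shingle.
    return {hash(s) for s in shingles(passage)}

def overlap(t: str, t_prime: str) -> int:
    # Similarity approximation |h(t) ∩ h(t')|.
    return len(h(t) & h(t_prime))

def is_candidate(t: str, t_prime: str) -> bool:
    # A passage pair is a reuse candidate if its hash sets intersect.
    return overlap(t, t_prime) > 0

# Usage: the shared shingle "quick brown fox" yields overlap 1.
print(overlap("the quick brown fox jumps", "a quick brown fox leaps"))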
“…Hash-based source retrieval allows for a significant reduction of the required computation time, since the hash-based approximation has a linear time complexity with respect to |D|, as opposed to the quadratic complexity of vector comparisons [20]. As a result, our source retrieval computation time could be fitted into the allotted budget of two months of computing time on a 130-node Apache Spark cluster, with 12 CPUs and 196 GB RAM per node.…”
Section: Source Retrieval (mentioning)
confidence: 99%
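To see why hash-based retrieval scales linearly in |D| where all-pairs vector comparison is quadratic, consider this hedged Python sketch of an inverted index from passage hashes to documents: building the index is a single pass over the corpus, and querying a document touches only the documents sharing at least one hash. The data structure and names are assumptions for illustration, not the paper's Spark implementation.

from collections import defaultdict

# Sketch: inverted index mapping each passage-level hash to the set of
# documents containing it. Construction is one pass over D (linear),
# and candidate lookup avoids comparing every document pair.

def build_index(corpus: dict[str, set[int]]) -> dict[int, set[str]]:
    index: dict[int, set[str]] = defaultdict(set)
    for doc_id, hashes in corpus.items():
        for hv in hashes:
            index[hv].add(doc_id)
    return index

def candidate_sources(doc_id: str, doc_hashes: set[int],
                      index: dict[int, set[str]]) -> set[str]:
    # All documents sharing at least one passage-level hash,
    # excluding the query document itself.
    cands: set[str] = set()
    for hv in doc_hashes:
        cands |= index.get(hv, set())
    cands.discard(doc_id)
    return cands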