Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d17-1290

Quantifying the Effects of Text Duplication on Semantic Models

Abstract: Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document repetition. By artificially creating different forms of duplicate text we confirm several hypotheses about how repeated text impacts models. While a small amount of …
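The abstract notes that deduplication methods are "typically heuristic." A minimal stdlib sketch of the simplest such heuristic, exact-duplicate removal by hashing normalized text, is shown below; the function name is illustrative and not from the paper, and real pipelines often use near-duplicate detection (e.g. shingling/MinHash) instead.

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates, keeping the first occurrence of each document.

    Documents are normalized (lowercased, whitespace-collapsed) before
    hashing, so trivially reformatted copies also count as duplicates.
    """
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Hashing keeps memory at one digest per unique document, which is why this style of heuristic stays cheap even on large corpora.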

Cited by 24 publications (18 citation statements) | References 19 publications
“…This characteristic is largely absent in data collected for the Internet Archive, where only 22 tweets were authored by @internetarchive, of which nine were original tweets. The sensitivity of latent Dirichlet allocation (LDA) topic modeling to repeated text has been turned to advantage in the topic model of the aggregated corpus to reveal the most retweeted tweets, many of which originated from the archives’ accounts (Schofield, Thompson, & Mimno, ). Repeated text in this study has been interpreted as a measure of user interest in the subject matter.…”
Section: Methods (mentioning, confidence: 99%)
“…To examine this phenomenon in isolation, we repeat the training corpus twice and observe the effect of diversity-aware strategies. The corpus duplication technique has been previously used to probe semantic models (Schofield et al, 2017). Figure 3 shows learning curves for strategies under the original and corpus duplication settings.…”
Section: Corpus Duplication Setting (mentioning, confidence: 99%)
“…Schofield et al. [17] show how repeated texts affect semantic models. They trained a latent Dirichlet allocation (LDA) model and a latent semantic analysis (LSA) model on corpora with different levels of repeated text.…”
Section: Related Work (mentioning, confidence: 99%)