Proceedings of the 15th Conference of the European Chapter of The Association for Computational Linguistics: Volume 2 2017
DOI: 10.18653/v1/e17-2069
Pulling Out the Stops: Rethinking Stopword Removal for Topic Models

Abstract: It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corp…

Cited by 143 publications (74 citation statements)
References 4 publications (1 reference statement)
“…As both can be prone to misspellings, one has to check the documents carefully before applying a topic model.13 For tools to carry out these steps, see Graham et al (2016). For a discussion of the effect of stopword removal, see Schofield et al (2017).…”
Section: Topic Models in Practice
confidence: 99%
“…For our research purposes, LDA analysis using all the vocabulary words would not be appropriate, because the existence of many common or stop words often introduces meaningless or uninterpretable topics (Schofield et al 2017). Therefore, in this experiment, to remove common or stop words, the instructor and the tutor selected 382 representative words of the course.…”
Section: Corpus of Reports
confidence: 99%
“…The LDA algorithm exploits document-level word co-occurrence patterns to discover underlying topics. Based on a prior study, we first removed stop words (e.g., "the", "a") and words that occurred ≤3 times in our corpus [25].…”
Section: Step 3: Topic Modeling
confidence: 99%