Proceedings of the International Conference on Agents and Artificial Intelligence 2015
DOI: 10.5220/0005194303530360

Stopwords Identification by Means of Characteristic and Discriminant Analysis

Abstract: Stopwords are meaningless, non-significant terms that frequently occur in a document. They should be removed, like noise. Traditionally, two different approaches to building a stoplist have been used: the former considers the most frequent terms in a language (e.g., an English stoplist), while the latter includes the most frequently occurring terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at aut…

Cited by 4 publications (2 citation statements)
References 8 publications

“…Many methods have been used to develop stoplists. Some of them are frequency-based approach [13], Bidirectional Long Short term memory (BiLSTM) [14], Word Embedding [15], Finite Automata [16], and utilizing characteristic and discriminant analysis [17]. The dataset or corpus used to extract or identify stopwords vary.…”
Section: Related Work (mentioning)
confidence: 99%
“…Stopwords are usually removed in the text preprocessing stage (Rajaraman and Ullman 2011), so that text models can focus on the distinctive words for better performance (Babar and Patil 2015; Raulji and Saini 2016). Otherwise, these nondistinctive words (i.e., stopwords) with high number of occurrences may distort the results of a machine learning algorithm (Armano, Fanni, and Giuliani 2015), especially in information retrieval (Zaman, Matsakis, and Brown 2011) and topic modeling (Wallach 2006). Removing stopwords greatly reduces the number of total words ("tokens") but not significantly reduces the number of distinct words, that is, vocabulary size (Manning et al 2008).…”
Section: Removing Stopwords (mentioning)
confidence: 99%