Stopwords Identification by Means of Characteristic and Discriminant Analysis

Armano, Giuliano; Fanni, Francesca; Giuliani, Alessandro

doi:10.5220/0005194303530360

Cited by 4 publications

(2 citation statements)

References 8 publications


“…Many methods have been used to develop stoplists. Some of them are frequency-based approach [13], Bidirectional Long Short term memory (BiLSTM) [14], Word Embedding [15], Finite Automata [16], and utilizing characteristic and discriminant analysis [17]. The dataset or corpus used to extract or identify stopwords vary.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Extraction of Indonesian Stopwords

Achsan, Suhartanto, Wibowo et al. 2023

IJACSA


The rapid growth of Indonesian-language content on the Internet has drawn researchers' attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing some common words, or stopwords. This research aims to extract stopwords automatically from a large Indonesian corpus of about seven million words downloaded from the web. The problem is that Indonesian is a low-resource language, making it challenging to develop an automatic stopword extractor. The method is based on Term Frequency-Inverse Document Frequency (TF-IDF): it ranks stopword candidates using TFs and IDFs and is applicable even to a small corpus (as small as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. The method makes two contributions: it can show all words found in all documents, and it has an automatic cut-off technique for selecting the top-ranked stopword candidates in Indonesian, overcoming one of the most challenging aspects of stopword extraction.
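As a rough illustration of ranking stopword candidates from TFs and IDFs, here is a minimal Python sketch. The abstract does not give the scoring formula or the automatic cut-off rule, so everything here is an assumption: the function name `rank_stopword_candidates`, the smoothed IDF, the score (corpus term frequency divided by IDF), and the fixed `top_k` used in place of the paper's cut-off technique.

```python
import math
import re
from collections import Counter


def rank_stopword_candidates(documents, top_k=5):
    """Rank words so that frequent, widely spread words come first.

    Hypothetical scoring, not the cited paper's formula:
    score = corpus term frequency / smoothed IDF.
    """
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)

    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the word
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    def score(word):
        # Smoothed IDF; it is always >= 1, so the division is safe.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        return term_freq[word] / idf

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(term_freq, key=lambda w: (-score(w), w))
    return ranked[:top_k]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
candidates = rank_stopword_candidates(docs, top_k=2)
```

On this toy corpus "the" ranks first because it is both the most frequent word and the only one present in every document. A real stoplist would need a much larger corpus and a principled cut-off in place of `top_k`.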


“…Stopwords are usually removed in the text preprocessing stage (Rajaraman and Ullman 2011), so that text models can focus on the distinctive words for better performance (Babar and Patil 2015; Raulji and Saini 2016). Otherwise, these nondistinctive words (i.e., stopwords) with high number of occurrences may distort the results of a machine learning algorithm (Armano, Fanni, and Giuliani 2015), especially in information retrieval (Zaman, Matsakis, and Brown 2011) and topic modeling (Wallach 2006). Removing stopwords greatly reduces the number of total words ("tokens") but not significantly reduces the number of distinct words, that is, vocabulary size (Manning et al. 2008).…”

Section: Removing Stopwords (mentioning)

confidence: 99%
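The token-versus-vocabulary point in the statement above can be checked with a small Python sketch. The sample sentence, the three-word stoplist, and the helper name `remove_stopwords` are illustrative assumptions and are not taken from the cited article.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stoplist, preserving order."""
    return [t for t in tokens if t not in stopwords]


text = "the cat and the dog and the bird saw the tree in the park"
tokens = text.split()
stopwords = {"the", "and", "in"}  # hypothetical stoplist

filtered = remove_stopwords(tokens, stopwords)

# Tokens removed: every occurrence of a stopword counts.
token_drop = len(tokens) - len(filtered)
# Distinct words removed: each stopword counts at most once.
vocab_drop = len(set(tokens)) - len(set(filtered))
```

Removing stopwords can shrink the vocabulary by at most the size of the stoplist, whereas the token count falls by every occurrence of those words. In a large corpus the vocabulary holds many thousands of distinct words, so the relative change in vocabulary size is small.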

Comparison of text preprocessing methods

Chai

2022

Nat. Lang. Eng.


Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.


Defining and Identifying Stophashtags in Instagram

Giannoulakis

Tsapatsoulis

2016

Advances in Big Data
