2021
DOI: 10.3390/electronics10172169

Automatic Multilingual Stopwords Identification from Very Small Corpora

Abstract: Tools for Natural Language Processing rely on linguistic resources that are language-specific. Because such resources are complex to build, many languages lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons that lack a wide literature. This paper focuses on stopwords, i.e., terms in a text which do not contribute to conveying its topic or content. It provide…

Citations: cited by 6 publications (2 citation statements)
References: 28 publications
“…The resources learned by BLA-BLA may be used as returned by the system, and/or be taken as a basis for further manual refinements. It currently includes several techniques that allow us to learn in a fully automatic way linguistic resources for language identification [26], stopword removal [27], term normalization [28], syntax checking [29] and concept taxonomies [30]. Whenever more texts become available for the language, it is easy to run BLA-BLA again and obtain updated resources.…”
Section: Bla-bla and Connektion
Citation type: mentioning
confidence: 99%
“…For the text processing tasks, Jurish et al [16] proposed a Hidden Markov Model-based approach to segment automatically text documents into tokens and sentences. Ferilli [17] presented a term-document frequency approach that automatically detects stopwords from a small amount of the corpora and stated that it was the most effective approach which outperformed the classic term frequency [18,2,19] and the normalized inverse document frequency of Lo et al [20]. Trishala and Mamatha [21] proposed a rule-based Kannada stemmer relying on an unsupervised approach using k-means algorithm, and Thangarasu and Inbarani [22] presented an analogy removal stemmer that automatically stem Tamil words from the text corpora.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
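
The citation statement above refers to a "term-document frequency" approach for detecting stopwords from a small corpus. As a rough illustration only, the sketch below scores each term by its total frequency multiplied by the number of documents that contain it, then returns the highest-scoring terms as stopword candidates. The scoring formula, the whitespace tokenizer, the function name `rank_stopword_candidates`, and the toy documents are all assumptions made for this example; they are not taken from the cited paper.

```python
from collections import Counter


def rank_stopword_candidates(documents, top_k=10):
    """Rank terms by (total term frequency) * (document frequency).

    This score is an assumed, simplified reading of a "term-document
    frequency" measure, used here for illustration only.
    """
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for text in documents:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    scores = {term: term_freq[term] * doc_freq[term] for term in term_freq}
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


# Hypothetical toy corpus, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang on the roof",
]
print(rank_stopword_candidates(docs, top_k=3))
```

On a corpus this small, content words such as "cat" can score as highly as function words, so a real use would need more text and a principled cutoff.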