2022
DOI: 10.20944/preprints202204.0234.v1
Preprint

Automatic Detection of Stop Words for Texts in the Uzbek Language

Abstract: Stop words are very important for information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in Uzbek-language texts. Because few methods exist for the automatic detection of stop words in Uzbek texts, we analyzed a newly prepared corpus. The Uzbek language belongs to the family of agglutinative languages. As with all agglutinative languages, the detection of stop words in Uzbek texts is a more complex proces…

Cited by 2 publications (3 citation statements)
References 3 publications
“…Transliteration between Cyrillic and Latin alphabets of Uzbek language has been done by Mansurovs [9], who used a data-driven approach, by aligning words and training a decision-tree classifier. Among some other NLP work that has been done on low-resource Uzbek language so far, there are a morphological analyzer [10], WordNet type synsets [11], Uzbek stopwords dataset [12], sentiment analysis and text classification [13,14,15], cross-lingual word-embeddings [16], as well as a pretrained Uzbek language model based on the BERT architecture [17].…”
Section: Related Work (mentioning)
confidence: 99%
“…Then, we applied stop words to remove low-level information words from our comments to focus on important information. The technique is based on [8] paper where it is a proposed algorithm of automatic detection of single word stop words collection using TF-IDF (term frequency-inverse document frequency). After that, each word is processed to lexicon-free stemming tool [7] algorithm for decreasing the word capacity because of prefixes and suffixes.…”
Section: Data Pre-processing (mentioning)
confidence: 99%
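
The statement above mentions detecting single-word stop words with TF-IDF. As a rough illustration of that general idea, and not the cited paper's actual algorithm, the sketch below scores each word by its TF-IDF averaged over all documents and treats low-scoring words as stop-word candidates. The function names, the whitespace tokenization, the averaging step, and the `threshold` value are all assumptions made for this example.

```python
import math
from collections import Counter


def mean_tfidf(docs):
    """Return each word's TF-IDF score averaged over all documents.

    Assumption: documents are tokenized by lower-casing and splitting
    on whitespace, which is far simpler than real Uzbek tokenization.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)           # relative term frequency
            idf = math.log(n_docs / df[word])  # 0 when the word is in every document
            totals[word] += tf * idf
    return {word: total / n_docs for word, total in totals.items()}


def detect_stop_words(docs, threshold=0.01):
    """Return words whose mean TF-IDF is at or below a hypothetical threshold."""
    scores = mean_tfidf(docs)
    return sorted(word for word, score in scores.items() if score <= threshold)


# Toy corpus: "va" ("and") occurs in every document, so its IDF is zero.
corpus = ["va kitob yaxshi", "va maktab katta", "va shahar go'zal"]
candidates = detect_stop_words(corpus)
```

On a large corpus, averaging over documents also pushes very rare words toward low scores, so a practical list would likely need an additional frequency check or manual review.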
“…The first steps are removing URLs, punctuation, and lower-casing. The second step is ignoring stopwords [8] from the dataset where it is based on accuracy evaluation after generating the list of stop words using the TF-IDF algorithm; Then, we applied the stemming algorithm [7,9] which is based on Uzbek words' endings' electronic dictionary that uses combinatorial approach inferring apply for part of speech of the Uzbek language: nouns, adjectives, numerals, verbs, participles, moods, voices. Advantages of using the algorithm are lexicon-free and its complexity that allows one operation (referring to the dictionary of endings of the language) to perform: segmentation of the word into suffixes; performs morphological analysis of the word.…”
Section: Introduction (mentioning)
confidence: 99%
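
The pre-processing steps named in the statement above (removing URLs and punctuation, lower-casing, then dropping stop words) can be sketched as follows. This is a minimal illustration under stated assumptions, not the citing authors' code: the `preprocess` function, the URL pattern, and the decision to keep apostrophes (used in Uzbek Latin letters such as o' and g') are hypothetical choices, and the stemming step is omitted because it depends on the dictionary of endings described in the cited work.

```python
import re
import string

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Assumption: keep apostrophes so that Uzbek Latin letters such as
# o' and g' are not split apart when punctuation is removed.
PUNCTUATION = string.punctuation.replace("'", "")
PUNCT_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def preprocess(text, stop_words):
    """Remove URLs and punctuation, lower-case, and drop stop words."""
    text = URL_PATTERN.sub(" ", text)      # step 1: strip URLs
    text = text.lower()                    # step 2: lower-case
    text = text.translate(PUNCT_TABLE)     # step 3: punctuation to spaces
    return [token for token in text.split() if token not in stop_words]


# Hypothetical stop-word set, for illustration only.
tokens = preprocess(
    "Bu kitob juda YAXSHI! https://example.com/a va",
    {"bu", "va", "juda"},
)
```

Stripping URLs before punctuation matters here: removing punctuation first would break a URL into fragments that the pattern no longer matches.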