Improving Unsupervised Stemming by using Partial Lemmatization Coupled with Data-based Heuristics for Hindi

Gupta, Deepa; Yadav, Rahul Kumar; Sajan, Nidhi

doi:10.5120/4625-6867

Cited by 9 publications

(4 citation statements)

References 7 publications


“…From Table 1, it can be noted that all the language processing models outperformed the baseline algorithm. This is expected as the baseline algorithm returns results based on the search query only, without taking any further processing into consideration such as stemming or lemmatization [20,29].…”

Section: Results (mentioning)

confidence: 99%

“…For instance, some studies found stemming used with clustering algorithms to be beneficial in English texts [26], and also other languages [27,28]. Gupta et al [29] combined stemming with partial lemmatization for Hindi language with results indicating significant improvements than other traditional approaches. Another study compared stemming and lemmatization in clustering Finnish text documents, with results indicating the use of lemmatization to be better than stemming [30].…”

Section: Lemmatization (mentioning)

confidence: 99%


Improving Document Relevancy Using Integrated Language Modeling Techniques

Balakrishnan

Humaidi

Lloyd-Yemoh

2016

MJCS


This paper presents an integrated language model to improve document relevancy for text queries. To be precise, an integrated stemming-lemmatization (S-L) model was developed and its retrieval performance was compared at three document levels, that is, at the top 5, 10 and 15. A prototype search engine was developed and fifteen queries were executed. The mean average precisions revealed the S-L model to outperform the baseline (i.e. no language processing), stemming and also the lemmatization models at all three levels of the documents. These results were also supported by the histogram precisions, which illustrated that the integrated model improves document relevancy. However, it should be noted that the precision differences between the various models were insignificant. Overall, the study found that when language processing techniques, that is, stemming and lemmatization, are combined, more relevant documents are retrieved.

Keywords: Information retrieval, document relevancy, language modeling, stemming, lemmatization, mean average precision

INTRODUCTION

The use of the internet all over the world has caused information size to increase, hence making it possible for large volumes of information to be retrieved by users. However, this phenomenon also makes it difficult for users to find relevant information; therefore, proper information retrieval techniques are needed. Information retrieval can be defined as "a problem-oriented discipline concerned with the problem of the effective and efficient transfer of desired information between human generator and human user" [1]. In short, information retrieval aims to provide users with those documents that will satisfy their information need.

Many information retrieval algorithms have been proposed, and some of the popular ones include the traditional Boolean model (i.e. based on binary decisions), the vector space model (i.e. compares user queries with documents found in collections and computes their similarities), and the probabilistic model (i.e. based on probability theory to model the uncertainties involved in retrieving data), among others. Over the years, information retrieval has evolved to include text retrieval in different languages, thus giving birth to language models. The language model is particularly concerned with identifying how likely it is for a particular string in a specific language to be repeated [2]. A popular technique used in the language model is the N-gram model, which predicts the next word based on the previous N-1 words [3]. Other popular techniques include stemming and lemmatization.
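The abstract above reports mean average precision at the top 5, 10 and 15 documents. A minimal sketch of that metric follows. It is not the authors' implementation: the function names are hypothetical, and it assumes average precision is normalized by the number of relevant documents found within the top k (other normalizations exist).

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Average precision over the top-k ranked documents for one query.

    Assumption: normalized by the number of relevant documents
    retrieved in the top k, not by the total number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(runs, k):
    """Mean of the per-query average precision at k.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Illustrative data only; document ids are made up.
runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"}),
    (["a", "b"], {"b"}),
]
for k in (5, 10, 15):
    print(k, mean_average_precision_at_k(runs, k))
```

Comparing a baseline, a stemming run and a lemmatization run would then amount to calling `mean_average_precision_at_k` once per run at each cutoff.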


“…The handcrafted rule based stemming approach is easy if developer has proper linguistic knowledge on the [other] hand the lemmatization rule can be easily produced without linguistic knowledge provided the given training data is correct [45]. Further, a comparison of lemmatization and stemming was performed in the information retrieval of documents using clustering and the result depicts that lemmatization gives best performance as compared to stemming [46].…”

Section: Different Preprocessing Techniques Provide Different Classif... (mentioning)

confidence: 99%

Improvisation in opinion mining using data preprocessing techniques based on consumer’s review

2023

IJATEE


“…Additionally, they also found that the performance of information retrieval was better when the maximum length of lemmas is used. In 2012, Gupta et al [12] combined stemming and partial lemmatization and tested their model on the Hindi language. Their model yielded significant improvements compared to the traditional approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Stemming and Lemmatization: A Comparison of Retrieval Performances

Balakrishnan

Lloyd-Yemoh

2014

LNSE

164


Abstract: The current study proposes to compare document retrieval precision performances based on language modeling techniques, particularly stemming and lemmatization. Stemming is a procedure to reduce all words with the same stem to a common form, whereas lemmatization removes inflectional endings and returns the base or dictionary form of a word. Comparisons were also made between these two techniques and a baseline ranking algorithm (i.e. with no language processing). A search engine was developed and the algorithms were tested on a test collection. Both mean average precisions and histograms indicate that stemming and lemmatization outperform the baseline algorithm. As for the language modeling techniques, lemmatization produced better precision than stemming; however, the differences are insignificant. Overall, the findings suggest that language modeling techniques improve document retrieval, with the lemmatization technique producing the best result.
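The distinction drawn in the abstract above between stemming and lemmatization can be illustrated with a toy sketch. Everything here is an assumption for illustration: the suffix list, the lemma dictionary and the function names are hypothetical, use English examples, and do not reproduce any of the cited systems.

```python
# Hypothetical suffix list for a naive suffix-stripping stemmer.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for w in ("studies", "ran", "better"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer can return a non-word such as `stud`, while the lemmatizer returns a dictionary form such as `study` but only for words its dictionary covers.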
