Generation, implementation, and appraisal of an N-gram-based stemming algorithm

Pande, Bhagwati Prasad; Tamta, Pawan; Dhami, H.S.

doi:10.1093/llc/fqy053

Cited by 7 publications

(5 citation statements)

References 14 publications

“…, (Oard, Levow & Cabezas, 2001; Goldsmith, 2001; Paik et al., 2011). In the character n-gram based method, adjacent characters in a length of n from the words in a corpus are considered to have less frequency whereas the variants have higher frequencies (McNamee & Mayfield, 2004; Ahmed & Nürnberger, 2009; Pande, Tamta & Dhami, 2018). Also, various studies on corpus-based stemming using co-occurrence analysis and machine learning techniques are presented (Paik, Pal & Parui, 2011; Paik et al., 2013; Brychcín & Konopík, 2015).…”

Section: Related Work (mentioning)

confidence: 99%

A selective approach to stemming for minimizing the risk of failure in information retrieval systems

Göksel, Arslan, Dinçer

2023

PeerJ Computer Science

Stemming is supposed to improve the average performance of an information retrieval system, but in practice, past experimental results show that this is not always the case. In this article, we propose a selective approach to stemming that decides whether stemming should be applied or not on a query basis. Our method aims at minimizing the risk of failure caused by stemming in retrieving semantically-related documents. The proposed work mainly contributes to the IR literature by proposing an application of selective stemming and a set of new features that derived from the term frequency distributions of the systems in selection. The method based on the approach leverages both some of the query performance predictors and the derived features and a machine learning technique. It is comprehensively evaluated using three rule-based stemmers and eight query sets corresponding to four document collections from the standard TREC and NTCIR datasets. The document collections, except for one, include Web documents ranging from 25 million to 733 million. The results of the experiments show that the method is capable of making accurate selections that increase the robustness of the system and minimize the risk of failure (i.e., per query performance losses) across queries. The results also show that the method attains a systematically higher average retrieval performance than the single systems for most query sets.

“…Sadia et al [61] used an N-gram-based technique and tested on Bangla language. Pande et al [62] also used an N-gram technique to develop a stemmer and frequency of the N-gram to determine the stem's possibility. Dadashkarimi et al [63] proposed a statistical stemmer to extract the root from the inflectional and derivational forms of the word.…”

Section: B Statistical-based Approaches (mentioning)

confidence: 99%

“…Finally, the longest subsequence common to all its elements is returned as a stem. Pande et al [66] used 4-gram as an initial prediction for the stem. The given word is tokenized 4-gram, 5-gram, 6-gram up to word length.…”

Section: B Statistical-based Approaches (mentioning)

confidence: 99%
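
The tokenization described in the statement above (4-grams, 5-grams, and so on up to the word length) can be illustrated with a minimal Python sketch. This is not the algorithm of Pande et al.; the function names `char_ngrams`, `ngrams_from`, and `prefix_counts` are hypothetical, and counting shared leading 4-grams is only an assumed stand-in for the frequency signal the statements mention.

```python
from __future__ import annotations

from collections import Counter


def char_ngrams(word: str, n: int) -> list[str]:
    """Return every run of n adjacent characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngrams_from(word: str, start: int = 4) -> dict[int, list[str]]:
    """Tokenize word into n-grams for n = start up to len(word)."""
    return {n: char_ngrams(word, n) for n in range(start, len(word) + 1)}


def prefix_counts(words: list[str], n: int = 4) -> Counter:
    """Count how many words share each leading n-gram.

    Assumption for illustration only: a leading n-gram shared by many
    words is treated as a plausible stem candidate.
    """
    return Counter(w[:n] for w in words if len(w) >= n)
```

For example, `ngrams_from("stems")` yields the 4-grams `stem` and `tems` plus the single 5-gram `stems`, and `prefix_counts` applied to `connect`, `connected`, `connection`, and `cone` counts the leading 4-gram `conn` three times.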

An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems

Jabbar, Iqbal, Tamimy et al.

2023

IEEE Access

The exponential increase in textual unstructured digital data creates significant demand for advanced and smart stemming systems. As a preprocessing stage, stemming is applied in various research fields such as information retrieval (IR), domain vocabulary analysis, and feature reduction in many natural language processing (NLP). Text stemming (TS), an important step, can significantly improve performance in such systems. Text-stemming methods developed till now could be better in their results and can produce errors of different types leading to degraded performance of the applications in which these are used. This work presents a systematic study with an in-depth review of selected stemming works published from 1968 to 2023. The work presents a multidimensional review of studied stemming algorithms i.e., methodology, data source, performance, and evaluation methods. For this study, we have chosen different stemmers, which can be categorized as 1) linguistic knowledge-based, 2) statistical, 3) corpus-based, 4) context-sensitive, and 5) hybrid stemmers. The study shows that linguistic knowledge-based stemming techniques were widely used for highly inflected languages (such as Arabic, Hindi, and Urdu) and have reported higher accuracy than other techniques. We compare and analyze the performance of various state-of-the-art TS approaches, including their issues and challenges, which are summarized as research gaps. This work also analyzes different NLP applications utilizing stemming methods. At the end, we list the future work directions for interested researchers.

“…Most methods remove affixes but after the implementation of certain statistical procedures. In this group we can find the following text stemmers: N-grams [7] stemmer regardless of the language in which the approach of the string-similarity is used to convert the word inflation in its root. An N-gram is a set of consecutive characters of n in a word.…”

Section: Related Work (mentioning)

confidence: 99%
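
The statement above does not say which string-similarity measure is meant. As one possibility, the following sketch compares two words by the Dice coefficient over their sets of character bigrams, a common choice for n-gram word conflation. The function names `bigrams` and `dice` are hypothetical, and the choice of measure is an assumption rather than something taken from the cited works.

```python
def bigrams(word: str) -> set:
    """Return the set of distinct 2-character runs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Words shorter than 2 characters have no bigrams; fall back to equality.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

With this measure, `statistics` (7 distinct bigrams) and `statistical` (8 distinct bigrams) share 6 bigrams, giving 12/15 = 0.8. Words scoring above a chosen threshold could then be grouped as variants of one root; the threshold itself would have to be tuned per language and corpus.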

Improving a Lightweight Stemmer for Gujarati Language

Chandrakant, Patel

2016

IJIST

Stemming is a foundational step in text mining. It is used in several types of applications, such as Natural Language Processing (NLP), Information Retrieval (IR), and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for Gujarati has long been an active research area, because the language's rich morphology gives it a structure that is very different from, and more difficult than, that of other languages.
