2018
DOI: 10.1093/llc/fqy053

Generation, implementation, and appraisal of an N-gram-based stemming algorithm

Abstract: A language-independent stemmer has long been sought. Single N-gram tokenization works well; however, it often generates stems that start with intermediate characters rather than initial ones. We present a novel technique that takes N-gram stemming one step further and compare our method with an established algorithm in the field, Porter's Stemmer. Results indicate that our N-gram stemmer is not inferior to Porter's linguistic stemmer.
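
To illustrate the issue the abstract raises, here is a minimal sketch of plain character N-gram tokenization. The function name `char_ngrams` and the example word are illustrative assumptions, not taken from the paper. It shows that only one N-gram per word begins at the initial character; all the others begin mid-word.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


grams = char_ngrams("stemming", 4)
print(grams)  # ['stem', 'temm', 'emmi', 'mmin', 'ming']

# Only the first 4-gram starts at the word's initial character;
# the remaining ones start at intermediate characters.
initial, intermediate = grams[0], grams[1:]
```

A stemmer that picks freely among these tokens can therefore return a fragment such as `temm`, which is the behavior the abstract describes.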

Cited by 7 publications (5 citation statements)
References 14 publications

“…, ( Oard, Levow & Cabezas, 2001 ; Goldsmith, 2001 ; Paik et al, 2011 ). In the character n-gram based method, adjacent characters in a length of n from the words in a corpus are considered to have less frequency whereas the variants have higher frequencies ( McNamee & Mayfield, 2004 ; Ahmed & Nürnberger, 2009 ; Pande, Tamta & Dhami, 2018 ). Also, various studies on corpus-based stemming using co-occurrence analysis and machine learning techniques are presented ( Paik, Pal & Parui, 2011 ; Paik et al, 2013 ; Brychcín & Konopík, 2015 ).…”
Section: Related Work (mentioning)
confidence: 99%
“…Sadia et al [61] used an N-gram-based technique and tested on Bangla language. Pande et al [62] also used an N-gram technique to develop a stemmer and frequency of the N-gram to determine the stem's possibility. Dadashkarimi et al [63] proposed a statistical stemmer to extract the root from the inflectional and derivational forms of the word.…”
Section: B Statistical-based Approaches (mentioning)
confidence: 99%
“…Finally, the longest subsequence common to all its elements is returned as a stem. Pande et al [66] used 4-gram as an initial prediction for the stem. The given word is tokenized 4-gram, 5-gram, 6-gram up to word length.…”
Section: B Statistical-based Approaches (mentioning)
confidence: 99%
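
The citation statements above mention tokenizing a word into 4-grams, 5-grams, and so on up to the word length, and using N-gram frequency to judge how likely a candidate is to be the stem. The following is a rough sketch of that general idea only, not the algorithm from Pande et al. It assumes that candidates are restricted to word-initial N-grams (prefixes) and that the stem is the longest candidate shared by at least `min_support` vocabulary words. The names `prefix_candidates`, `pick_stem`, `min_n`, and `min_support`, and the selection rule itself, are hypothetical.

```python
from collections import Counter


def prefix_candidates(word: str, min_n: int = 4) -> list[str]:
    """Word-initial n-grams of length min_n .. len(word); the word itself if shorter."""
    if len(word) <= min_n:
        return [word]
    return [word[:n] for n in range(min_n, len(word) + 1)]


def pick_stem(word: str, vocabulary: list[str], min_n: int = 4, min_support: int = 2) -> str:
    # Count how many distinct vocabulary words share each candidate prefix.
    counts = Counter()
    for w in set(vocabulary):
        counts.update(prefix_candidates(w, min_n))

    candidates = prefix_candidates(word, min_n)
    best = candidates[0]  # the shortest candidate is the initial prediction
    # Hypothetical rule: keep the longest candidate with enough support.
    for candidate in candidates:
        if counts[candidate] >= min_support:
            best = candidate
    return best


vocab = ["connect", "connected", "connecting", "connection"]
print(pick_stem("connected", vocab))  # connect
```

With this toy vocabulary, `connecting` would map to `connecti` because `connection` shares that prefix, which shows why a purely frequency-based rule needs further refinement in practice.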
“…Most methods remove affixes but after the implementation of certain statistical procedures. In this group we can find the following text stemmers: N-grams [7] stemmer regardless of the language in which the approach of the string-similarity is used to convert the word inflation in its root. An N-gram is a set of consecutive characters of n in a word.…”
Section: Related Work (mentioning)
confidence: 99%
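
The last statement refers to a string-similarity approach built on N-grams. One common way to measure such similarity is the Dice coefficient over shared character bigrams. The sketch below assumes that measure; the citing text does not say which one it means, and the function names are illustrative.

```python
def bigrams(word: str) -> set[str]:
    """Set of distinct 2-character substrings of word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Words too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# 'statistics' has 7 distinct bigrams, 'statistical' has 8, and they share 6.
print(dice_similarity("statistics", "statistical"))  # 0.8
```

Words whose similarity exceeds a chosen threshold can then be grouped as variants of the same root, without any language-specific rules.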