2016
DOI: 10.22452/mjcs.vol29no1.4

Improving Document Relevancy Using Integrated Language Modeling Techniques

Abstract: This paper presents an integrated language model to improve document relevancy for text-queries. To be precise, an integrated stemming-lemmatization (S-L) model was developed and its retrieval performance was compared at three document levels, that is, at top 5, 10 and 15. A prototype search engine was developed and fifteen queries were executed. The mean average precisions revealed the S-L model to outperform the baseline (i.e. no language processing), stemming and also the lemmatization models at all three l…
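The abstract compares models by mean average precision at the top 5, 10 and 15 documents. As a rough illustration of that kind of measure, the sketch below computes average precision at a cutoff k and averages it over queries. It is not the paper's evaluation code: the function names, the document ids and the choice to normalise by min(k, number of relevant documents) are all assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top-k results of one query.

    Assumption: normalise by min(k, number of relevant documents);
    other normalisations exist and the paper's choice is not stated.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant hit.
            precision_sum += hits / rank
    denom = min(k, len(relevant))
    return precision_sum / denom if denom else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean of average_precision_at_k over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical rankings and relevance judgements for two queries.
runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"}),
    (["d9", "d8", "d7", "d6", "d5"], {"d8"}),
]
for cutoff in (5, 10, 15):
    print(cutoff, round(mean_average_precision(runs, cutoff), 4))
```

Running the same function at each cutoff for every model (baseline, stemming, lemmatization, S-L) would give the kind of per-level comparison the abstract describes.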

Cited by 14 publications
(5 citation statements)
References 21 publications
“…For European languages there are two types: Stemming, which is based on the reduction of words to a common stem by clipping off the unnecessary morphemes, and Lemmatisation, which is based on the clustering of words by morphemes guided by the knowledge by the computer program of the dictionary and morphology of the language for this association (Singh, Gupta, 2017:159). Although in principle one might think that lemmatisation could offer more reliable results, various studies such as those carried out by Kettunen, Kunttu and Järvelin (2005) and Balakrishnan, Humaidi and Lloyd-Yemoh (2016) point out that the results in different languages between both methods present insignificant differences. Anyways, stemming, the most used method in computer programs for Latin languages, can present two types of errors in certain cases: false positives and false negatives, respectively, with words that have an almost equal morphology and different meaning or polysemic words (Hajeer, Ismail, Badr, Tolba, 2017).…”
Section: Objectives and Methods (mentioning)
confidence: 99%
“…After that, all the tokens were converted to lowercase form before applying the lemmatization technique. Lemmatization, in general, uses vocabulary and morphological analysis of words to remove inflectional endings and convert them to their dictionary form Balakrishnan et al. [46]. A stopwords list was applied to the lemmatized words, and then the length of each tweet was normalized using the L2 norm.…”
Section: Study Procedures (mentioning)
confidence: 99%
“…All the tokens were transformed to a lowercase form before applying the lemmatisation technique. Lemmatisation, in general, uses vocabulary and morphological analysis of word and removes inflectional endings to convert words to a dictionary form (Balakrishnan, Humaidi, & Lloyd-Yemoh, 2016). The stop-words method was applied on the lemmatised words.…”
Section: Data Pre-processing and Text Clustering (mentioning)
confidence: 99%
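
The preprocessing order described in the citation statements above (lowercase the tokens, lemmatize, apply a stop-word list to the lemmas, then L2-normalise) can be sketched as follows. This is a toy illustration only: the lemma table and stop-word list are small hypothetical stand-ins for the dictionary and morphological analysis a real lemmatizer uses, and the function names are invented for this example rather than taken from any of the cited studies.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical toy resources; a real system would use a full
# dictionary-backed lemmatizer and a complete stop-word list.
LEMMA_TABLE = {"running": "run", "ran": "run", "studies": "study", "tweets": "tweet"}
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "was", "were"}


def preprocess(text: str) -> List[str]:
    """Lowercase, lemmatize by table lookup, then drop stop words."""
    tokens = text.lower().split()
    lemmas = [LEMMA_TABLE.get(token, token) for token in tokens]
    # The stop-word list is applied to the lemmatized words.
    return [lemma for lemma in lemmas if lemma not in STOPWORDS]


def l2_normalize(tokens: List[str]) -> Dict[str, float]:
    """Term-count vector scaled to unit Euclidean (L2) length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


terms = preprocess("The studies were running")
vector = l2_normalize(terms)
print(terms)
print(vector)
```

Normalising after stop-word removal means the unit length reflects only the retained terms, so short and long texts become comparable on the same scale.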