This paper presents an integrated language model to improve document relevancy for text queries. Specifically, an integrated stemming-lemmatization (S-L) model was developed and its retrieval performance was compared at three document levels: the top 5, 10 and 15 documents. A prototype search engine was developed and fifteen queries were executed. The mean average precisions showed that the S-L model outperformed the baseline (i.e. no language processing), the stemming model and the lemmatization model at all three document levels. These results were supported by the precision histograms, which showed that the integrated model improved document relevancy. It should be noted, however, that the precision differences between the models were insignificant. Overall, the study found that more relevant documents are retrieved when the language processing techniques of stemming and lemmatization are combined.

Keywords: Information retrieval, document relevancy, language modeling, stemming, lemmatization, mean average precision
INTRODUCTION

Worldwide use of the Internet has caused the amount of available information to grow, making it possible for users to retrieve large volumes of information. However, this growth also makes it difficult for users to find relevant information, so proper information retrieval techniques are needed. Information retrieval can be defined as "a problem-oriented discipline concerned with the problem of the effective and efficient transfer of desired information between human generator and human user" [1]. In short, information retrieval aims to provide users with the documents that satisfy their information need.

Many information retrieval algorithms have been proposed. Popular ones include the traditional Boolean model (based on binary decisions), the vector space model (which compares user queries with the documents in a collection and computes their similarities) and the probabilistic model (which uses probability theory to model the uncertainties involved in retrieving data), among others. Over the years, information retrieval has evolved to include text retrieval in different languages, giving rise to language models. A language model is concerned with identifying how likely a particular string is to occur in a specific language [2]. A popular technique used in language modeling is the N-gram model, which predicts the next word based on the previous N-1 words [3]. Other popular techniques include stemming and lemmatization.
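
To make the N-gram idea concrete, the following is a minimal sketch of a bigram model (N = 2) that estimates the next word from the single preceding word using raw counts. It is an illustration only, not the model developed in this study; the function names (train_bigram, next_word_probs, predict_next) and the toy corpus are assumptions introduced for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1
    return follow


def next_word_probs(follow, prev):
    """Maximum-likelihood estimate of P(next | prev); empty if prev is unseen."""
    counts = follow.get(prev)
    if not counts:
        return {}
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def predict_next(follow, prev):
    """Return the most probable next word, or None if prev was never seen."""
    probs = next_word_probs(follow, prev)
    return max(probs, key=probs.get) if probs else None


# Hypothetical toy corpus, used only for illustration.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(next_word_probs(model, "the"))
print(predict_next(model, "the"))
```

In this toy corpus, "the" is followed by "cat" twice and by "mat" once, so the sketch assigns "cat" the higher probability. A practical model would be trained on a much larger corpus and would typically add smoothing so that unseen word pairs do not receive zero probability.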