Ranking plays a key role in many applications, such as document retrieval, recommendation, question answering, and machine translation. In practice, a ranking function (or model) is used to determine the rank-order relations between objects with respect to a particular criterion. In this paper, a layered multipopulation genetic programming-based method, known as RankMGP, is proposed to learn ranking functions for document retrieval by combining various types of retrieval models into a single, highly effective one. RankMGP represents a potential solution (i.e., a ranking function) as an individual in a genetic programming population and aims to directly optimize information retrieval evaluation measures in the evolution process. Overall, RankMGP consists of a set of layers and a sequential workflow running through them. Within each layer, multiple populations evolve independently to generate a set of best individuals. When the evolution process is completed, a new training dataset is created from the best individuals and the layer's input training set, and the populations in the next layer evolve with this new dataset. In the final layer, the best individual is obtained as the output ranking function. The proposed method is evaluated on the LETOR datasets and is found to be superior to classical information retrieval models, such as Okapi BM25. It is also statistically competitive with state-of-the-art methods, including Ranking SVM, ListNet, AdaRank, and RankBoost.

Keywords: learning to rank for information retrieval, ranking function, supervised learning, layered multipopulation genetic programming, LAGEP, LETOR
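To make the layered workflow described above concrete, the following is a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the helper evolve_population is a hypothetical stand-in for one population's GP evolution (which in RankMGP evolves tree-shaped functions and optimizes an IR measure directly), and elite selection within a layer is elided.

    import random
    from typing import Callable, List, Sequence

    RankingFn = Callable[[Sequence[float]], float]  # feature vector -> relevance score

    def evolve_population(train: List[Sequence[float]],
                          labels: List[int]) -> RankingFn:
        # Hypothetical stand-in: a real GP population would evolve ranking
        # functions and select the fittest by an IR measure such as MAP.
        w = [random.random() for _ in range(len(train[0]))]
        return lambda x: sum(wi * xi for wi, xi in zip(w, x))

    def rank_mgp(train, labels, num_layers=3, pops_per_layer=4):
        data, layer_elites = train, []
        for _ in range(num_layers):
            # Multiple populations evolve independently within this layer.
            elites = [evolve_population(data, labels) for _ in range(pops_per_layer)]
            layer_elites.append(elites)
            # The elites' scores on the layer's input form the next training set.
            data = [[f(x) for f in elites] for x in data]
        def ranking_function(x):
            # Pipe a raw feature vector through every layer; the best individual
            # of the final layer (here simply the first elite) yields the score.
            for elites in layer_elites[:-1]:
                x = [f(x) for f in elites]
            return layer_elites[-1][0](x)
        return ranking_function

For example, rank_mgp([[0.2, 0.5], [0.9, 0.1]], [0, 1]) returns a callable that scores any two-feature vector; in the actual method, each feature would be a retrieval-model score for a document-query pair.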
INTRODUCTION

One central problem of information retrieval (IR) is determining which documents are relevant to the user's information need and which are not (i.e., finding the potentially relevant documents) [3]. In practice, this is usually cast as a ranking problem: the goal is to define, according to the degree of relevance (or similarity) between each document and the user's query, a total order over documents that places relevant documents higher on the retrieved list than irrelevant ones.

Traditional IR models, including the Boolean model, the vector space model, and the probabilistic model, are built on the bag-of-words representation. In short, a document is decomposed into keywords (i.e., index terms), and a ranking function (or retrieval function) is defined to associate a relevance degree with each document-query pair [3]; a classical instance, Okapi BM25, is sketched below. These models are typically realized in an unsupervised manner, so the parameters of the underlying ranking functions are usually tuned empirically. However, manual tuning incurs high costs and can lead to over-fitting, especially when the functions are carefully tuned to fit particular needs [30].

Nowadays, as increasingly many IR results are accompanied by relevance judgments (e.g., query and clickthrough logs), supervised learning-based methods, referred to as "learning to rank" ...
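As a concrete example of such a classical ranking function, here is a minimal sketch of the Okapi BM25 scorer mentioned above. The parameter defaults k1 = 1.2 and b = 0.75 are common choices, and the two-document corpus is purely illustrative.

    import math
    from collections import Counter

    def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
        # Relevance score of `doc` for `query`, both given as token lists;
        # `corpus` is the full collection of tokenized documents.
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc)
        score = 0.0
        for term in query:
            n_q = sum(1 for d in corpus if term in d)          # document frequency
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed IDF
            f = tf[term]                                       # term frequency in doc
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    corpus = [["genetic", "programming"], ["document", "retrieval", "ranking"]]
    print(bm25_score(["ranking"], corpus[1], corpus))

Note how the free parameters k1 and b must be set by hand; this is precisely the kind of empirical tuning that motivates the learning-based approach discussed next.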