Because the volume of information on the web is growing exponentially, retrieving relevant documents for users in a reasonable time has become a cumbersome task. When user feedback is scarce or unavailable, content-based approaches to extracting and ranking relevant documents become critical, yet they face the problem of determining the semantic similarity between the text of user queries and the text of documents. Various sentence embedding models now exist that acquire deep semantic representations by training on a large corpus, with the goal of supporting transfer learning across a broad range of natural language processing tasks such as document similarity, text summarization, text classification, and sentiment analysis. This paper therefore presents a comparative analysis of six pre-trained sentence embedding techniques to identify the model best suited for document ranking in IR systems: SentenceBERT, Universal Sentence Encoder, InferSent, ELMo, XLNet, and Doc2Vec. All experiments use four standard datasets: CACM, CISI, ADI, and Medline. Universal Sentence Encoder and SentenceBERT are found to outperform the other techniques on all four datasets in terms of MAP, recall, F-measure, and NDCG. This comparative analysis offers a synthesis of existing work as a single point of entry for practitioners who want to use pre-trained sentence embedding models for document ranking and for scholars who wish to undertake work in a similar domain. The work can be extended in several directions; for example, researchers could combine these strategies to build a hybrid document ranking system or a query reformulation system in IR.
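As a rough illustration of the evaluation setup described above, the following minimal sketch ranks documents by cosine similarity between a query embedding and document embeddings, then scores the ranking with average precision (the per-query component of MAP) and NDCG. The vectors are hand-written toy values standing in for the output of a sentence embedding model, the function names are hypothetical, and binary relevance is assumed for NDCG; none of this is taken from the paper itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered by decreasing similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(ranked_ids, relevant):
    """Average precision for one query; MAP is the mean of this over queries."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked_ids, relevant, k):
    """NDCG@k under the assumption of binary relevance (gain is 0 or 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


# Toy embeddings (assumed values, not produced by any real model).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
query_vec = [1.0, 0.2]
relevant = {"d1", "d2"}

ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
print(average_precision(ranking, relevant))
print(ndcg_at_k(ranking, relevant, k=3))
```

In a real comparison, the toy vectors would be replaced by embeddings from each model under test, and the metrics would be averaged over all queries in a dataset.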
Trillions of pieces of data are uploaded to the internet every year, and extracting useful information from them using only a few keywords has become a major challenge. The field of Query Reformulation (QR) focuses on efficient retrieval of information to overcome this challenge. It is widely used in information retrieval (IR) and related fields such as search engines, multimedia IR, cross-language IR, and recommender systems. Query reformulation techniques incur extra computational cost. For this reason, their use is sometimes ruled out in internet search, which requires fast response times. However, owing to the recent success of machine learning and deep learning in Natural Language Processing (NLP), research in this area has boomed. A variety of term selection, term extraction, and query reformulation strategies based on recent technologies have been presented in the literature, which calls for a wide survey to focus research in this promising area. This paper presents recent QR approaches along with the datasets, techniques, and evaluation metrics they use, to help researchers understand the area and direct their efforts so that better solutions can be proposed in the future. The survey indicates that applying deep learning to IR systems for query reformulation is currently one of the most active topics in the field of IR.
Information retrieval (IR) is a field concerned with the structure, storage, analysis, and access of pieces of information. It has wide application in areas such as search engines, communication systems, information filtering, and medical search, and it helps in designing efficient and secure applications. The area has seen a surge of research over the last few years, driven by the unparalleled success of data mining, deep learning in computer vision, blockchain technology, and related fields. Core models, performance evaluation techniques, IR system applications, and the role of IR in blockchain technology have all been proposed in the literature, which calls for a broad survey to focus research in this promising area. This paper fills that gap by surveying state-of-the-art approaches, covering deep learning models, query expansion techniques, and the use of private information retrieval in blockchain technology. The survey covers different IR models, including the boolean model, vector space model, probabilistic model, language model, N-gram model, fuzzy model, Latent Semantic Indexing (LSI) model, Bayesian network, evolutionary algorithm based models, and machine learning based models. Applications of IR systems and the datasets used are also included to support further research in this field.
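To make one of the listed models concrete, the sketch below illustrates a basic vector space model: documents and the query are turned into TF-IDF weighted term vectors and ranked by cosine similarity. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three-document corpus are all assumptions chosen for this example rather than details from the survey.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    term_freq = Counter(tokens)
    return {term: term_freq[term] * idf[term] for term in term_freq if term in idf}


def cosine_sparse(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, corpus):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    docs = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    idf = build_idf(docs)
    query_vec = tfidf(tokenize(query), idf)
    scores = {
        doc_id: cosine_sparse(query_vec, tfidf(tokens, idf))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-document corpus for illustration only.
corpus = {
    "a": "boolean retrieval model",
    "b": "deep learning for image recognition",
    "c": "probabilistic retrieval model for text",
}
for doc_id, score in search("boolean retrieval", corpus):
    print(doc_id, round(score, 3))
```

Because this model only matches exact terms, a document with no terms in common with the query scores zero, which is the limitation that semantic and query expansion approaches aim to address.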