Full-text search engines are important tools for information retrieval. Term proximity is an important factor in relevance score measurement. In a proximity full-text search, we assume that a relevant document contains query terms near each other, especially if the query terms are frequently occurring words. A methodology for high-performance full-text query execution is discussed. We build additional indexes to achieve better efficiency. For a word that occurs in the text, we include in the indexes some information about nearby words. What types of additional indexes do we use? How do we use them? These questions are discussed in this work. We present the results of experiments showing that the average time of search query execution is 44-45 times less than that required when using ordinary inverted indexes.
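The baseline mentioned above, an ordinary inverted index, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_inverted_index` and `proximity_search`, the whitespace tokenization, and the rule "every other query word must lie within `max_distance` positions of some occurrence of the first query word" are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index


def proximity_search(index, query_words, max_distance):
    """Return ids of documents in which every query word occurs within
    max_distance positions of some occurrence of the first query word."""
    first, rest = query_words[0], query_words[1:]
    # Group the positions of the remaining words by document.
    by_doc = {w: defaultdict(list) for w in rest}
    for w in rest:
        for doc_id, pos in index.get(w, []):
            by_doc[w][doc_id].append(pos)
    hits = set()
    # Every posting of the first word is visited, so the cost grows with
    # the number of occurrences of the queried words.
    for doc_id, pos in index.get(first, []):
        if all(
            any(abs(q - pos) <= max_distance for q in by_doc[w].get(doc_id, []))
            for w in rest
        ):
            hits.add(doc_id)
    return sorted(hits)


docs = ["to be or not to be", "be quiet and listen to the long story"]
index = build_inverted_index(docs)
near = proximity_search(index, ["to", "be"], 1)
far = proximity_search(index, ["to", "be"], 5)
```

Because every posting of a frequently occurring word has to be read, queries made of such words are the slow case that the additional indexes are meant to address.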
Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For each word in a text, we use additional indexes to store information about nearby words that are at distances from the given word of less than or equal to the MaxDistance parameter. We showed that additional indexes with three-component keys can be used to improve the average query execution time by up to 94.7 times if the queries consist of high-frequency words. In this paper, we present a new search algorithm with even more performance gains. We consider several strategies for selecting multi-component key indexes for a specific query and compare these strategies with the optimal strategy. We also present the results of search experiments, which show that three-component key indexes enable much faster searches in comparison with two-component key indexes.
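To make the idea of a three-component key concrete, here is a simplified sketch. It assumes a key of the form (f, s, t), where s and t are two words that both occur within `max_distance` positions of an occurrence of f, and a posting that stores the document id, the position of f, and the two relative offsets. The function name, the posting layout, and the absence of any restriction to frequently occurring words are assumptions for illustration only; the index described in the abstract may normalize and filter keys differently.

```python
from collections import defaultdict
from itertools import combinations


def build_three_component_index(docs, max_distance):
    """Map (f, s, t) keys to postings (doc_id, pos_f, offset_s, offset_t)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for p, f in enumerate(words):
            lo = max(0, p - max_distance)
            hi = min(len(words), p + max_distance + 1)
            # Positions near p, excluding p itself.
            neighbours = [i for i in range(lo, hi) if i != p]
            # Every pair of nearby positions yields one record for word f.
            for i, j in combinations(neighbours, 2):
                key = (f, words[i], words[j])
                index[key].append((doc_id, p, i - p, j - p))
    return index


docs = ["who are you who"]
idx = build_three_component_index(docs, max_distance=2)
postings = idx[("are", "who", "you")]
small = build_three_component_index(docs, max_distance=1)
```

Under these assumptions, a three-word query is answered by reading the postings of a single key instead of intersecting three long posting lists, at the price of a much larger index.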
In this paper, proximity full-text searches in large text arrays are considered. A search query consists of several words. The search result is a list of documents containing these words. In a modern search system, documents that contain search query words that are near each other are more relevant than documents that do not share this trait. To solve this task, for each word in each indexed document, we need to store a record in the index. In this case, the query search time is proportional to the number of occurrences of the queried words in the indexed documents. Consequently, it is common for search systems to evaluate queries that contain frequently occurring words much more slowly than queries that contain less frequently occurring, ordinary words. For each word in the text, we use additional indexes to store information about nearby words at distances from the given word of less than or equal to MaxDistance, which is a parameter. This parameter can take a value of 5, 7, or even more. Three-component key indexes can be created for faster query execution. Previously, we presented the results of experiments showing that when queries contain very frequently occurring words, the average time of the query execution with three-component key indexes is 94.7 times less than that required when using ordinary inverted indexes. In the current work, we describe a new three-component key index building algorithm and demonstrate the correctness of the algorithm. We present the results of index-building experiments and show how they depend on the value of MaxDistance. In this paper, we continue our research [1]. In the development of modern methods of full-text search, documents that contain queried words near each other are considered more important and relevant [1][2][3][4]. The importance of taking proximity information into account in the calculation of relevance increases for larger text collections [3].
At the same time, we need to guarantee that the search time is limited by reasonable boundaries. However, for large text collections, the probability of performance problems related to the search time increases. Inverted indexes are used for the implementation of the full-text search [5][6][7][8]. To take into account the distance between words in the text, we need to store in the index information about every occurrence of every word of every indexed text. Words occur in texts with different frequencies. A typical word frequency distribution in texts [9] (Zipf's law) is presented in Fig. 1. The horizontal axis represents words, ordered from frequently occurring words on the left to infrequently occurring words on the right. On the vertical axis, we plot the total number of occurrences of each word in the texts. 1) A.ID < B.ID or 2) A.ID = B.ID and A.P < B.P. Among the performance improvement methods, the following methods can be considered: 1) Early-termination methods [13,14] are based on a special sorting of the postings in the index, in order of decreasing posting relevance. At some poin...
Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.
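One reason the choice of MaxDistance matters is index size. The sketch below is an idealized count, not a measurement from the paper: it assumes that one (word, nearby word) record is stored for every pair of positions at most `max_distance` apart, with no restriction to frequently occurring words. The function name `count_pair_records` is hypothetical.

```python
def count_pair_records(num_words, max_distance):
    """Count (word, nearby word) records for a document of num_words words,
    assuming one record per ordered pair of positions within max_distance."""
    total = 0
    for p in range(num_words):
        lo = max(0, p - max_distance)
        hi = min(num_words - 1, p + max_distance)
        # The window [lo, hi] holds hi - lo + 1 positions; exclude p itself.
        total += hi - lo
    return total


records_5 = count_pair_records(1000, 5)
records_7 = count_pair_records(1000, 7)
```

Under this assumption the record count grows roughly as 2 × MaxDistance per word occurrence, so a larger MaxDistance widens the set of matches the index can answer while increasing its size.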