In this study, the effects of query structure and various translation dictionary setups on the performance of cross-language information retrieval (CLIR) were tested. The document collection was a subset of the TREC collection, and TREC's health-related topics served as test requests. The test system was the INQUERY retrieval system. The performance of translated Finnish queries against English documents was compared to the performance of the original English queries against English documents. Four natural language query types and three query translation methods, using a general dictionary and a domain-specific (medical) dictionary, were studied. There was only a slight difference in performance between the original English queries and the best cross-language queries, i.e., structured queries combining medical dictionary and general dictionary translation. The queries were structured on the basis of the dictionary output.
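The idea of structuring a translated query on the basis of dictionary output can be illustrated with a minimal sketch. This is not the paper's implementation: the toy Finnish-English dictionary entries are hypothetical, and the `#sum`/`#syn` operator syntax is merely assumed to resemble INQUERY-style structured queries, where a source word's translation alternatives are grouped under a synonym operator.

```python
# Sketch of structured query construction from dictionary output.
# Assumptions: toy dictionary entries, INQUERY-like #sum/#syn syntax.
def structure_query(source_words, dictionary):
    """Group each source word's translation alternatives under a
    synonym operator, then combine the groups with a sum operator."""
    groups = []
    for word in source_words:
        # Keep untranslatable words as-is (a common CLIR fallback).
        translations = dictionary.get(word, [word])
        groups.append("#syn(" + " ".join(translations) + ")")
    return "#sum(" + " ".join(groups) + ")"

# Hypothetical Finnish-to-English dictionary entries for illustration.
toy_dict = {
    "syöpä": ["cancer", "carcinoma"],
    "hoito": ["treatment", "therapy", "care"],
}

print(structure_query(["syöpä", "hoito"], toy_dict))
# -> #sum(#syn(cancer carcinoma) #syn(treatment therapy care))
```

Grouping alternatives this way keeps a single ambiguous source word with many translations from dominating the query, since each group contributes as one concept rather than as many independent terms.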
Abstract. There is overwhelming evidence that real users of IR systems often prefer extremely short queries (one or two individual words) but try out several queries if needed. Such behavior is fundamentally different from the process modeled in traditional test collection-based IR evaluation, which uses more verbose queries and only one query per topic. In the present paper, we propose an extension to test collection-based evaluation. We utilize sequences of short queries based on empirically grounded but idealized session strategies. We employ TREC data and have test persons suggest search words, while simulating sessions based on the idealized strategies for repeatability and control. The experimental results show that, surprisingly, web-like very short queries (including one-word query sequences) typically lead to good enough results even in a TREC-type test collection. This finding explains the observed real user behavior: since a few very simple attempts normally lead to good enough results, there is no need to expend more effort. We conclude by discussing the consequences of our finding for IR evaluation.
This paper presents a morphological classification of languages from the IR perspective. Linguistic typology research has shown that the morphological complexity of any language of the world can be described by two variables: the index of synthesis and the index of fusion. These variables provide a theoretical basis for IR research handling morphological issues. A common theoretical framework is needed in particular because of the increasing significance of cross-language retrieval research and of CLIR systems processing different languages. The paper elaborates the linguistic morphological typology for the purposes of IR research and studies how the indices of synthesis and fusion could be used as practical tools in mono- and cross-lingual IR research. The need for semantic and syntactic typologies is discussed. The paper also reviews studies on the effects of morphology and stemming in IR across different languages.