We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralization of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative's (OAI) Protocol for Metadata Harvesting, the approach occupies the middle ground between content crawling and distributed retrieval. As in crawling, some data move toward the retrieval process, but they are statistics about the content rather than the content itself; this allows more efficient use of network resources and a wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision while promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralized retrieval without giving up cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies for deploying it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol that supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multimodel content-based retrieval of distributed file collections.
Introduction

Our interest is in content-based retrieval of widely dispersed and autonomously managed document sources.¹ This is the central problem of Distributed Information Retrieval (DIR), and over the past 10 years, it has been mainly approached by distributing the retrieval process along with the data: Queries have been "pushed" toward the content, and the results of their local execution have been centrally gathered and presented to the user (cf. Callan, 2000a).

Traditionally, distributed retrieval services have relied on simple client/server architectures in which brokers route queries submitted by local or remote clients toward a number of mutually autonomous and potentially uncooperative retrieval engines. Figure 1 shows how client/server distributed retrieval works. A Search Broker B interfaces with Clients C and dispatches their Queries Q to a number of autonomous search engines S_1, S_2, ..., S_n, each of which executes Q against an Index FT_i of some Content C_i before returning Results R_i to B, which merges them and relays them to C. Optionally, B optimizes query distribution by selecting a subset of the engines based on previously gathered descriptions of their content. Based on summary descriptions of the content served by each engine, advanced techniques of source selection and data fusion have been produced to, respectively, minimize network

¹ In the absence of a well-established terminology, we use the term content-based to characterize retrieval processes defined over indices of essentially unstructured documents. Content-based retrieval lies at one...