Federated search and distributed information retrieval systems provide a single user interface for searching multiple full-text search engines. They have been an active area of research for more than a decade, but in spite of their success as a research topic, they are still rare in operational environments. This article discusses a prototype federated search system developed for the U.S. government's FedStats Web portal, and the issues addressed in adapting research solutions to this operational environment. A series of experiments explores how well prior research results, parameter settings, and heuristics apply in the FedStats environment. The article concludes with a set of lessons learned from this technology transfer effort, including observations about search engine quality in the "real world."
Introduction
The FedStats Web 1 site is a portal that provides "one-stop shopping" to statistical information published by more than 100 federal agencies so that citizens, businesses, and government employees can find what they need without knowing where it is stored or which agency publishes it. Topic-specific Web portals such as FedStats have become a crucial component of Web search in recent years because the proliferation of Web sites and search engines can make it difficult for people to know where to search for needed information. General-purpose search engines, such as Google 2 and AltaVista, 3 can be helpful, but their generality is sometimes more of an obstacle than an aid. For example, submitting the query "unemployment statistics" to Google returns a mix of federal, state, and foreign government information in the top 10 documents. Restricting the search to the ".gov" domain effects only a small improvement. The same query at the FedStats Web site returns information from 12 federal government agencies.
Portals such as FedStats are usually based on one of two software architectures. The most common approach is to download documents from other Web sites, integrate them into a single large text database, and index it with a single search engine. General-purpose search engines, such as Google, use this approach; we call it the single-database approach in this report. The second approach is to link the search engines at each Web site into a federated search system. This approach is used within some large commercial search services (e.g.,
Federated Search (Distributed Information Retrieval)
Federated search systems 5 provide a single user interface to multiple search engines. The person using the federated search system may know (probably knows) that the