Information specialists in enterprises regularly use distributed information retrieval (DIR) systems that query a large number of information retrieval (IR) systems, merge the retrieved results, and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, because different servers handle collections of different sizes and have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the DIR system has to decide which servers to query, how long to wait for responses, and which retrieved results to display based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may be in the form of access fees charged by IR servers or user's cost associated with waiting for the servers to respond. We formulate the broker's decision problem as a stochastic mixed-integer program and present analytical solutions for the problem. Using data gathered from FedStats-a system that queries IR engines of several U.S. federal agencies-we demonstrate that the technique can significantly increase the utility from DIR systems. Finally, simulations suggest that the technique can be applied to solve the broker's decision problem under more complex decision environments. Electronic copy available at: http://ssrn.com/abstract=926928
User-Centric Operational Decision-Making in Distributed Information RetrievalKartik Hosanagar +
AbstractInformation specialists in enterprises regularly use Distributed Information Retrieval (DIR) systems that query a large number of Information Retrieval (IR) systems, merge the retrieved results and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, since different servers handle collections of different sizes, have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the distributed IR system has to decide which servers to query, how long to wait for responses and which retrieved results to display based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may be in the form of access fees charged by IR servers or user's cost associated with waiting for the servers to respond. We formulate the broker's decision problem as a stochastic mixed integer program and present analytical results for the optimal query set and wait time. Using data gathered from Fedstats -a system that queries IR engines of several US federal agencies -we demonstrate that the technique can significantly increase the utility from DIR systems. Finally, we present a simulation-based optimization technique to solve the broker's decision problem under more complex decision environments. The technique is computationally efficient and can be used to ge...