Partial collection replication versus caching for information retrieval systems

Lu, Zhixing; McKinley, Kathryn S.

doi:10.1145/345508.345591

Cited by 22 publications

(13 citation statements)

References 23 publications


“…Several articles [2], [5], [12] analyze the performance of a distributed IR system using collections of different sizes and different system architectures. Cahoon and McKinley in [3] describe the result of simulated experiments on the distributed INQUERY architecture.…”

Section: Related Work (mentioning)

confidence: 99%


Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Cacheda, Plachouras, Ounis (2004). Lecture Notes in Computer Science.


Abstract. We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.



“…This is due to the round robin distribution policy used in the brokers, as it can lead to some small periods of inactivity at certain replicas. In future works, some other distribution policies can be analysed in order to improve the throughput up to the optimal theoretical value, similar to the one used in [12].…”

Section: Replicated System (mentioning)

confidence: 99%

Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Cacheda, Plachouras, Ounis (2004). Lecture Notes in Computer Science.


“…Our approach is that beacons remember the results of previous user queries, and use these results to guide future queries. Unlike previous caching schemes (such as [27]), the InfoBeacons cache is not used to answer queries but instead to direct queries to the sources themselves. We introduce a function, called ProbResults, that ranks sources for a given query based on past results stored in the beacon's cache.…”

Section: Introduction (mentioning)

confidence: 99%

Guiding Queries to Information Sources with InfoBeacons

Cooper (2004). Middleware 2004.


Abstract. The Internet provides a wealth of useful information in a vast number of dynamic information sources, but it is difficult to determine which sources are useful for a given query. Most existing techniques either require explicit source cooperation (for example, by exporting data summaries), or build a relatively static source characterization (for example, by assigning a topic to the source). We present a system, called InfoBeacons, that takes a different approach: data and sources are left "as is," and a peer-to-peer network of beacons uses past query results to "guide" queries to sources, who do the actual query processing. This approach has several advantages, including requiring minimal changes to sources, tolerance of dynamism and heterogeneity, and the ability to scale to large numbers of sources. We present the architecture of the system, and discuss the advantages of our design. We then focus on how a beacon can choose good sources for a query despite the loose coupling of beacons to sources. Beacons cache responses to previous queries and adapt the cache to changes at the source. The cache is then used to select good sources for future queries. We discuss results from a detailed experimental study using our beacon prototype which demonstrates that our "loosely coupled" approach is effective; a beacon only has to contact sixty percent or less of the sources contacted by existing, tightly coupled approaches, while providing results of equivalent or better relevance to queries.


“…The base sub-collection of 8.5 million documents has been distributed over N query servers using a switched network and three brokers, where N = 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512. In Table 1, the column Configuration describes the query servers assigned to each topic.…”

Section: Experimental Setting (mentioning)

confidence: 99%

“…Frieder and Siegelmann [9] studied the organisation of the data to improve the performance of parallel IR systems using multiprocessor computers. Lu and McKinley [16] analysed the effects of partial replication to improve the performance in a collection of 1TB. Moffat, Webber, Zobel and Baeza-Yates [18] presented a replication technique for a pipelined term distributed system, which significantly improves the throughput over a basic term distributed system.…”

Section: Introduction (mentioning)

confidence: 99%

Performance Comparison of Clustered and Replicated Information Retrieval Systems

Cacheda, Carneiro, Plachouras, et al. Lecture Notes in Computer Science.


Abstract. The amount of information available over the Internet is increasing daily as well as the importance and magnitude of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which lead to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, both in terms of throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance obtained for a clustered system does not improve the performance obtained by the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the bottleneck in the network, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of the changes over time in the query topics when a distributed clustered system is used. On the contrary, the performance of a distributed replicated system is query independent.
