Performance of inverted indices in shared-nothing distributed text document information retrieval systems

Tomasic, Anthony; García-Molina, Héctor

doi:10.1109/pdis.1993.253078

Cited by 67 publications

(83 citation statements)

References 12 publications

“…The regression analysis performed confirms that the quadratic model fits better the real distribution (R = 0.99770) versus the linear model representing the Zipf's law (R = 0.98122). The quadratic model is similar to Zipf's, although in previous works [15], it has proved to match the actual distribution better. Given the quadratic fit curve, the form of the probability distribution Z_1(w) is obtained from the quadratic model, divided by a normalisation constant [15].…”

Section: Document Model (mentioning)

confidence: 98%

“…The previous work for distributing the inverted index over a collection of servers is focused on the local and global inverted files strategies [13], [15], showing that the local inverted file is a more balanced strategy and a good query throughput could be achieved in most cases.…”

Section: Related Work (mentioning)

confidence: 99%

“…If we are considering the whole collection Documents = D, but in a distributed environment, Documents corresponds to the number of documents covered by each of the distributed indices. So, the number of documents of an inverted list for term t_i will be [15]:…”

Section: Query Model (mentioning)

confidence: 99%

“…The other option is to partition based on the index terms so that each query server stores inverted lists corresponding to only a subset of the index terms in the collection (called global inverted files in [13]). The study in [15] indicates that the local inverted file organization uses system resources effectively, provides good query throughput and is more resilient to failures.…”

Section: Introduction (mentioning)

confidence: 99%

Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Cacheda

Plachouras

Ounis

2004

Lecture Notes in Computer Science

Abstract. We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.

“…The work presented in [26] compares the impact of performance for queries processing, using two different organizations for the inverted lists. It proposes two basic options to classify the indexes: disk index and system index.…”

Section: Previous Work and Motivation (mentioning)

confidence: 99%

Improving Web Searches with Distributed Buckets Structures

Gil-Costa

Printista

2006

2006 Fourth Latin American Web Congress

Effect of Inverted Index Partitioning Schemes on Performance of Query Processing in Parallel Text Retrieval Systems

Cambazoğlu

Çatal

Aykanat

2006

Lecture Notes in Computer Science

Abstract. Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where index is completely stored in disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation.
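The two partitioning schemes named in the abstract above (by document, often called local inverted files, and by term, often called global inverted files) can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the method of any of the cited papers: the sample documents, the round-robin assignment, and every function name (`build_inverted_index`, `partition_by_document`, `partition_by_term`, `query_local`, `query_global`) are hypothetical and chosen for clarity.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def partition_by_document(docs, num_servers):
    """Document-based scheme: each server indexes only its own documents."""
    shards = [dict() for _ in range(num_servers)]
    for doc_id, text in docs.items():
        # Assumption: simple modulo assignment of documents to servers.
        shards[doc_id % num_servers][doc_id] = text
    return [build_inverted_index(shard) for shard in shards]


def partition_by_term(docs, num_servers):
    """Term-based scheme: each server holds complete lists for some terms."""
    full_index = build_inverted_index(docs)
    servers = [dict() for _ in range(num_servers)]
    # Assumption: round-robin assignment of alphabetically sorted terms.
    for i, term in enumerate(sorted(full_index)):
        servers[i % num_servers][term] = full_index[term]
    return servers


def query_local(servers, term):
    """Ask every server for its partial list, then merge the answers."""
    merged = []
    for server in servers:
        merged.extend(server.get(term, []))
    return sorted(merged)


def query_global(servers, term):
    """Only the single server that owns the term holds its full list."""
    for server in servers:
        if term in server:
            return server[term]
    return []


# Hypothetical toy collection, for illustration only.
DOCS = {
    0: "inverted index performance",
    1: "distributed index",
    2: "query performance",
    3: "distributed query servers",
}

local_servers = partition_by_document(DOCS, 2)
global_servers = partition_by_term(DOCS, 2)
```

In this sketch a single-term query under the document-based scheme touches every server and merges partial lists, whereas under the term-based scheme it is answered by one server holding the complete list. Both return the same document ids; the difference lies in how work and data are spread across servers.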
