A term-based inverted index partitioning model for efficient distributed query processing

Cambazoğlu, B. Barla; Kayaaslan, Enver; Jonassen, Simon; Aykanat, Cevdet

doi:10.1145/2516633.2516637

Cited by 30 publications

(20 citation statements)

References 30 publications


“…The difference is that, in our case, we need to send a smaller amount of data among processors, versus the partial results (in some cases, whole inverted lists) that must be sent in inverted indexes. Also, while load imbalance is a problem in the pipelined strategy [12,60], Multiplexed should be less sensitive to the query bias. The first stage of Multiplexed query processing is fully balanced because any processor can carry out the binary search (indeed, the same query can be started by a different processor each time it is raised).…”

Section: Global Multiplexed Suffix Array (mentioning)

confidence: 99%
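The binary search referred to in the statement above can be illustrated on a single machine. The following is a minimal sketch, not the Multiplexed strategy of the cited paper: it builds a naive suffix array and locates a pattern with two binary searches. The function names (`build_suffix_array`, `find_range`, `occurrences`) are hypothetical, and the quadratic-space construction is only suitable for small examples.

```python
from typing import List, Tuple


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (sorts full suffix strings); an assumption made for
    brevity, not an efficient or distributed construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text: str, sa: List[int], pattern: str) -> Tuple[int, int]:
    """Return the half-open interval [start, end) of suffix-array entries
    whose suffixes begin with `pattern`, using two binary searches."""
    m = len(pattern)

    # First search: leftmost entry whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second search: leftmost entry whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted text positions at which `pattern` occurs."""
    start, end = find_range(text, sa, pattern)
    return sorted(sa[start:end])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(occurrences(text, sa, "ana"))
```

Because the search only reads the text and the suffix array, it carries no per-query state, which is consistent with the statement's remark that any processor can start a given query.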

Distributed text search using suffix arrays

Arroyuelo, Bonacic, Gil-Costa et al. 2014

Parallel Computing


Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, where it is necessary to efficiently process intensive-traffic streams of on-line queries. This paper proposes strategies to enable such services by means of suffix arrays. We introduce techniques for deploying suffix arrays on clusters of distributed-memory processors and then study the processing of multiple queries on the distributed data structure. Even though the cost of individual search operations in sequential (non-distributed) suffix arrays is low in practice, the problem of processing multiple queries on distributed-memory systems, so that hardware resources are used efficiently, is relevant to services aimed at achieving high query throughput at low operational costs. Our theoretical and experimental performance studies show that our proposals are suitable solutions for building efficient and scalable on-line search services based on suffix arrays.

Introduction

In the last decade, the design of efficient data structures and algorithms for textual databases and related applications has received a great deal of attention, due to the rapid growth of the amount of text data available from different sources. Typical applications support text searches over big text collections in a client-server fashion, where the user queries are answered by a dedicated server [15]. The server efficiency, in terms of running time, is of paramount importance in cases where the services demanded by clients generate a heavy work load. A feasible way to overcome the limitations of sequential computers is to resort to the use of several computers, or processors, which work together to serve the ever increasing client demands [19]. One such approach to efficient parallelization is to distribute the data onto the processors, in such a way that it becomes feasible to exploit locality via parallel processing of user requests, each on a subset of the data. As opposed to shared-memory models, this distributed-memory model provides the benefit of better ...

* Corresponding author. Address: Av. España 1680, Valparaíso, Chile. Phone: +56 2 432 6722. Fax: +56 2 432 6702.

... in distributed memory systems, and describes strategies to reduce the inter-processor communication and to improve the load balance at search time.

Indexed Text Searching

The advent of powerful processors and cheap storage has enabled alternative models for information retrieval, other than the traditional one of a collection of documents indexed by a fixed set of keywords. One is the full text model, in which the user expresses its information need via words, phrases or patterns to be matched for, and the information system retrieves those documents containing the user-specified pa...


“…Herein, we present several skipping optimizations and a new term assignment strategy. In contrast to the previously presented assignment optimizations [2,8,15], our strategy does not try to assign co-occurring terms to the same node or to do load balancing, but rather to maximize the pruning efficiency. Additionally, it opens a possibility for dynamic load balancing with low repartitioning overhead and hybrid query processing.…”

Section: Related Work (mentioning)

confidence: 99%

Improving the Performance of Pipelined Query Processing with Skipping

Jonassen, Bratsberg 2012

Web Information Systems Engineering - WISE 2012

Self Cite


Abstract. Web search engines need to provide high throughput and short query latency. Recent results show that pipelined query processing over a term-wise partitioned inverted index may have superior throughput. However, the query processing latency and scalability with respect to the collections size are the main challenges associated with this method. In this paper, we evaluate the effect of inverted index skipping on the performance of pipelined query processing. Further, we introduce a novel idea of using Max-Score pruning within pipelined query processing and a new term assignment heuristic, partitioning by Max-Score. Our current results indicate a significant improvement over the state-of-the-art approach and lead to several further optimizations, which include dynamic load balancing, intra-query concurrent processing and a hybrid combination between pipelined and non-pipelined execution.


“…In order to show the validity of the algorithms proposed in our paper, we investigate undirectional HP models proposed for index partitioning of parallel IR systems [8,28], where replication is beneficial and commonly used [37]. Although we address the HP models used in parallel IR, our replication scheme can be used for any domain in which the underlying problem can be modeled as an undirected hypergraph.…”

Section: Application (mentioning)

confidence: 99%

“…In this HP model, the nets have unit costs due to the infinite result cache capacity assumption. The weight of a vertex is set equal to either the number of postings in the inverted list of the term represented by that vertex [8] or the multiplication of term popularity and the corresponding posting list size [37]. The balance constraint in the former vertex weighting scheme corresponds to maintaining storage balance, whereas the balance constraint in the latter vertex weighting scheme corresponds to maintaining computational workload balance.…”

Section: Application (mentioning)

confidence: 99%
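The two vertex-weighting schemes described in the statement above can be made concrete with a small sketch. It assumes a toy index held in a dictionary, a hypothetical `popularity` map (query frequency per term), and a fixed term-to-part assignment; none of these names come from the cited papers, and no hypergraph partitioner is invoked. The sketch only shows how each weighting changes the per-part load that a balance constraint would bound.

```python
from typing import Dict, List


def storage_weights(postings: Dict[str, List[int]]) -> Dict[str, int]:
    """Vertex weight = number of postings in the term's inverted list."""
    return {term: len(plist) for term, plist in postings.items()}


def workload_weights(postings: Dict[str, List[int]],
                     popularity: Dict[str, int]) -> Dict[str, int]:
    """Vertex weight = term popularity times posting list size."""
    return {term: popularity.get(term, 0) * len(plist)
            for term, plist in postings.items()}


def part_loads(weights: Dict[str, int],
               assignment: Dict[str, int],
               num_parts: int) -> List[int]:
    """Sum the vertex weights assigned to each part."""
    loads = [0] * num_parts
    for term, weight in weights.items():
        loads[assignment[term]] += weight
    return loads


def imbalance(loads: List[int]) -> float:
    """Relative excess of the heaviest part over the average load."""
    average = sum(loads) / len(loads)
    return max(loads) / average - 1.0


if __name__ == "__main__":
    # Toy data (assumed for illustration only).
    postings = {"a": [1, 2, 3, 4], "b": [1, 2], "c": [3, 4]}
    popularity = {"a": 1, "b": 5, "c": 1}
    assignment = {"a": 0, "b": 1, "c": 1}

    storage = part_loads(storage_weights(postings), assignment, 2)
    workload = part_loads(workload_weights(postings, popularity), assignment, 2)
    print(storage, imbalance(storage))
    print(workload, imbalance(workload))
```

In this toy case the same assignment is perfectly balanced in storage yet skewed in workload, which is the distinction the statement draws between the two balance constraints.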

“…Models and methods based on hypergraph partitioning (HP) have been successfully used for different objectives in a wide range of areas such as parallel scientific computing [4,11,15,44], very large scale integration (VLSI) circuit layout design [1,32], parallel information retrieval (IR) [8], parallel volume rendering [9], and database systems [12,13,40].…”

Section: Introduction (mentioning)

confidence: 99%


Replicated partitioning for undirected hypergraphs

Selvitopi, Türk, Aykanat 2012

Journal of Parallel and Distributed Computing

Self Cite


Abstract. Hypergraph partitioning (HP) and replication are diverse but powerful tools that are traditionally applied separately to minimize the costs of parallel and sequential systems that access related data or process related tasks. When combined together, these two techniques have the potential of achieving significant improvements in performance of many applications. In this study, we provide an approach involving a tool that simultaneously performs replication and partitioning of the vertices of an undirected hypergraph whose vertices represent data and nets represent task dependencies among these data. In this approach, we propose an iterative-improvement-based replicated bipartitioning heuristic, which is capable of move, replication, and unreplication of vertices. In order to utilize our replicated bipartitioning heuristic in a recursive bipartitioning framework, we also propose appropriate cut-net removal, cut-net splitting, and pin selection algorithms to correctly encapsulate the two most commonly used cutsize metrics. We embed our replicated bipartitioning scheme into the state-of-the-art multilevel HP tool PaToH to provide an effective and efficient replicated HP tool, rpPaToH. The performance of the techniques proposed and the tools developed is tested over the undirected hypergraphs that model the communication costs of parallel query processing in information retrieval systems. Our experimental analysis indicates that the proposed technique provides significant improvements in the quality of the partitions, especially under low replication ratios.
