This paper describes the design of the Gamma database machine and the techniques employed in its implementation. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, novel parallel algorithms based on hashing are used to implement the complex relational operators such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques it is possible to control the execution of very complex queries with minimal coordination, a necessity for configurations involving a very large number of processors.

In addition to describing the design of the Gamma software, a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is also presented. In addition to measuring the effect of relation size and indices on the response time for selection, join, aggregation, and update queries, we also analyze the performance of Gamma relative to the number of processors employed when the sizes of the input relations are kept constant (speedup) and when the sizes of the input relations are increased proportionally to the number of processors (scaleup). The speedup results obtained for both selection and join queries are linear; thus, doubling the number of processors halves the response time for a query. The scaleup results are also quite encouraging: a nearly constant response time can be maintained for both selection and join queries as the workload is increased by adding a proportional number of processors and disks.
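To make the partitioning idea concrete, here is a minimal Python sketch of a hash-partitioned parallel join, using toy in-memory lists in place of disk-resident partitions and a sequential loop in place of truly concurrent processors; the names (NUM_NODES, hash_partition, local_hash_join) are invented for this illustration and are not Gamma's actual interfaces.

```python
# A minimal sketch of Gamma-style declustering plus a partitioned hash
# join; relations are lists of dicts, and names are illustrative.

NUM_NODES = 4  # stands in for Gamma's processor/disk pairs

def hash_partition(relation, key, num_nodes=NUM_NODES):
    """Declustering: route each tuple to a node by hashing its key."""
    partitions = [[] for _ in range(num_nodes)]
    for tup in relation:
        partitions[hash(tup[key]) % num_nodes].append(tup)
    return partitions

def local_hash_join(r_part, s_part, key):
    """Each node joins only its own partitions: build on R, probe with S."""
    table = {}
    for r in r_part:
        table.setdefault(r[key], []).append(r)
    return [{**r, **s} for s in s_part for r in table.get(s[key], [])]

def parallel_hash_join(R, S, key):
    """Matching partitions join independently of one another."""
    r_parts = hash_partition(R, key)
    s_parts = hash_partition(S, key)
    out = []
    for r_part, s_part in zip(r_parts, s_parts):  # conceptually in parallel
        out.extend(local_hash_join(r_part, s_part, key))
    return out

R = [{"id": i, "a": i * 10} for i in range(8)]
S = [{"id": i % 4, "b": i} for i in range(6)]
print(parallel_hash_join(R, S, "id"))
```

Because matching partitions can be joined without talking to each other, the only coordination needed is the initial routing of tuples, which is the property the abstract highlights.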
We present a scalable distributed data structure called LH*. LH* generalizes Linear Hashing (LH) to distributed RAM and disk files. An LH* file can be created from records with primary keys, or objects with OIDs, provided by any number of distributed and autonomous clients. It does not require a central directory, and grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The number of messages per random insertion is one in general, and three in the worst case, regardless of the file size. The number of messages per key search is two in general, and four in the worst case. The file supports parallel operations, e.g., hash joins and scans. Performing a parallel operation on a file of M buckets costs at most 2M + 1 messages, and between 1 and O(log₂ M) rounds of messages.

We first describe the basic LH* scheme, where a coordinator site manages bucket splits and splits a bucket every time a collision occurs. We show that the average load factor of an LH* file is 65-70% regardless of file size and bucket capacity. We then enhance the scheme with load control, performed at no additional message cost. The average load factor then increases to 80-95%. These values are about the same as for LH, but the load factor for LH* varies more.

We next define LH* schemes without a coordinator. We show that insert and search costs are the same as for the basic scheme. The splitting cost decreases on average, but becomes more variable, as cascading splits are needed to prevent file overload. Next, we briefly describe two variants of the splitting policy, using parallel splits and presplitting, that should enhance performance for high-performance applications.

Altogether, we show that LH* files can efficiently scale to files that are orders of magnitude larger than single-site files. LH* files that reside in main memory may also be much faster than single-site disk files. Finally, LH* files can be more efficient than any distributed file with a centralized directory, or a static parallel or distributed hash file.
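For reference, the following is a minimal single-site Python sketch of the Linear Hashing addressing and splitting rule that LH* distributes across servers, assuming integer keys and a small bucket capacity; the class and its fields are invented for this example, and the real scheme's client images, server forwarding, and message protocol are not modeled.

```python
# A minimal sketch of Linear Hashing (LH): address with h_i(c) = c mod 2**i,
# correct split buckets with h_{i+1}, and split one bucket per collision.

class LHFile:
    def __init__(self, capacity=4):
        self.i = 0           # file level: h_i(c) = c mod 2**i
        self.n = 0           # split pointer: next bucket to split
        self.b = capacity    # bucket capacity; overflow triggers a split
        self.buckets = [[]]  # the file starts with a single bucket 0

    def _addr(self, key):
        """LH addressing: h_i, corrected with h_{i+1} for split buckets."""
        a = key % (2 ** self.i)
        if a < self.n:                  # bucket a has already split
            a = key % (2 ** (self.i + 1))
        return a

    def insert(self, key):
        a = self._addr(key)
        self.buckets[a].append(key)
        if len(self.buckets[a]) > self.b:   # collision: split bucket n
            self._split()

    def _split(self):
        """Split one bucket at a time; in LH* the new bucket is a new server."""
        old = self.buckets[self.n]
        self.buckets.append([])
        new_addr = self.n + 2 ** self.i
        self.buckets[self.n] = [k for k in old
                                if k % (2 ** (self.i + 1)) == self.n]
        self.buckets[new_addr] = [k for k in old
                                  if k % (2 ** (self.i + 1)) == new_addr]
        self.n += 1
        if self.n == 2 ** self.i:       # every level-i bucket has split
            self.i, self.n = self.i + 1, 0

f = LHFile()
for k in range(40):
    f.insert(k)
print(f.i, f.n, [len(bkt) for bkt in f.buckets])
```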
Recently we have proposed an adaptive, random sampling algorithm for general query size estimation. In earlier work we analyzed the asymptotic efficiency and accuracy of the algorithm; in this paper we investigate its practicality as applied to selects and joins. First, we extend our previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy. Next, we provide "sanity bounds" to deal with queries for which the underlying data is extremely skewed or the query result is very small. Finally, we report on the performance of the estimation algorithm as implemented in a host language on a commercial relational system. The results are encouraging, even with this loose coupling between the estimation algorithm and the DBMS.
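As a rough illustration of the idea, the Python sketch below estimates a result size by sampling random partitions of the result until a size threshold is crossed, then scaling up; the threshold value, the sample cap, and the list-of-partition-sizes interface are invented for this example, whereas the paper derives bounds relating the amount of sampling to the desired accuracy.

```python
# A minimal sketch of adaptive random sampling for query size estimation,
# assuming the result is partitioned into n disjoint pieces (e.g., one per
# tuple of the outer relation) whose sizes we can sample.
import random

def adaptive_estimate(partition_sizes, threshold):
    """Sample partition sizes (with replacement) until their sum exceeds
    the threshold, then scale the sampled total up by n / m."""
    n = len(partition_sizes)
    total, m = 0, 0
    # The cap on m plays the role of a crude "sanity bound" for very
    # small results; the paper's actual bounds are more refined.
    while total <= threshold and m < 10 * n:
        total += random.choice(partition_sizes)
        m += 1
    return n * total / m

# Example: a skewed join where a few outer tuples match many inner ones.
sizes = [0] * 900 + [5] * 90 + [200] * 10
print(sum(sizes), round(adaptive_estimate(sizes, threshold=2000)))
```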
The join operator has been a cornerstone of relational database systems since their inception. As such, much time and effort have gone into making joins efficient. With the obvious trend towards multiprocessors, attention has focused on efficiently parallelizing the join operation. In this paper we analyze and compare four parallel join algorithms. Grace and Hybrid hash represent the class of hash-based join methods, Simple hash represents a looping algorithm with hashing, and our last algorithm is the more traditional sort-merge. The Gamma database machine serves as the host for the performance comparison. Gamma's shared-nothing architecture with commercially available components is becoming increasingly common, both in research and in industry.
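For intuition about two of the compared classes, here is a single-node Python sketch of a Grace-style partition-then-join next to a sort-merge join, using in-memory lists where the real algorithms use disk partitions and parallel split tables; the function names and partition count are illustrative only.

```python
# Two toy joins over lists of dicts: Grace-style hashing vs. sort-merge.

def grace_hash_join(R, S, key, num_parts=4):
    """Phase 1: hash-partition both inputs (to disk, in Grace).
    Phase 2: build and probe a hash table per matching partition pair."""
    r_parts = [[] for _ in range(num_parts)]
    s_parts = [[] for _ in range(num_parts)]
    for r in R:
        r_parts[hash(r[key]) % num_parts].append(r)
    for s in S:
        s_parts[hash(s[key]) % num_parts].append(s)
    out = []
    for rp, sp in zip(r_parts, s_parts):
        table = {}
        for r in rp:
            table.setdefault(r[key], []).append(r)
        out.extend((r, s) for s in sp for r in table.get(s[key], []))
    return out

def sort_merge_join(R, S, key):
    """Sort both inputs on the join key, then merge matching runs."""
    R = sorted(R, key=lambda t: t[key])
    S = sorted(S, key=lambda t: t[key])
    out, j = [], 0
    for r in R:
        while j < len(S) and S[j][key] < r[key]:
            j += 1
        k = j
        while k < len(S) and S[k][key] == r[key]:
            out.append((r, S[k]))
            k += 1
    return out

R = [{"id": i % 3, "a": i} for i in range(6)]
S = [{"id": i % 2, "b": i} for i in range(4)]
print(len(grace_hash_join(R, S, "id")), len(sort_merge_join(R, S, "id")))
```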
We consider the problem of external sorting in a shared-nothing multiprocessor. A critical step in the algorithms we consider is to determine the range of sort keys to be handled by each processor. We consider two techniques for determining these ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman, and probabilistic splitting, which uses sampling to estimate quantiles. We present analytic results showing that probabilistic splitting performs better than exact splitting. Finally, we present experimental results from an implementation of sorting via probabilistic splitting in the Gamma parallel database machine.
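A minimal Python sketch of probabilistic splitting under stated assumptions: each node contributes a random sample of its keys, pooled sample quantiles become the range splitters, and each node then routes keys by range. The sample size here is an arbitrary illustrative choice; the paper analyzes how large the sample must be for the ranges to be well balanced.

```python
# A toy single-process model of probabilistic splitting for parallel sort;
# lists of keys stand in for the data on each shared-nothing node.
import bisect
import random

def probabilistic_splitters(local_keys_per_node, num_nodes, sample_size=100):
    """Estimate range boundaries from a pooled random sample of keys."""
    sample = []
    for keys in local_keys_per_node:
        sample += random.sample(keys, min(sample_size, len(keys)))
    sample.sort()
    # num_nodes - 1 sample quantiles serve as the splitters.
    return [sample[(i * len(sample)) // num_nodes]
            for i in range(1, num_nodes)]

def range_partition_sort(local_keys_per_node, num_nodes):
    """Route keys to ranges by splitter, then sort each range locally;
    concatenating the sorted ranges yields the globally sorted output."""
    splitters = probabilistic_splitters(local_keys_per_node, num_nodes)
    ranges = [[] for _ in range(num_nodes)]
    for keys in local_keys_per_node:
        for k in keys:
            ranges[bisect.bisect_right(splitters, k)].append(k)
    return [sorted(r) for r in ranges]

nodes = [[random.randrange(10**6) for _ in range(5000)] for _ in range(4)]
print([len(r) for r in range_partition_sort(nodes, 4)])  # roughly balanced
```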