We present a scalable distributed data structure called LH*. LH* generalizes Linear Hashing (LH) to distributed RAM and disk files. An LH* file can be created from records with primary keys, or objects with OIDs, provided by any number of distributed and autonomous clients. It does not require a central directory, and grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The number of messages per random insertion is one in general, and three in the worst case, regardless of the file size. The number of messages per key search is two in general, and four in the worst case. The file supports parallel operations, e.g., hash joins and scans. Performing a parallel operation on a file of M buckets costs at most 2M + 1 messages, and between 1 and O(log_2 M) rounds of messages.
We first describe the basic LH* scheme where a coordinator site manages bucket splits, and splits a bucket every time a collision occurs. We show that the average load factor of an LH* file is 65-70% regardless of file size, and bucket capacity. We then enhance the scheme with load control, performed at no additional message cost. The average load factor then increases to 80-95%. These values are about that of LH, but the load factor for LH* varies more.
We next define LH* schemes without a coordinator. We show that insert and search costs are the same as for the basic scheme. The splitting cost decreases on the average, but becomes more variable, as cascading splits are needed to prevent file overload. Next, we briefly describe two variants of splitting policy, using parallel splits and presplitting that should enhance performance for high-performance applications.
All together, we show that LH* files can efficiently scale to files that are orders of magnitude larger in size than single-site files. LH* files that reside in main memory may also be much faster than single-site disk files. Finally, LH* files can be more efficient than any distributed file with a centralized directory, or a static parallel or distributed hash file.
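The message bounds quoted above (one message per insert in general, three in the worst case) can be illustrated with a minimal sketch of LH*-style addressing. The abstract does not spell out the algorithms, so everything below is an assumption made for illustration: the hash family h_i(C) = C mod 2^i with a single initial bucket, the client image (i', n'), the server forwarding rule, and the image adjustment follow the commonly described formulation of LH*. All names (`FileState`, `ClientImage`, `server_forward`, `route`) are hypothetical.

```python
from dataclasses import dataclass


def h(level: int, key: int) -> int:
    """Assumed hash family: h_level(key) = key mod 2**level (one initial bucket)."""
    return key % (2 ** level)


@dataclass
class FileState:
    """True file state (file level i, split pointer n); no client sees it directly."""
    i: int = 0
    n: int = 0

    def bucket_level(self, addr: int) -> int:
        # Buckets already split in this round, and the buckets they created,
        # are at level i + 1; the remaining buckets are at level i.
        if addr < self.n or addr >= 2 ** self.i:
            return self.i + 1
        return self.i

    def true_address(self, key: int) -> int:
        # Classic linear-hashing address computation on the true state.
        a = h(self.i, key)
        return h(self.i + 1, key) if a < self.n else a


@dataclass
class ClientImage:
    """A client's possibly outdated view (i', n') of the file state."""
    i: int = 0
    n: int = 0

    def address(self, key: int) -> int:
        a = h(self.i, key)
        return h(self.i + 1, key) if a < self.n else a

    def adjust(self, j: int, a: int) -> None:
        # Image adjustment from the level j of the first addressed bucket a.
        if j > self.i:
            self.i = j - 1
            self.n = a + 1
            if self.n >= 2 ** self.i:
                self.n = 0
                self.i += 1


def server_forward(a: int, j: int, key: int) -> int:
    """Address the server of bucket a (level j) sends the key to; a means 'keep it'."""
    a1 = h(j, key)
    if a1 != a:
        a2 = h(j - 1, key)
        if a < a2 < a1:
            a1 = a2
    return a1


def route(client: ClientImage, state: FileState, key: int) -> tuple[int, int]:
    """Deliver a key; return (final bucket, number of server-to-server forwards)."""
    first = a = client.address(key)
    hops = 0
    while True:
        nxt = server_forward(a, state.bucket_level(a), key)
        if nxt == a:
            break
        a, hops = nxt, hops + 1
    if hops:
        # An addressing error occurred, so the client image is corrected.
        client.adjust(state.bucket_level(first), first)
    return a, hops


# A brand-new client (image i' = 0, n' = 0) addressing a file of 8 buckets.
state = FileState(i=3, n=0)
client = ClientImage()
print(route(client, state, 7))  # (7, 2): worst case, two forwards
print(route(client, state, 7))  # (7, 1): the image has improved
print(route(client, state, 7))  # (7, 0): the image is now exact
```

In this sketch an insert costs one message plus one per forward, so the two-forward case corresponds to the three-message worst case stated above, and each addressing error moves the client image closer to the true file state without any central directory.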
Mirroring is a popular technique for enhancing file availability. We incorporate this technique into the LH* algorithms for scalable distributed linear hash files. Several schemes for mirroring LH* files are presented in this paper. The schemes increase the availability of LH* files in the presence of node failures. Every record remains accessible in the presence of a single node failure, and usually in the presence of multiple-node failures. The price is, as usual, twice as much storage for data, and an increase in the number of messages. The different schemes are characterized by different trade-offs, and they accommodate diverse application requirements. The additional messaging cost per insert is about the same for all the schemes, and is roughly only one message. The cost of a bucket recovery may in contrast vary greatly, from one message for one type of scheme, to a few for another, and many for yet another.