Large-Scale Sorting in Uniform Memory Hierarchies

Vitter, Jeffrey Scott; Nodine, Marian H.

doi:10.1006/jpdc.1993.1008

Cited by 26 publications

(11 citation statements)

References 6 publications


“…Then it can be applied separately to each cache level, noting that the data transfer in the higher levels do not contribute for any given level. ✷ These lower bounds are in the same spirit as those of Vitter and Nodine [32] (for the S-UMH model) and Savage [28], that is, the lower bounds do not capture the simultaneous interaction of the different levels.…”

Section: Theorem 55 the Lower Bound For Sorting In The Restricted Mu (mentioning)

confidence: 91%


Towards a theory of cache-efficient algorithms

Sen, Chatterjee, Dumir. 2002. J. ACM


We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns could lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and dealing with the hitherto unresolved problem of limited associativity.


“…Bilardi and Peserico [8] investigate further the complexity of designing algorithms without the knowledge of architectural parameters. Other attempts were directed towards extracting better performance by parallel memory hierarchies [32,33,14], where several blocks could be transferred simultaneously.…”

Section: Related Work (mentioning)

confidence: 99%

Towards a theory of cache-efficient algorithms

Sen, Chatterjee, Dumir. 2002. J. ACM


“…Nodine and Vitter [22] describe several efficient sorting algorithms for the parallel disk model. Interestingly, Nodine and Vitter [21] also consider a multiprocessor version of the parallel disk model, but not in a way that is appropriate for multicores, since they assume that the processors are interconnected via a PRAM-type network or share the entire internal memory (see, e.g., [28,29,30]). Assuming that processors share internal memory does not fit the current practice of multicore.…”

Section: Introduction (mentioning)

confidence: 99%

Fundamental parallel algorithms for private-cache chip multiprocessors

Arge, Goodrich, Nelson, et al. 2008. Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures


In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that can scale to hundreds or even thousands of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way that cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. All algorithms in the paper are asymptotically optimal in terms of the parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache chip multiprocessors.


“…These include both methods to improve the rate of I/O delivery to uniprocessor systems by introducing parallelism into the I/O subsystem, and methods of improving the I/O performance of multiprocessors. At the highest level, new theoretical models of parallel I/O systems are being developed [1,33,25,32], allowing the study of many fundamental algorithms in terms of their I/O complexity. At the next level, new language and compiler features are being developed to support I/O parallelism and optimizations, using data layout conversion [12] and compiler hints [29].…”

Section: Introduction (mentioning)

confidence: 99%

Distributed scheduling algorithms to improve the performance of parallel data transfers

Durand, Jain, Tseytlin. 1994. SIGARCH Comput. Archit. News


The cost of data transfers, and in particular of I/O operations, is a growing problem in parallel computing. This performance bottleneck is especially severe for data-intensive applications such as multimedia information systems, databases, and Grand Challenge problems. A promising approach to alleviating this bottleneck is to schedule parallel I/O operations explicitly. Although centralized algorithms for batch scheduling of parallel I/O operations have previously been developed, they may not be appropriate for all applications and architectures. We develop a class of decentralized algorithms for scheduling parallel I/O operations, where the objective is to reduce the time required to complete a given set of transfers. These algorithms, based on edge-coloring and matching of bipartite graphs, rely upon simple heuristics to obtain shorter schedules. We present simulation results indicating that the best of our algorithms can produce schedules whose length is within 2-20% of the optimal schedule, a substantial improvement on previous decentralized algorithms. We discuss theoretical and experimental work in progress and possible extensions.
