Balanced parallel sort on hypercube multiprocessors

Abali, Bülent; Özgüner, F.; Bataineh, Abdulla

doi:10.1109/71.224220

Cited by 21 publications

(10 citation statements)

References 12 publications


“…where $\gamma_d$ is defined as in (3), $\Gamma(v, d)$ has to fulfill the constraint specified by (1), and $p_a = \langle V_e^a, C_e^a \rangle$ is a valid design transformation of EPN $e$.…”

Section: Optimization Problem (mentioning)

confidence: 99%

“…Replicating processes increases data parallelism and structural unfolding of a process increases the task and pipeline parallelism by hierarchically instantiating more processes in the process network. Furthermore, as recursive algorithms are commonly used in mathematical [1] and multimedia [2] applications, we study the recursive description of processes as a structural unfolding method.…”

Section: Introduction (mentioning)

confidence: 99%


Expandable process networks to efficiently specify and explore task, data, and pipeline parallelism

Schor, Yang, Bacivarov, et al. 2013

2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES)


Running each application of a many-core system on an isolated (virtual) guest machine is a widely considered solution for performance and reliability issues. When a new application is started, the guest machine is assigned with an amount of computing resources that depends on the overall workload of the system and is not known to the designer at specification time. For instance, the computing resources might consist of many slow or a few fast processing elements. If the application is statically specified, as, for example, with Kahn process networks, the number of processing elements usable by an application is upper bounded by its number of processes. Similarly, the inter-process communication overhead might limit the maximum performance if the number of processing elements is significantly smaller than the number of processes. In this paper, we propose a formal extension for streaming programming models called expandable process networks (EPNs) that tackles this challenge by abstracting several possible granularities in a single specification. This enables the automatic exploration of task, data, and pipeline parallelism by two basic design transformation techniques, namely replication and unfolding. Then, the EPN semantics facilitates the synthesis of multiple design implementations that are all derived from one high-level specification. At runtime, the best fitting implementation for the given computing resources is selected to maximize the performance. Finally, we demonstrate the effectiveness of the proposed model on Intel's 48-core SCC processor.


“…Under such a policy, compulsory misses (as well as coherence misses) become significant. There is extensive research in the development of algorithms to minimize the amount of communication needed on traditional architectures and in the development of new architectures to either reduce the communication or overlap several communication activities [19, 26-34]. However, communication time remains a significant fraction of total run time due to the large diameter of the topologies and the limited connectivity between processors.…”

Section: SOME-Bus Architecture Enhancements (mentioning)

confidence: 99%

The performance of parallel matrix algorithms on a broadcast‐based architecture

Katsinis, Hecht, Zhu, et al. 2005

Concurrency and Computation


SUMMARY: Due to advances in fiber-optics and very large scale integration (VLSI) technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. This paper summarizes one such multiprocessor architecture called the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus). It also presents enhancements to the network interface and the cache and directory controllers which support cache block combining, capture and prefetch, and allow complete overlap of processing time with the communication time due to compulsory misses. The paper uses two fundamental matrix algorithms to characterize the impact of each enhancement on performance. Cache miss analysis and results from the execution of these programs on a SOME-Bus simulator show that block capture and prefetch combined with an effective block replacement policy succeed in significantly reducing the miss rate due to compulsory misses as the cache size increases, while a similar increase of cache size in traditional architectures leaves the miss rate due to compulsory misses unaffected.


“…Most parallel sorting algorithms have been developed either in the context of PRAM models, e.g. [9-11], or network models [12-14]. These algorithms typically assume a large number of processors (comparable to the number of data elements) and either neglect communication costs (PRAM algorithms) or rely on a specific machine structure (network-based algorithms).…”

Section: The Performance Of a Selection Of Sorting Algorithms (mentioning)

confidence: 99%