Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques
DOI: 10.1109/pact.2001.953304

Architectural support for parallel reductions in scalable shared-memory multiprocessors

Abstract: Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined t…

Cited by 16 publications (24 citation statements)
References 26 publications
“…Our work takes a different approach that modifies the cache coherence protocol to simultaneously maintain multiple modified copies of a cache line for reduction. While this approach is similar to the solution proposed by [Kim03] and [Garzaran01] in the context of distributed shared memory multi-processors, we extend its Figure 6. In our scheme, the cache lines which hold the reduction target are marked non-coherent and each core participating in the reduction operation is allowed to have a modified copy of the cache line while computing the partial reduced value.…”
Section: Parallel Reduction Hardware (mentioning)
confidence: 96%
“…Zotov [59] supports the barrier operation using a dedicated network. Other researchers have proposed adding specialized vector operations to the memory controller to support vector scatter-add [1] or parallel reduction operations [15]. The former works well for applications that are insensitive to floating point rounding errors and whose working set can fit into the caches, but requires programmers to handle the temporarily incoherent states of the affected data.…”
Section: Related Work (mentioning)
confidence: 99%
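The remark about floating point rounding errors can be illustrated with a small sketch. It is not taken from any of the cited papers; the helper name `sum_in_order` and the sample values are assumptions chosen for illustration. It shows that floating point addition is not associative, so a reduction that combines partial results in a different order may return a different value.

```python
def sum_in_order(values):
    """Accumulate values left to right, as one fixed reduction order would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical sample data: two large values that cancel, plus a small one.
values = [1e16, -1e16, 1.0]

# Order 1: (1e16 + -1e16) + 1.0 keeps the small term.
in_order = sum_in_order(values)

# Order 2: (-1e16 + 1.0) + 1e16 loses the small term, because 1.0 is
# below the spacing of representable doubles near 1e16.
reordered = sum_in_order([values[1], values[2], values[0]])

print(in_order, reordered)
```

An application that tolerates this kind of difference can let hardware or a runtime combine partial results in any order; one that does not needs a fixed combination order.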
“…In addition, the Impulse project has focused solely on uniprocessor systems, whereas our work leveraging cache coherence has shown improvements for both uniprocessor and single-node multiprocessor (SMP) systems, and, in this paper, on multinode systems as well. Our parallel reduction technique was initially proposed in a non-active memory context in [4], but also used software flushes to guarantee data coherence and required changes to both the main processor and its cache subsystem. We follow the same idea, but our leveraging of the cache coherence protocol eliminates flushes and provides transparency in the programming model and scalability to multiprocessor systems without any changes to the main processor or its caches.…”
Section: Related Work (mentioning)
confidence: 99%
“…In our active memory technique, the merge operations are done by the memory controller, not by the main processors. When each cache line of the shadow vector x is written back to memory, the memory controller performs the merge operation [4]. Therefore, the active memory technique can save processor busy time by eliminating the merge phase, and remote memory access time since the writebacks are not in the critical path of execution.…”
Section: Parallel Reduction (mentioning)
confidence: 99%
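The merge phase that the statement above says is eliminated can be pictured with a software-only sketch. This is a minimal illustration of reduction via private shadow arrays under stated assumptions, not the hardware mechanism of the paper or of the citing work; the function name `privatized_reduction`, its parameters, and the round-robin assignment of iterations to workers are all hypothetical.

```python
def privatized_reduction(indices, values, n_bins, n_workers):
    """Software analogue of a privatized reduction (sequential simulation).

    Each simulated worker accumulates into its own shadow array, then a
    merge phase adds all shadow arrays into the shared result.
    """
    # One private shadow array per worker, initialized to the neutral element.
    shadows = [[0.0] * n_bins for _ in range(n_workers)]

    # Accumulation phase: iterations are dealt round-robin to workers
    # (an assumption; any partitioning of the iterations would do).
    for k, (i, v) in enumerate(zip(indices, values)):
        shadows[k % n_workers][i] += v

    # Merge phase: touches every element of every shadow array.
    result = [0.0] * n_bins
    for shadow in shadows:
        for b in range(n_bins):
            result[b] += shadow[b]
    return result


# Hypothetical irregular access pattern, as in a sparse update loop.
merged = privatized_reduction([0, 2, 2, 1, 0], [1.0, 2.0, 3.0, 4.0, 5.0],
                              n_bins=3, n_workers=2)
print(merged)
```

In this sketch the merge loop runs on the processors after the accumulation finishes. The quoted statement describes moving that combining step to the memory controller, applied as each shadow cache line is written back.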