Architectural support for parallel reductions in scalable shared-memory multiprocessors

Garzarán, María Jesús; Prvulovic, Milos; Zhang, Ye; Jula, Alin; Yu, Hao; Rauchwerger, Lawrence; Torrellas, Josep

doi:10.1109/pact.2001.953304

Cited by 16 publications

(24 citation statements)

References 26 publications

“…Our work takes a different approach that modifies the cache coherence protocol to simultaneously maintain multiple modified copies of a cache line for reduction. While this approach is similar to the solution proposed by [Kim03] and [Garzaran01] in the context of distributed shared memory multi-processors, we extend its Figure 6. In our scheme, the cache lines which hold the reduction target are marked non-coherent and each core participating in the reduction operation is allowed to have a modified copy of the cache line while computing the partial reduced value.…”

Section: Parallel Reduction Hardware (mentioning)

confidence: 96%

Scaling performance of interior-point method on large-scale chip multiprocessor system

Smelyanskiy

Lee

Kim

et al. 2007

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing

In this paper we describe parallelization of interior-point method (IPM) aimed at achieving high scalability on large-scale chip multiprocessors (CMPs). IPM is an important computational technique used to solve optimization problems in many areas of science, engineering and finance. IPM spends most of its computation time in a few sparse linear algebra kernels. While each of these kernels contains a large amount of parallelism, sparse irregular datasets seen in many optimization problems make parallelism difficult to exploit. As a result, most researchers have shown only a relatively low scalability of 4X-12X on medium to large scale parallel machines. This paper proposes and evaluates several algorithmic and hardware features to improve IPM parallel performance on large-scale CMPs. Through detailed simulations, we demonstrate how exploring multiple levels of parallelism with hardware support for low overhead task queues and parallel reduction enables IPM to achieve up to 48X parallel speedup on a 64-core CMP.
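
The citation statement for this entry describes each core holding its own modified copy of the reduction target while it computes a partial value. A familiar software analogue is privatization: each worker accumulates into a private copy and the copies are combined afterwards. The following is a minimal Python sketch of that analogue only, not the coherence-protocol mechanism of the cited papers; the function name `privatized_reduce` and its parameters are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def privatized_reduce(indices, values, size, workers=4):
    """Compute out[indices[k]] += values[k] with one private copy per worker.

    Hypothetical helper: a software stand-in for the per-core partial
    copies mentioned in the citation statement.
    """

    def partial(chunk):
        # Private copy, initialised to the identity element of addition.
        local = [0] * size
        for k in chunk:
            local[indices[k]] += values[k]
        return local

    # Strided split of the iteration space across the workers.
    chunks = [range(w, len(indices), workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, chunks))

    # Merge step: combine the private copies element by element.
    out = [0] * size
    for local in partials:
        for j, v in enumerate(local):
            out[j] += v
    return out
```

Because no worker touches another worker's copy during accumulation, no locking is needed until the merge step. With floating-point values the result can vary slightly with the order in which partial sums are combined.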

“…Zotov [59] supports the barrier operation using a dedicated network. Other researchers have proposed adding specialized vector operations to the memory controller to support vector scatter-add [1] or parallel reduction operations [15]. The former works well for applications that are insensitive to floating point rounding errors and whose working set can fit into the caches, but requires programmers to handle the temporarily incoherent states of the affected data.…”

Section: Related Work (mentioning)

confidence: 99%

Active memory controller

et al. 2012

Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×-15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.

The work was done when most of the authors were at the University of Utah. The views and conclusions contained herein are those of the authors and should not be interpreted as representing those, either express or implied, of Intel, CAS, IBM, Chalmers, AMD, nVidia, or the University of Utah.

“…In addition, the Impulse project has focused solely on uniprocessor systems, whereas our work leveraging cache coherence has shown improvements for both uniprocessor and single-node multiprocessor (SMP) systems, and, in this paper, on multinode systems as well. Our parallel reduction technique was initially proposed in a non-active memory context in [4], but also used software flushes to guarantee data coherence and required changes to both the main processor and its cache subsystem. We follow the same idea, but our leveraging of the cache coherence protocol eliminates flushes and provides transparency in the programming model and scalability to multiprocessor systems without any changes to the main processor or its caches.…”

Section: Related Work (mentioning)

confidence: 99%

“…In our active memory technique, the merge operations are done by the memory controller, not by the main processors. When each cache line of the shadow vector x is written back to memory, the memory controller performs the merge operation [4]. Therefore, the active memory technique can save processor busy time by eliminating the merge phase, and remote memory access time since the writebacks are not in the critical path of execution.…”

Section: Parallel Reduction (mentioning)

confidence: 99%
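
The merge-on-writeback idea in the last citation statement can be modelled in a few lines. The sketch below is a toy Python simulation under stated assumptions, not the cited design: the class names `MergingMemory` and `Processor`, the four-word line size, and the use of integer addition as the reduction operator are all hypothetical choices made for illustration.

```python
class MergingMemory:
    """Toy home memory whose controller merges shadow lines on writeback."""

    def __init__(self, size, line_words=4):
        self.data = [0] * size
        self.line_words = line_words  # assumed line size, in words

    def writeback(self, line_no, shadow_line):
        # The controller, not the processor, combines the partial values
        # carried by the written-back line into the home copy.
        base = line_no * self.line_words
        for offset, partial in enumerate(shadow_line):
            self.data[base + offset] += partial


class Processor:
    """Toy processor that accumulates into private shadow lines."""

    def __init__(self, memory):
        self.memory = memory
        self.shadow = {}  # line number -> list of partial sums

    def accumulate(self, index, value):
        line_no, offset = divmod(index, self.memory.line_words)
        # A shadow line starts at the identity element (0 for addition).
        line = self.shadow.setdefault(line_no, [0] * self.memory.line_words)
        line[offset] += value

    def evict_all(self):
        # Writing a shadow line back triggers the merge in the controller.
        for line_no, line in self.shadow.items():
            self.memory.writeback(line_no, line)
        self.shadow.clear()


memory = MergingMemory(size=8)
p0, p1 = Processor(memory), Processor(memory)
p0.accumulate(1, 5)
p1.accumulate(1, 7)
p1.accumulate(6, 2)
p0.evict_all()
p1.evict_all()
```

In this model the processors never run a merge loop of their own; the combining work happens entirely inside `writeback`, which mirrors the statement that the merge phase is removed from processor busy time.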