2012 IEEE 26th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2012.61

Mapping Dense LU Factorization on Multicore Supercomputer Nodes

Abstract: Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. The critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivoting with the computation of rank-k updates. …


Cited by 7 publications (6 citation statements)
References 23 publications
“…In dense LU factorization, the matrix being factorized is decomposed into a 2D grid of blocks, which in the Charm++ implementation [38] is encapsulated in a chare array. We can succinctly describe the parallel control flow of a non-pivoting LU in SDAG as follows: Each block goes through various phases as it executes depending on its location in the matrix.…”
Section: opts.setQueueing(CK_QUEUEING_LIFO);
confidence: 99%
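The phases the quoted statement alludes to can be illustrated with a serial, right-looking blocked LU without pivoting. This is a plain NumPy sketch, not the Charm++/SDAG implementation from the cited paper; the function name and the blocking scheme are illustrative only. Each block of the 2D grid passes through a phase determined by its position: the diagonal block is factorized, blocks in the active row and column receive triangular solves, and trailing blocks receive rank-b updates.

```python
import numpy as np

def blocked_lu_nopivot(A, b):
    """Serial right-looking blocked LU without pivoting (illustrative sketch).

    Returns the combined factors in one matrix: unit lower triangle holds L,
    upper triangle holds U.
    """
    A = A.copy()
    n = A.shape[0]
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # Phase 1: factorize the diagonal block in place (L\U storage).
        for i in range(k0, k1):
            for j in range(i + 1, k1):
                A[j, i] /= A[i, i]
                A[j, i + 1:k1] -= A[j, i] * A[i, i + 1:k1]
        # Phase 2: triangular solves for the active block row and column.
        L = np.tril(A[k0:k1, k0:k1], -1) + np.eye(k1 - k0)
        U = np.triu(A[k0:k1, k0:k1])
        A[k0:k1, k1:] = np.linalg.solve(L, A[k0:k1, k1:])
        A[k1:, k0:k1] = np.linalg.solve(U.T, A[k1:, k0:k1].T).T
        # Phase 3: rank-b update of the trailing submatrix.
        A[k1:, k1:] -= A[k1:, k0:k1] @ A[k0:k1, k1:]
    return A
```

In the parallel SDAG formulation each of these phases becomes a message-driven state that a block enters when its dependencies (the factorized diagonal block, the solved row/column blocks) have arrived.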
“…In our highly scalable implementation of dense LU [38], we demonstrate how to exploit Charm++ groups to control incoming messages by explicitly scheduling when messages arrive. For LU, instead of sending a block of data as soon as it is ready on the sender side, we notify the receiver that the data is ready and allow the receiver to determine which blocks to request, based on what is ready and on the optimized schedule it has computed that adheres to the dependencies inherent in an LU computation.…”
Section: Memory-Aware Scheduling in LU
confidence: 99%
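The pull-based transfer described above can be sketched as follows. The class and method names are invented for illustration and do not correspond to the Charm++ API: the sender only advertises readiness, and the receiver pulls blocks in an order dictated by its own precomputed schedule.

```python
class PullSender:
    """Sender side of a pull-based transfer: advertises which blocks are
    ready instead of eagerly pushing them (illustrative names only)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block_id -> payload
        self.ready = set(blocks)    # every block is ready in this toy example
        self.log = []               # order in which blocks were actually pulled

    def notify_ready(self):
        # Only a readiness notification crosses the wire here, not payloads.
        return set(self.ready)

    def request(self, block_id):
        assert block_id in self.ready, "receiver asked for an unready block"
        self.log.append(block_id)
        return self.blocks[block_id]


class PullReceiver:
    """Receiver fetches ready blocks in an order given by a precomputed
    schedule that respects the LU dependency structure."""

    def __init__(self, schedule):
        self.schedule = list(schedule)
        self.received = {}

    def drain(self, sender):
        ready = sender.notify_ready()
        for bid in self.schedule:
            if bid in ready:
                self.received[bid] = sender.request(bid)
        return self.received
```

The design point the quote makes is that the receiver, not the sender, decides the arrival order, which lets it bound memory usage and prioritize blocks on the critical path.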
“…The algorithms are then expressed by collectively addressing processes that own a row or column of matrix elements/blocks. For example, recent work has demonstrated a high-performance dense LU factorization using only the aforementioned collectives on non-trivially defined groups of processes, in a parallel programming paradigm that supports unranked and system-ranked process groups [10].…”

Terminology (from the quoted excerpt):
- n: number of processes in the parent process group
- m: number of processes participating in the new process group
- k: branching factor (degree) of the spanning tree
- d_{i,k}: depth of a rank-i process in a balanced spanning tree of branching factor k
- f: fraction of members of the original process group participating in the new group
Section: System-Ranked Process Groups
confidence: 99%
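Of the quantities in the terminology list above, d_{i,k} is the only derived one. Assuming ranks are assigned to the balanced k-ary spanning tree in breadth-first order (an assumption not stated in the quote), the depth of rank i can be computed by counting how many complete levels fit below it:

```python
def kary_depth(i, k):
    """Depth d_{i,k} of rank i (0-based, breadth-first numbering) in a
    balanced spanning tree of branching factor k. Level d holds k**d
    nodes, so we advance levels until rank i is covered."""
    assert k >= 2 and i >= 0
    depth, covered = 0, 1  # ranks covered through the current level
    while covered <= i:
        depth += 1
        covered += k ** depth
    return depth
```

For a binary tree (k = 2), rank 0 is the root at depth 0, ranks 1-2 sit at depth 1, ranks 3-6 at depth 2, and so on; collective latency along the tree scales with this depth.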
“…The task scheduling between CPUs and GPUs is also important. Optimized task scheduling can effectively reduce the computation [8][9][10][11][12], which corresponds to the idea on general-purpose processors [13]. Reconfigurable platforms focus on designing high-performance floating-point computation units and scalable architectures [14][15][16].…”
Section: Introduction
confidence: 99%