2012 IEEE 26th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2012.61

Mapping Dense LU Factorization on Multicore Supercomputer Nodes

Abstract: Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. The critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivoting with the computation of rank-k updates. …


Cited by 7 publications (6 citation statements)
References 23 publications
“…In dense LU factorization, the matrix being factorized is decomposed into a 2D grid of blocks, which in the Charm++ implementation [38] is encapsulated in a chare array. We can succinctly describe the parallel control flow of a non-pivoting LU in SDAG as follows: Each block goes through various phases as it executes depending on its location in the matrix.…”
Section: opts.setQueueing(CK_QUEUEING_LIFO);
confidence: 99%
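The phases the quoted statement alludes to can be illustrated with a serial, right-looking blocked LU without pivoting. This is a plain NumPy sketch, not the Charm++/SDAG implementation from the cited paper; the function name and the blocking scheme are illustrative only. Each block of the 2D grid passes through a phase determined by its position: the diagonal block is factorized, blocks in the active row and column receive triangular solves, and trailing blocks receive rank-b updates.

```python
import numpy as np

def blocked_lu_nopivot(A, b):
    """Serial right-looking blocked LU without pivoting (illustrative sketch).

    Returns the combined factors in one matrix: unit lower triangle holds L,
    upper triangle holds U.
    """
    A = A.copy()
    n = A.shape[0]
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # Phase 1: factorize the diagonal block in place (L\U storage).
        for i in range(k0, k1):
            for j in range(i + 1, k1):
                A[j, i] /= A[i, i]
                A[j, i + 1:k1] -= A[j, i] * A[i, i + 1:k1]
        # Phase 2: triangular solves for the active block row and column.
        L = np.tril(A[k0:k1, k0:k1], -1) + np.eye(k1 - k0)
        U = np.triu(A[k0:k1, k0:k1])
        A[k0:k1, k1:] = np.linalg.solve(L, A[k0:k1, k1:])
        A[k1:, k0:k1] = np.linalg.solve(U.T, A[k1:, k0:k1].T).T
        # Phase 3: rank-b update of the trailing submatrix.
        A[k1:, k1:] -= A[k1:, k0:k1] @ A[k0:k1, k1:]
    return A
```

In the parallel SDAG formulation each of these phases becomes a message-driven state that a block enters when its dependencies (the factorized diagonal block, the solved row/column blocks) have arrived.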
“…In our highly scalable implementation of dense LU [38], we demonstrate how to exploit Charm++ groups to control incoming messages by explicitly scheduling when messages arrive. For LU, instead of sending a block of data as soon as it is ready on the sender side, we notify the receiver that the data is ready and allow the receiver to determine which blocks to request, based on what is ready and on the optimized schedule it has computed that adheres to the dependencies inherent in an LU computation.…”
Section: Memory-Aware Scheduling in LU
confidence: 99%
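The pull-based transfer described above can be sketched as follows. The class and method names are invented for illustration and do not correspond to the Charm++ API: the sender only advertises readiness, and the receiver pulls blocks in an order dictated by its own precomputed schedule.

```python
class PullSender:
    """Sender side of a pull-based transfer: advertises which blocks are
    ready instead of eagerly pushing them (illustrative names only)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block_id -> payload
        self.ready = set(blocks)    # every block is ready in this toy example
        self.log = []               # order in which blocks were actually pulled

    def notify_ready(self):
        # Only a readiness notification crosses the wire here, not payloads.
        return set(self.ready)

    def request(self, block_id):
        assert block_id in self.ready, "receiver asked for an unready block"
        self.log.append(block_id)
        return self.blocks[block_id]


class PullReceiver:
    """Receiver fetches ready blocks in an order given by a precomputed
    schedule that respects the LU dependency structure."""

    def __init__(self, schedule):
        self.schedule = list(schedule)
        self.received = {}

    def drain(self, sender):
        ready = sender.notify_ready()
        for bid in self.schedule:
            if bid in ready:
                self.received[bid] = sender.request(bid)
        return self.received
```

The design point the quote makes is that the receiver, not the sender, decides the arrival order, which lets it bound memory usage and prioritize blocks on the critical path.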
“…The algorithms are then expressed by collectively addressing processes that own a row or column of matrix elements/blocks. For example, recent work has demonstrated a high-performance dense LU factorization using only the aforementioned collectives on non-trivially defined groups of processes, in a parallel programming paradigm that supports unranked and system-ranked process groups [10].…”

Terminology (from the quoted excerpt):
- n: number of processes in the parent process group
- m: number of processes participating in the new process group
- k: branching factor (degree) of the spanning tree
- d_{i,k}: depth of a rank-i process in a balanced spanning tree of branching factor k
- f: fraction of members of the original process group participating in the new group
Section: System-Ranked Process Groups
confidence: 99%
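Of the quantities in the terminology list above, d_{i,k} is the only derived one. Assuming ranks are assigned to the balanced k-ary spanning tree in breadth-first order (an assumption not stated in the quote), the depth of rank i can be computed by counting how many complete levels fit below it:

```python
def kary_depth(i, k):
    """Depth d_{i,k} of rank i (0-based, breadth-first numbering) in a
    balanced spanning tree of branching factor k. Level d holds k**d
    nodes, so we advance levels until rank i is covered."""
    assert k >= 2 and i >= 0
    depth, covered = 0, 1  # ranks covered through the current level
    while covered <= i:
        depth += 1
        covered += k ** depth
    return depth
```

For a binary tree (k = 2), rank 0 is the root at depth 0, ranks 1-2 sit at depth 1, ranks 3-6 at depth 2, and so on; collective latency along the tree scales with this depth.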
“…The task scheduling between CPUs and GPUs is also important. Optimized task scheduling can effectively reduce the computation [8][9][10][11][12], which corresponds to the idea on general-purpose processors [13]. Reconfigurable platforms focus on designing high-performance floating-point computation units and scalable architectures [14][15][16].…”
Section: Introduction
confidence: 99%