We present MATE, a new model for developing communication-tolerant scientific applications. MATE employs a combination of mechanisms to reduce or hide the cost of network and intra-node communication. While previous approaches have addressed either source of communication overhead separately, the contribution of MATE is demonstrating the symbiotic effect of reducing both forms of data movement together in a single unified model. We explain the rationale behind our model and show its effectiveness in three scientific computing motifs on up to 64k cores of the NERSC Cori supercomputer. Lastly, we show how MATE can improve the workload balance of an irregular multigrid solver.

[Figure: computation ratio of the Basic-MPI and MATE variants on 64, 256, and 1024 nodes.]

We can see that the MATE variant achieves a constant ratio of ∼71% computation across all scales. If we interpret this ratio as MATE's upper bound for core usage, its speedup is related to the difference between this bound and the computation ratio of the Basic-MPI variant. At smaller scales (64 and 256 nodes), the Basic-MPI variant achieves a relatively good computation ratio (69% and 65%, respectively), which leaves little improvement potential toward MATE's upper bound. On 1024 nodes, however, Basic-MPI's computation share decreases to 53%, allowing MATE to obtain a more significant improvement. We attribute this decrease to a jump in communication cost between 256 and 1024 nodes. Since each cabinet in Cori Phase II contains 192 nodes (appendix A), on 1024 nodes many more messages are likely to cross cabinet boundaries through optical links, causing a noticeable increase in network communication costs.
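The relationship between computation ratio and attainable speedup described above can be sketched with a back-of-the-envelope model. This is an illustration only, under an assumption the text does not state explicitly: that both variants perform the same total computation, so run time is total work divided by the computation fraction.

```python
# Illustrative model (not the formal analysis from the text): if both
# variants perform the same total computation W, then run time is
# T = W / computation_ratio, and the speedup of MATE over Basic-MPI
# reduces to the ratio of their computation fractions.

def implied_speedup(basic_ratio, mate_ratio):
    """Speedup implied by computation ratios, assuming equal total work."""
    return mate_ratio / basic_ratio

# Computation ratios reported in the text (MATE holds ~71% at all scales).
for nodes, basic in [(64, 0.69), (256, 0.65), (1024, 0.53)]:
    print(f"{nodes:>5} nodes: implied speedup ~{implied_speedup(basic, 0.71):.2f}x")
```

Under this simple model, the gap at 1024 nodes (53% vs. 71%) leaves substantially more headroom than the gap at 64 or 256 nodes, matching the qualitative trend reported above.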
Summary

We have shown that the MATE model can improve the performance of dense matrix multiplication on our two supercomputing testbeds. Our matrix-size scaling experiment shows that overlapping strategies obtain speedups over a broad range of matrix sizes, failing to beat the baseline algorithm only at small matrix sizes, where the fixed cost of communication dominates the running time, and at large matrix sizes, where computation is the primary cost. Furthermore, this experiment shows that MATE outperforms the manually-overlapping variant at every matrix size.

Through our weak scaling studies, we determined that MATE can reduce almost half of the communication costs at large scales and obtain significant speedups. On 16k cores of the Cori Phase I testbed, MATE reduced communication costs by 45%, yielding a 13% improvement over the baseline algorithm. On 64k cores of the Cori Phase II testbed, MATE reduced communication costs by 48%, yielding an 18% improvement over the baseline algorithm.
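The connection between a communication-cost reduction and the resulting end-to-end improvement can be made concrete with a minimal sketch. This is an illustrative Amdahl-style model, not the analysis from the text: the communication fraction `f` below is an assumed quantity (not reported above), and the model assumes the baseline does not overlap communication with computation.

```python
# Illustrative model: if a fraction f of the baseline run time is
# communication and MATE removes a fraction r of that communication,
# the new (normalized) run time is 1 - f*r, so the end-to-end
# improvement is 1 / (1 - f*r) - 1.

def overall_improvement(comm_fraction, comm_reduction):
    """Fractional end-to-end improvement from reducing communication."""
    return 1.0 / (1.0 - comm_fraction * comm_reduction) - 1.0

# With the 45% reduction reported on 16k cores, an assumed ~26%
# communication share would be consistent with the observed ~13%
# end-to-end improvement.
print(f"{overall_improvement(0.26, 0.45):.2%}")
```

The model also shows why the gains grow with scale: as the communication share of the baseline rises, the same relative reduction in communication yields a larger end-to-end improvement.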