We present MATE, a new model for developing communication-tolerant scientific applications. MATE employs a combination of mechanisms to reduce or hide the cost of network and intra-node communication. While previous approaches have addressed either source of communication overhead separately, the contribution of MATE is demonstrating the symbiotic effect of reducing both forms of data movement together in a single unified model. We explain the rationale behind our model and show its effectiveness in three scientific computing motifs on up to 64k cores of the NERSC Cori supercomputer. Lastly, we show how MATE can improve the workload balance of an irregular multigrid solver.

[Figure: computation ratio of the Basic-MPI and MATE variants on 64, 256, and 1024 nodes.]

We can see that the MATE variant achieves a constant ratio of ∼71% computation across all scales. If we interpret this ratio as MATE's upper bound for core usage, its speedup is related to the difference between this bound and the computation ratio of the Basic-MPI variant. At smaller scales (64 and 256 nodes), the Basic-MPI variant achieves a relatively good computation ratio (69% and 65%, respectively), which leaves little improvement potential toward MATE's upper bound. On 1024 nodes, however, Basic-MPI's computation share decreases to 53%, allowing MATE to obtain a more significant improvement. We attribute this decrease to a jump in communication cost between 256 and 1024 nodes. Since each cabinet in Cori Phase II contains 192 nodes (appendix A), on 1024 nodes many more messages are likely to cross cabinet boundaries through optical links, causing a noticeable increase in network communication costs.
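The relationship between computation ratio and attainable speedup described above can be sketched with a back-of-the-envelope model. This is an illustration only, under an assumption the text does not state explicitly: that both variants perform the same total computation, so run time is total work divided by the computation fraction.

```python
# Illustrative model (not the formal analysis from the text): if both
# variants perform the same total computation W, then run time is
# T = W / computation_ratio, and the speedup of MATE over Basic-MPI
# reduces to the ratio of their computation fractions.

def implied_speedup(basic_ratio, mate_ratio):
    """Speedup implied by computation ratios, assuming equal total work."""
    return mate_ratio / basic_ratio

# Computation ratios reported in the text (MATE holds ~71% at all scales).
for nodes, basic in [(64, 0.69), (256, 0.65), (1024, 0.53)]:
    print(f"{nodes:>5} nodes: implied speedup ~{implied_speedup(basic, 0.71):.2f}x")
```

Under this simple model, the gap at 1024 nodes (53% vs. 71%) leaves substantially more headroom than the gap at 64 or 256 nodes, matching the qualitative trend reported above.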
Summary

We have shown that the MATE model can improve the performance of dense matrix multiplication on our two supercomputing testbeds. Our matrix-size scaling experiment shows that overlapping strategies obtain speedups over a broad range of matrix sizes, failing to beat the baseline algorithm only at small matrix sizes, where the fixed cost of communication dominates the running time, and at large matrix sizes, where computation is the primary cost. Furthermore, this experiment shows that MATE outperforms the manually-overlapping variant at every matrix size.

Through our weak scaling studies, we determined that MATE can reduce almost half of the communication costs at large scales and obtain significant speedups. On 16k cores of the Cori Phase I testbed, MATE reduced communication costs by 45%, yielding a 13% improvement over the baseline algorithm. On 64k cores of the Cori Phase II testbed, MATE reduced communication costs by 48%, yielding an 18% improvement over the baseline algorithm.
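The connection between a communication-cost reduction and the resulting end-to-end improvement can be made concrete with a minimal sketch. This is an illustrative Amdahl-style model, not the analysis from the text: the communication fraction `f` below is an assumed quantity (not reported above), and the model assumes the baseline does not overlap communication with computation.

```python
# Illustrative model: if a fraction f of the baseline run time is
# communication and MATE removes a fraction r of that communication,
# the new (normalized) run time is 1 - f*r, so the end-to-end
# improvement is 1 / (1 - f*r) - 1.

def overall_improvement(comm_fraction, comm_reduction):
    """Fractional end-to-end improvement from reducing communication."""
    return 1.0 / (1.0 - comm_fraction * comm_reduction) - 1.0

# With the 45% reduction reported on 16k cores, an assumed ~26%
# communication share would be consistent with the observed ~13%
# end-to-end improvement.
print(f"{overall_improvement(0.26, 0.45):.2%}")
```

The model also shows why the gains grow with scale: as the communication share of the baseline rises, the same relative reduction in communication yields a larger end-to-end improvement.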