Coded computation, a method that uses error correction coding to mitigate "stragglers" in distributed computing systems, has lately received significant attention. First used in vector-matrix multiplication, its range of application was later extended to include matrix-matrix multiplication, heterogeneous networks, convolution, and approximate computing. A drawback of previous results is that they completely ignore work completed by stragglers. While stragglers are slower compute nodes, in many settings the amount of work they complete can be non-negligible. Thus, in this work, we propose a hierarchical coded computation method that exploits the work completed by all compute nodes. We partition each node's computation into layers of sub-computations such that each layer can be treated as a (distinct) erasure channel. We then design a different erasure code for each layer so that all layers have the same failure exponent. We propose design guidelines to optimize the parameters of such codes. Numerical results show that the proposed scheme improves the expected finishing time by a factor of 1.5 compared to previous work.
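The layered idea can be illustrated with a minimal sketch. It assumes, hypothetically, that every worker processes its layers in order and that layer l becomes decodable once at least k_l workers have finished it; the function name `decodable_layers`, the progress values, and the thresholds are illustrative choices, not taken from the paper.

```python
def decodable_layers(progress, k_per_layer):
    """Return, per layer, whether enough workers have finished it.

    progress[w]    -- number of layers worker w has completed, in order
                      (assumed model of sequential processing)
    k_per_layer[l] -- hypothetical number of completed results needed
                      to decode layer l
    """
    status = []
    for layer, k in enumerate(k_per_layer):
        # A worker has finished this layer if it completed more than `layer` layers.
        finished = sum(1 for p in progress if p > layer)
        status.append(finished >= k)
    return status


# Five workers, three layers. Workers 3 and 4 are stragglers that finish
# only the first layer, yet their results still count toward that layer.
progress = [3, 3, 2, 1, 1]
k_per_layer = [5, 3, 2]   # earlier layers see more completions, so they can demand more
status = decodable_layers(progress, k_per_layer)
job_done = all(status)
```

In this toy setting the thresholds decrease with the layer index because later layers are reached by fewer workers; how to choose them so that all layers share the same failure exponent is the design problem the abstract refers to.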
In cloud computing systems, slow processing nodes, often referred to as "stragglers", can significantly extend the computation time. Recent results have shown that error correction coding can be used to reduce the effect of stragglers. In this work we introduce a scheme that, in addition to using error correction to distribute mixed jobs across nodes, is also able to exploit the work completed by all nodes, including stragglers. We first consider vector-matrix multiplication and apply maximum distance separable (MDS) codes to small blocks of sub-matrices. The worker nodes process blocks sequentially, transmitting each partial per-block result to the master as it is completed. Sub-blocking allows a more continuous completion process, which lets us exploit the work of a much broader spectrum of processors and reduces computation time. We then apply this technique to matrix-matrix multiplication using product codes. In this case, we show that the order in which sub-tasks are computed is a new degree of design freedom that can be exploited to reduce computation time further. We propose a novel approach to analyzing the finishing time, which differs from typical order statistics. Simulation results show that the expected computation time decreases by a factor of at least two compared to previous methods.
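As a rough illustration of sub-blocked MDS coding for matrix-vector multiplication, the sketch below uses a real-valued Vandermonde generator matrix as a stand-in MDS code (any k of its rows are invertible when the nodes are distinct). The sizes, the helper names `encode` and `decode`, and the simulated per-worker progress are all assumptions made for the example, not the construction used in the paper.

```python
import numpy as np


def encode(A, k, n_coded):
    """Split A into k row blocks and form n_coded coded blocks."""
    blocks = np.split(A, k, axis=0)  # requires A.shape[0] % k == 0
    nodes = np.arange(1, n_coded + 1, dtype=float)
    G = np.vander(nodes, k, increasing=True)  # (n_coded, k) generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n_coded)]
    return G, coded


def decode(G, results, k):
    """Recover A @ x from any k completed coded products.

    results maps a coded-block index to the product (coded block @ x).
    """
    idx = sorted(results)[:k]
    Y = np.stack([results[i] for i in idx])  # (k, rows_per_block)
    B = np.linalg.solve(G[idx], Y)           # row j is (block j of A) @ x
    return B.reshape(-1)


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
x = rng.standard_normal(5)

k, workers, per_worker = 4, 3, 2             # 6 coded sub-blocks in total
G, coded = encode(A, k, workers * per_worker)

# Simulated progress: worker 0 finished both of its sub-blocks, while
# workers 1 and 2 (stragglers) finished only their first one.
done = [2, 1, 1]
results = {}
for w in range(workers):
    for b in range(done[w]):                 # sub-blocks are processed in order
        i = w * per_worker + b
        results[i] = coded[i] @ x            # partial result sent to the master

recovered = decode(G, results, k)
```

Here the master decodes from four partial results even though no more than one worker has finished its full assignment, which is the sense in which sub-blocking lets partial straggler work count. Real-valued Vandermonde matrices become ill-conditioned as k grows, so this is only suitable for small examples.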