“…Yet, unlike traditional workloads such as HPC processing [17,34] or MapReduce [31], ML computations are highly structured (with thousands of short iterations) and tensor-based (the samples in each iteration are packaged into a non-divisible matrix for fast processing). Static load-balancing approaches [51,73,78] are not aware of runtime resource variations, and dynamic approaches [13,41,82], which are mainly based on work stealing [12,17,34], are also deficient for ML workloads. First, work stealing usually requires fine-grained monitoring of worker progress and runtime load migration, both of which are inefficient for an iterative model training process.…”