Communication acceleration: Existing communication acceleration techniques include, but are not limited to: (1) leveraging high-throughput, low-latency communication links, such as RDMA [34], [35], [36], InfiniBand, Intel Omni-Path, and NVIDIA's NVLink; (2) utilizing the message passing interface (MPI) and MPI-like implementations such as OpenMPI and Gloo [37]; (3) using high-performance communication collectives, such as NCCL and BLink [38], which support efficient communication between GPUs and are integrated with many popular deep learning frameworks; (4) reducing the volume of data communicated during synchronization, for example via gradient quantization, compression, and sparsification [39], [40], [41], [42], [43], [44]; (5) using stale parameter updates to reduce the number of synchronized parameters, such as parameter freezing [45], [46], [47], Round-Robin Synchronous Parallel [48], and Bounded Staleness Parallel [49]; (6) tuning deep learning hyper-parameters, such as AutoByte [50]; (7) minimizing user-level overhead by performing parameter aggregation at the transport layer [13]; (8) improving network-layer performance, such as network-level flow scheduling [51], [52] and congestion control [53].

Communication scheduling: Owing to the layer-wise and tensor-wise structure of DNNs, a line of work seeks to maximize the overlap of communication and computation.
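As a concrete illustration of this overlap, the following minimal sketch (in PyTorch, assuming a torch.distributed process group has already been initialized, e.g., via torchrun) starts an asynchronous all-reduce for each parameter's gradient as soon as autograd produces it, so communication for deeper layers proceeds while the backward pass is still computing gradients for earlier layers. The function name and hook-based structure are illustrative choices, not the mechanism of any particular surveyed system.

import torch
import torch.distributed as dist

def backward_with_overlapped_allreduce(model: torch.nn.Module, loss: torch.Tensor):
    """Overlap gradient all-reduce with backward computation.

    Assumes each parameter receives its gradient exactly once per backward pass.
    """
    handles = []

    def hook(param):
        # async_op=True returns a work handle immediately; the all-reduce runs
        # in the background while backward continues on the remaining layers.
        handles.append(dist.all_reduce(param.grad, op=dist.ReduceOp.SUM,
                                       async_op=True))

    hook_handles = [p.register_post_accumulate_grad_hook(hook)
                    for p in model.parameters() if p.requires_grad]

    loss.backward()

    # Drain all outstanding all-reduces, then average before the optimizer step.
    for work in handles:
        work.wait()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world_size)

    for h in hook_handles:
        h.remove()

Production systems such as PyTorch DDP additionally bucket small gradients into larger messages before reducing them; the per-parameter scheme above is kept deliberately simple to expose the overlap itself.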