Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
DOI: 10.1145/3503222.3507723

AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Abstract: This work reveals that memory-intensive computation is a rising performance-critical factor in recent machine learning models. Due to a unique set of new challenges, existing ML optimizing compilers cannot perform efficient fusion under complex two-level dependencies combined with just-in-time demand. They face the dilemma of either performing costly fusion due to heavy redundant computation, or skipping fusion, which results in a massive number of kernels. Furthermore, they often suffer from low parallelism due to …
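To make the abstract's fusion dilemma concrete, here is a minimal NumPy sketch (the layer-norm-like op chain, tensor size, and epsilon constant are illustrative choices, not taken from the paper) of a memory-intensive chain whose elementwise consumers depend on reductions, i.e. a two-level dependency:

```python
import numpy as np

x = np.random.rand(1 << 20).astype(np.float32)

# Memory-intensive chain with a two-level dependency: the elementwise
# consumers depend on a reduction whose single value is broadcast back
# to every element. Executed op by op, each step is its own kernel and
# every intermediate array makes a round trip through global memory.
mu = x.mean()                        # kernel 1: reduction
centered = x - mu                    # kernel 2: full intermediate written back
var = np.mean(centered * centered)   # kernel 3: reduction over that intermediate
y = centered / np.sqrt(var + 1e-5)   # kernel 4: final elementwise pass

# The dilemma: fusing the consumers into one kernel forces each thread
# block to redundantly recompute the reductions it depends on, while
# skipping fusion leaves four kernel launches and three full-array
# round trips. (NumPy does not fuse at all; the sketch only shows the
# access pattern an ML compiler has to optimize.)
y_ref = (x - x.mean()) / np.sqrt(x.var() + 1e-5)
assert np.allclose(y, y_ref, atol=1e-4)
```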

Cited by 38 publications (4 citation statements) · References 46 publications
“…On ThunderX2, nDirect delivers slightly lower performance than Ansor for end-to-end inference, with a speedup of 0.88× to 0.98×. Ansor's better performance on the whole CNN is due to its ability to optimize across CNN layers through operator fusion [67,72]. This technique eliminates write-back operations for intermediate results and the corresponding fetch operations in the CNN pipeline, further reducing memory-access latency and bandwidth pressure to improve end-to-end CNN performance.…”
Section: End-to-end Inference Time
confidence: 99%
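As a rough illustration of the write-back/fetch elimination this citation describes, a NumPy sketch (the bias-plus-ReLU stage pair and the tensor shape are invented for illustration, not taken from the cited work):

```python
import numpy as np

act = np.random.rand(64, 56, 56).astype(np.float32)  # hypothetical activation tensor
bias = np.float32(0.1)

# Unfused pipeline: stage 1 writes a full intermediate tensor back to
# memory, and stage 2 must fetch it again.
tmp = act + bias                    # write-back of the intermediate
out_unfused = np.maximum(tmp, 0.0)  # extra fetch of that intermediate

# Fused pipeline: both ops are applied in one pass, so on a GPU the
# intermediate stays in registers and never touches global memory.
# (NumPy still materializes the temporary internally; the sketch only
# contrasts the two access patterns.)
out_fused = np.maximum(act + bias, 0.0)

# Full-tensor memory passes:
#   unfused: read act, write tmp, read tmp, write out -> 4 passes
#   fused:   read act, write out                      -> 2 passes
assert np.array_equal(out_unfused, out_fused)
```

Halving the number of full-tensor passes is exactly why fusion pays off for memory-bound stages: the arithmetic is unchanged, only the traffic shrinks.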
“…We use blocks as the granularity for adding exits, because the block is the basic unit/step that mathematically compresses information into a low-dimensional space [13]. In addition, many graph optimizations are performed between layers within the same block [14,53]. Adding an exit inside a block invalidates those graph optimizations and incurs a high cost of data movement between stages, as data dependencies within a block are strong.…”
Section: Building a Multi-exit Model
confidence: 99%
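To show what block-granularity exits look like in control flow, here is a minimal NumPy sketch (the dense-layer blocks, exit heads, dimensions, and confidence threshold are all hypothetical, not taken from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, w):
    """One 'block' of the network (hypothetical: a dense layer + ReLU).
    Graph optimizations such as operator fusion happen *within* a block,
    so exits are attached only at block boundaries."""
    return np.maximum(x @ w, 0)

def exit_head(x, w):
    """Lightweight classifier attached after a block; returns class
    probabilities via softmax."""
    logits = x @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical 3-block model with an exit after each block.
dims = [32, 32, 32, 32]
blocks = [rng.standard_normal((dims[i], dims[i + 1])) * 0.1 for i in range(3)]
exits = [rng.standard_normal((dims[i + 1], 10)) * 0.1 for i in range(3)]

def predict(x, threshold=0.6):
    """Run block by block; stop at the first exit whose top-1 confidence
    clears the threshold, skipping the remaining blocks."""
    for w_block, w_exit in zip(blocks, exits):
        x = block(x, w_block)
        probs = exit_head(x, w_exit)
        if probs.max() >= threshold:
            return probs.argmax(), probs.max()
    return probs.argmax(), probs.max()  # fall through to the final exit

label, conf = predict(rng.standard_normal(32))
```

Because every exit sits at a block boundary, each block's internal graph remains a single fusable unit for the compiler.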
“…MLModelCI [20] provided a one-stop platform for multimedia developers to deliver efficient ML services, while NSML [21] created a collaborative environment for users to deploy their own commercial services. AStitch [22] improved the execution efficiency of ML tasks through compiler optimization, avoiding unnecessary redundant computation. FedAMP [23] promoted pairwise collaboration between clients with similar data to significantly improve federated learning performance.…”
Section: Related Work
confidence: 99%