2022 25th Euromicro Conference on Digital System Design (DSD)
DOI: 10.1109/dsd57027.2022.00048

PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices

Cited by 15 publications (3 citation statements)
References 44 publications

“…Large-scale DNN deployment and inference on heterogeneous edge devices and networks was designed by Hu et al. 16 It uses dynamic programming to find the optimal partition of the model under pipeline parallelism. Furthermore, Cai et al. 17 presented ParaTra, a transformer model inference framework that works on edge devices, which often have limited GPU resources.…”
Section: Related Work
confidence: 99%
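
The dynamic-programming partitioning quoted above can be illustrated with a minimal sketch. The per-layer costs, the per-device speeds, and the simplification of ignoring communication time are all assumptions made here for illustration; this is the general bottleneck-minimizing recursion, not PipeEdge's actual algorithm.

```python
from functools import lru_cache

# Hypothetical inputs: baseline cost per layer and relative speed per device.
layer_cost = [4.0, 6.0, 2.0, 8.0, 3.0]
device_speed = [1.0, 0.5, 2.0]
N, D = len(layer_cost), len(device_speed)

def stage_time(lo, hi, dev):
    """Time for device `dev` to run the contiguous layers [lo, hi)."""
    return sum(layer_cost[lo:hi]) / device_speed[dev]

@lru_cache(maxsize=None)
def best(lo, dev):
    """Minimum achievable bottleneck (max stage time) when layers [lo, N)
    remain and devices [dev, D) are still available, in order."""
    if lo == N:
        return 0.0           # all layers placed
    if dev == D:
        return float("inf")  # layers remain but no devices do
    options = [best(lo, dev + 1)]  # option: skip this device entirely
    for hi in range(lo + 1, N + 1):
        # Device `dev` runs layers [lo, hi); recurse on the remainder.
        options.append(max(stage_time(lo, hi, dev), best(hi, dev + 1)))
    return min(options)

print(f"minimized pipeline bottleneck: {best(0, 0):.2f}s")
```

In steady state, pipeline throughput is the reciprocal of the slowest stage, which is why the recursion minimizes the maximum stage time rather than the sum.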
“…Platform (framework) | Testing workloads | Partitioning method | # devices
[18] Raspberry Pi 3B (DarkNet) | YOLOv2 | Fused tile partitioning | 1-6
[33] LG Nexus 5 (MxNet) | VGG-16 | Biased one-dimensional partition | 2-4
[19] Raspberry Pi 3B+ (DarkNet) | VGG-16, YOLO | Fused-layer parallelization | 1-8
[30] MinnowBoard, RCC-VE Network Board (PyTorch) | ViT-Base, ViT-Large, ViT-Huge | Layer-level splitting | 16
[31] Raspberry Pi 3B + Jetson TX2 (Keras-TensorFlow) | … | … | …
…as TensorFlow, Keras, MXNet, PyTorch, and DarkNet. In addition, TVM includes auto-tuner tools, i.e., autoTVM [37] and Ansor [38], to automatically apply graph-level and operator-level optimizations, e.g., operation fusion or data transformations, to network graphs, generating highly efficient machine code.…”
Section: B Motivation
confidence: 99%
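
As context for the TVM remark in the quote, here is a minimal sketch of the Relay build flow; the tiny conv+ReLU graph is invented for the example, and opt_level=3 is the standard setting under which TVM applies graph-level passes such as operator fusion.

```python
import tvm
from tvm import relay

# Invented toy graph: conv2d followed by ReLU, a classic fusion candidate.
x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))

# opt_level=3 enables graph-level optimizations, including operator fusion,
# before machine code is generated for the chosen target.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm")
```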
“…Low-cost, low-energy devices, such as Raspberry Pi and Odroid, have also been employed in edge-cloud hybrid solutions with dynamic network conditions [15], [29]. Multi-node edge-edge solutions can also optimize network execution by applying layer fusion, data parallelism, or network partitioning over clusters of edge devices [18], [19], [30]-[33]. For instance, [34] proposes spatial and channel partitioning to parallelize convolutional layers across multiple devices using a dynamic programming-based search.…”
Section: Introduction
confidence: 99%
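
The spatial partitioning that the quote attributes to [34] can be sketched as follows. The shapes, the split point, and the two-worker setup are assumptions for illustration (the real system would also search the partition with dynamic programming and run the slices on separate devices): each slice carries one halo row so the 3x3 kernel sees correct borders, and the stitched result matches the unpartitioned convolution.

```python
import torch
import torch.nn.functional as F

# Invented shapes: one 3x3 conv, input split into two horizontal halves.
x = torch.randn(1, 3, 32, 32)
w = torch.randn(8, 3, 3, 3)
halo = 1  # a 3x3 kernel needs one extra border row from the neighbor

top = x[:, :, : 16 + halo, :]  # worker 0: rows 0..16 (includes halo row)
bot = x[:, :, 16 - halo :, :]  # worker 1: rows 15..31 (includes halo row)

# Each worker convolves its slice; rows contaminated by the wrong
# (zero-padded) border are dropped before stitching.
y_top = F.conv2d(top, w, padding=1)[:, :, :16, :]
y_bot = F.conv2d(bot, w, padding=1)[:, :, halo:, :]
y_split = torch.cat([y_top, y_bot], dim=2)

y_full = F.conv2d(x, w, padding=1)  # reference: unpartitioned conv
print(torch.allclose(y_split, y_full, atol=1e-5))  # expect True
```

Channel partitioning is analogous, splitting the weight tensor's output channels across workers instead of the input's spatial rows.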