2020
DOI: 10.1109/tpds.2020.3041474
Model Parallelism Optimization for Distributed Inference via Decoupled CNN Structure

Cited by 21 publications (17 citation statements)
References 22 publications
“…Model parallelism [ 70 , 71 , 72 , 73 , 74 , 75 , 76 , 78 , 80 , 81 , 83 , 84 , 86 , 87 , 89 ] attempts to address both issues by employing partitioning strategies with a finer granularity to produce less expensive DNN subtasks, i.e., partitions with fewer parameters and fewer computation requirements than a layer, and foster more adaptable co-inference schemes. The computations required for a single input are distributed across multiple computing entities, reducing the time needed to process the shared input [ 76 ] but delivering a performance that, as opposed to what has been indicated for pipeline parallelism, is highly dependent on the distribution of such computations across devices.…”
Section: DNN Partitioning and Parallelism for Collaborative Inference
confidence: 99%
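The excerpt above describes intra-layer model parallelism: the computation for a single input is distributed across several devices, so each device processes only part of the work for that input. The following is a minimal sketch of this idea, not the method of the cited paper, using a naive NumPy convolution and two hypothetical devices that each hold half of the output filters.

```python
# Minimal sketch of intra-layer model parallelism (illustrative only):
# the output channels of one convolutional layer are split across two
# hypothetical "devices"; each computes its slice of the filters and the
# partial results are concatenated to recover the full output.
import numpy as np

def conv2d(x, w):
    """Naive valid convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                y[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))        # shared input broadcast to both devices
w = rng.standard_normal((16, 3, 3, 3))    # 16 output filters in total

y0 = conv2d(x, w[:8])                     # "device 0" computes filters 0..7
y1 = conv2d(x, w[8:])                     # "device 1" computes filters 8..15
y_parallel = np.concatenate([y0, y1], axis=0)

assert np.allclose(y_parallel, conv2d(x, w))  # matches the single-device result
```

As the excerpt notes, how these computations are balanced across devices largely determines the speedup; an uneven split leaves the faster device idle while it waits for the slower one.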
“…This has been mainly motivated by the very nature and the subsequent memory and computing demands of the different layer types considered. Analyses carried out in this context [ 70 , 76 , 83 ] have confirmed that, while convolutional and fully connected layers account for the majority of the memory footprint and computational complexity of CNN models, the effect of non-parametric layers, i.e., pooling layers and activation functions, in this regard and, therefore, their contribution to the overall computational cost of the network, can be considered negligible [ 71 ]. Thus, when partitioned, the non-parametric layers are typically grouped with their corresponding parent layer, i.e., the layer that generated their input, instead of receiving explicit treatment [ 76 ].…”
Section: DNN Partitioning and Parallelism for Collaborative Inference
confidence: 99%
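The grouping rule described above can be illustrated with a short sketch. The layer names and the helper below are hypothetical; the point is only that non-parametric layers (activations, pooling) never form partitions of their own but are attached to the parametric layer that produced their input.

```python
# Minimal sketch, under assumed layer names, of grouping non-parametric
# layers with their parent parametric layer when partitioning a CNN.
NON_PARAMETRIC = {"relu", "pool", "softmax"}

def group_with_parent(layers):
    """layers: ordered list of layer names; returns one list per partition unit."""
    partitions = []
    for layer in layers:
        if layer in NON_PARAMETRIC and partitions:
            partitions[-1].append(layer)   # attach to the preceding parent layer
        else:
            partitions.append([layer])     # a parametric layer starts a new unit
    return partitions

cnn = ["conv1", "relu", "pool", "conv2", "relu", "pool", "fc1", "relu", "fc2", "softmax"]
print(group_with_parent(cnn))
# [['conv1', 'relu', 'pool'], ['conv2', 'relu', 'pool'], ['fc1', 'relu'], ['fc2', 'softmax']]
```

Partitioning decisions then only need to consider the convolutional and fully connected units, which is consistent with the observation that non-parametric layers contribute negligibly to the overall computational cost.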
“…In addition, it impacts various performance metrics such as end-to-end latency, communication size, throughput, accuracy, memory footprint, and energy consumption. Existing research focuses on distributed CNN inference on resource-constrained IoT devices [5-10] to reduce latency, memory footprint, and communication size by using various approaches such as partial data upload to the cloud, training at the cloud with inference at the edge, distributed training in the edge-cloud, or fog-cloud training. Our extensive survey on DL at edge devices has identified the following limitations.…”
Section: Introduction
confidence: 99%
“…In addition, it impacts various performance metrics such as end-to-end latency, communication size, throughput, accuracy, memory footprint, and energy consumption. Existing research focuses on distributed CNN inference on resource-constrained IoT devices [5][6][7][8][9][10] to reduce […] (2) Require enormous computational resources such as processing capabilities and memory footprint for implementing intelligent edge systems. (3) Necessity of using compression techniques to optimize the existing approaches to best utilize the available resources for successful implementation of edge intelligent systems.…”
Section: Introduction
confidence: 99%
“…Even though methods such as model parallelism can be used to split the model between multiple GPUs during both the training [14], [15] and inference [16] phases, and thus avoid memory and latency issues, these methods require a large amount of resources, such as a large number of GPUs and servers, which can incur high costs, especially when working with extreme resolutions such as gigapixel images. Furthermore, in many applications, such as self-driving cars and drone image processing, there is a limit on the hardware that can be mounted, and offloading the computation to external servers is not always possible because of the unreliability of the network connection due to movement and the time-critical nature of the application.…”
Section: Introduction
confidence: 99%