2021
DOI: 10.48550/arxiv.2106.14681
Preprint

PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation

Abstract: As edge devices become prevalent, deploying Deep Neural Networks (DNNs) on edge devices has become a critical issue. However, DNNs require high computational resources, which are rarely available on edge devices. To handle this, we propose a novel model compression method for devices with limited computational resources, called PQK, consisting of pruning, quantization, and knowledge distillation (KD) processes. Unlike traditional pruning and KD, PQK makes use of unimportant weights pruned in the pruning proc…
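
The abstract only sketches the pipeline, so a rough illustration may help. The snippet below is a minimal, hypothetical sketch of the prune-then-distill part in PyTorch; it is not the paper's code. SmallNet, magnitude_prune, kd_loss, and the choice of keeping a full-weight copy as the teacher are illustrative assumptions; PQK's actual teacher construction (reusing the pruned-away weights) and its quantization step follow the paper, not this toy.

```python
# Hedged sketch of a prune -> distill flow; names and layer sizes are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

def magnitude_prune(model: nn.Module, ratio: float = 0.5) -> dict:
    """Zero out the smallest-magnitude weights in each 2D weight tensor."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue
        k = max(1, int(p.numel() * ratio))
        threshold = p.detach().abs().flatten().kthvalue(k).values
        mask = (p.detach().abs() > threshold).float()
        p.data.mul_(mask)          # keep only the "important" weights
        masks[name] = mask
    return masks

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL distillation term plus cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = SmallNet()
teacher = copy.deepcopy(student)     # crude stand-in: teacher keeps the full weight set
magnitude_prune(student, ratio=0.5)  # student keeps only the large-magnitude weights

x = torch.randn(8, 784)
y = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```

In this toy setup the teacher is simply an unpruned copy of the student, which avoids pre-training a separate teacher; PQK motivates this idea by reusing the weights discarded during pruning rather than a plain copy.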

Cited by 5 publications (6 citation statements)
References 18 publications
“…Other approaches, such as pruning and quantization, are capable of balancing the trade-off between accuracy and compression ratio. Some approaches [140], [227], [175] also combine multiple compression techniques: knowledge distillation, pruning, and quantization, to achieve a better accuracy and compression ratio.…”
Section: E. Energy Efficient Approaches in Autonomous Driving (mentioning)
confidence: 99%
“…Cui and Li, the architects of [18], unveil a complex model compression approach that combines structural pruning with dense knowledge distillation for large language models. Kim et al [19] address the needs of edge devices with PQK, an innovative combination of pruning, quantization, and knowledge distillation. A structured progression of pruning, quantization, and distillation provides a comprehensive strategy for efficient edge-based model deployment.…”
Section: Combination of Pruning and Knowledge Distillation (mentioning)
confidence: 99%
“…The commonly used model compression techniques [23][24][25][26][27] are quantization [28][29][30][31][32][33][34] and pruning [35][36][37][38][39] (weight pruning, layer pruning, filter pruning, and channel pruning). Other approaches are KD, 40 layer partitioning, channel partitioning, spatial partitioning, and workload partitioning. Layer-wise, fused layer parallelization, 41,42 and fused layer partitioning 5 are the few other strategies for implementing DL at the edge.…”
Section: Background and Literature Survey (mentioning)
confidence: 99%
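
Among the techniques listed in the excerpt above, post-training dynamic quantization is the easiest to demonstrate. Below is a minimal sketch using PyTorch's torch.ao.quantization.quantize_dynamic API; the two-layer toy model is an assumption standing in for any trained network and is not the quantization scheme used in PQK itself.

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network; layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: weights of nn.Linear layers are stored
# as int8 and dequantized on the fly; activations stay in floating point.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # torch.Size([1, 10])
```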