“…quantization [30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45], and through compression, i.e., sparsity/pruning [46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64], enables us to optimize the NN significantly. In quantization, low precision is used to represent the weights and activations, whereas in pruning, connections are removed entirely.…”
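To make the distinction between the two approaches concrete, the following is a minimal NumPy sketch, not drawn from any of the cited works: uniform symmetric quantization maps float weights to low-precision integers plus a scale, while magnitude pruning zeroes out (removes) the smallest-magnitude connections. The function names `quantize_uniform` and `prune_by_magnitude` and the parameter choices are illustrative assumptions.

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Map a float tensor to num_bits signed integers plus a scale (symmetric)."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for int8
    scale = np.max(np.abs(w)) / qmax            # largest weight maps to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8), scale             # dequantize later as q * scale

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_uniform(w)                  # low-precision representation
w_dequant = q.astype(np.float32) * scale        # approximate reconstruction
w_sparse = prune_by_magnitude(w, sparsity=0.5)  # half the connections removed
```

Note the complementary trade-offs the sketch exposes: quantization keeps every connection but represents each one coarsely, while pruning keeps full precision on the surviving weights but discards connections outright.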