nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models
Preprint, 2022
DOI: 10.48550/arxiv.2206.09557

Abstract: The recent advance of self-supervised learning associated with the Transformer architecture enables natural language processing (NLP) to exhibit extremely low perplexity. Such powerful models demand ever-increasing model size, and thus, large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights by a non-uniform quantization method. Then, quantized matrix multipl…
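The abstract describes quantizing weights with a non-uniform method and then running quantized matrix multiplications. As a point of reference, the sketch below shows one common non-uniform scheme, binary-coding quantization, in which a weight vector is approximated as a sum of scaled {-1, +1} codes; the greedy fitting loop and function names here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def binary_coding_quantize(w, num_bits=3):
    """Greedy binary-coding quantization: approximate w as
    sum_i alpha_i * b_i with b_i in {-1, +1}^n.
    Illustrative sketch only, not the paper's exact algorithm."""
    residual = w.astype(np.float64)
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0                  # sign() returns 0 for exact zeros
        alpha = np.abs(residual).mean()  # least-squares scale for a sign code
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b  # greedily fit the remaining error
    return np.array(alphas), np.stack(codes)

def binary_coding_dequantize(alphas, codes):
    return (alphas[:, None] * codes).sum(axis=0)

w = np.random.randn(16)
alphas, codes = binary_coding_quantize(w, num_bits=3)
print("mean abs error:", np.abs(w - binary_coding_dequantize(alphas, codes)).mean())
```

Because the codes are binary, the downstream matrix multiplication can be reduced to scaled additions and subtractions (or table lookups), which is what makes such non-uniform schemes attractive for inference kernels.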

Cited by 13 publications (17 citation statements)
References: 24 publications
“…Multi-billion Scale Transformer Quantization. There are two methods that were developed in parallel to ours: nuQmm (Park et al, 2022) and ZeroQuant (Yao et al, 2022). Both use the same quantization scheme: group-wise quantization, which has even finer quantization normalization constant granularity than vector-wise quantization.…”
Section: Related Work (mentioning)
confidence: 99%
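The quoted statement contrasts group-wise quantization with vector-wise (per-row) quantization. A minimal sketch of symmetric group-wise quantization is given below, assuming contiguous groups within each weight row; the group size, bit width, and helper names are illustrative and not taken from nuQmm or ZeroQuant.

```python
import numpy as np

def groupwise_quantize(weights, group_size=128, num_bits=8):
    """Symmetric group-wise quantization: one scale per contiguous group of
    `group_size` weights inside each row, i.e. finer granularity than one
    scale per row (vector-wise) or per tensor. Illustrative sketch only."""
    qmax = 2 ** (num_bits - 1) - 1
    rows, cols = weights.shape
    assert cols % group_size == 0, "cols must be a multiple of group_size"
    groups = weights.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax  # per-group scale
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def groupwise_dequantize(q, scales):
    rows, n_groups, group_size = q.shape
    return (q.astype(np.float32) * scales).reshape(rows, n_groups * group_size)

w = np.random.randn(4, 256).astype(np.float32)
q, scales = groupwise_quantize(w, group_size=128, num_bits=8)
print("max abs error:", np.abs(w - groupwise_dequantize(q, scales)).max())
```

Setting group_size equal to the row length recovers vector-wise quantization, which is the coarser granularity the quote compares against.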
“…With the recent open-source releases of models like BLOOM [16] or OPT-175B [35], researchers have started to develop affordable methods for compressing such giant networks for inference. To our knowledge, all existing works (ZeroQuant [34], LLM.int8() [5], and nuQmm [24]) employ relatively simple quantization schemes based on rounding to the nearest (RTN) quantization level. This simple approach has the advantage of maintaining acceptable runtimes for very large models.…”
Section: Large-Model Quantization (mentioning)
confidence: 99%
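The statement above refers to rounding-to-nearest (RTN) quantization. A minimal sketch of per-column asymmetric RTN follows; because each weight is rounded independently in a single pass, the cost is linear in the number of parameters, which is the scalability property the quote highlights. The grid choice and function names are illustrative assumptions.

```python
import numpy as np

def rtn_quantize(weights, num_bits=4):
    """Round-to-nearest (RTN) quantization onto a per-column asymmetric
    min-max grid. Every weight is rounded independently in one pass, so the
    cost grows linearly with the parameter count. Illustrative sketch only."""
    levels = 2 ** num_bits - 1
    lo = weights.min(axis=0, keepdims=True)
    hi = weights.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(weights / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale  # dequantized values on the RTN grid

w = np.random.randn(1024, 8).astype(np.float32)
w_rtn = rtn_quantize(w, num_bits=4)
print("mean abs error:", np.abs(w - w_rtn).mean())
```

More accurate methods such as GPTQ, AdaRound, or BRECQ adjust weights jointly rather than rounding each one independently, which is why the following quote notes they are far slower at this scale.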
“…Our primary baseline, denoted by RTN, consists of rounding all weights to the nearest quantized value on the same grid that is also used for GPTQ. This is currently the method of choice in all works on quantization of very large language models [5,34,24]: its runtime scales well to networks with many billions of parameters since it simply performs direct weight rounding in a single pass. As we will also discuss in detail, more accurate methods, such as AdaRound [20] or BRECQ [17], are currently far too slow for models with many billions of parameters, the main focus of this work.…”
Section: The GPTQ Algorithm (mentioning)
confidence: 99%