nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models
Preprint, 2022
DOI: 10.48550/arxiv.2206.09557

Abstract: The recent advance of self-supervised learning associated with the Transformer architecture enables natural language processing (NLP) to exhibit extremely low perplexity. Such powerful models demand ever-increasing model size, and thus, large amounts of computation and memory. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights by a non-uniform quantization method. Then, quantized matrix multipl…
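The abstract describes quantizing weights with a non-uniform method and then running quantized matrix multiplications. As a point of reference, the sketch below shows one common non-uniform scheme, binary-coding quantization, in which a weight vector is approximated as a sum of scaled {-1, +1} codes; the greedy fitting loop and function names here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def binary_coding_quantize(w, num_bits=3):
    """Greedy binary-coding quantization: approximate w as
    sum_i alpha_i * b_i with b_i in {-1, +1}^n.
    Illustrative sketch only, not the paper's exact algorithm."""
    residual = w.astype(np.float64)
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.sign(residual)
        b[b == 0] = 1.0                  # sign() returns 0 for exact zeros
        alpha = np.abs(residual).mean()  # least-squares scale for a sign code
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b  # greedily fit the remaining error
    return np.array(alphas), np.stack(codes)

def binary_coding_dequantize(alphas, codes):
    return (alphas[:, None] * codes).sum(axis=0)

w = np.random.randn(16)
alphas, codes = binary_coding_quantize(w, num_bits=3)
print("mean abs error:", np.abs(w - binary_coding_dequantize(alphas, codes)).mean())
```

Because the codes are binary, the downstream matrix multiplication can be reduced to scaled additions and subtractions (or table lookups), which is what makes such non-uniform schemes attractive for inference kernels.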

Cited by 13 publications (17 citation statements)
References: 24 publications
“…Multi-billion Scale Transformer Quantization. There are two methods that were developed in parallel to ours: nuQmm (Park et al, 2022) and ZeroQuant (Yao et al, 2022). Both use the same quantization scheme: group-wise quantization, which has even finer quantization normalization constant granularity than vector-wise quantization.…”
Section: Related Work (mentioning)
confidence: 99%
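The quoted statement contrasts group-wise quantization with vector-wise (per-row) quantization. A minimal sketch of symmetric group-wise quantization is given below, assuming contiguous groups within each weight row; the group size, bit width, and helper names are illustrative and not taken from nuQmm or ZeroQuant.

```python
import numpy as np

def groupwise_quantize(weights, group_size=128, num_bits=8):
    """Symmetric group-wise quantization: one scale per contiguous group of
    `group_size` weights inside each row, i.e. finer granularity than one
    scale per row (vector-wise) or per tensor. Illustrative sketch only."""
    qmax = 2 ** (num_bits - 1) - 1
    rows, cols = weights.shape
    assert cols % group_size == 0, "cols must be a multiple of group_size"
    groups = weights.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax  # per-group scale
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def groupwise_dequantize(q, scales):
    rows, n_groups, group_size = q.shape
    return (q.astype(np.float32) * scales).reshape(rows, n_groups * group_size)

w = np.random.randn(4, 256).astype(np.float32)
q, scales = groupwise_quantize(w, group_size=128, num_bits=8)
print("max abs error:", np.abs(w - groupwise_dequantize(q, scales)).max())
```

Setting group_size equal to the row length recovers vector-wise quantization, which is the coarser granularity the quote compares against.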
“…With the recent open-source releases of models like BLOOM [16] or OPT-175B [35], researchers have started to develop affordable methods for compressing such giant networks for inference. To our knowledge, all existing works (ZeroQuant [34], LLM.int8() [5], and nuQmm [24]) employ relatively simple quantization schemes based on rounding to the nearest (RTN) quantization level. This simple approach has the advantage of maintaining acceptable runtimes for very large models.…”
Section: Large-Model Quantization (mentioning)
confidence: 99%
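The statement above refers to rounding-to-nearest (RTN) quantization. A minimal sketch of per-column asymmetric RTN follows; because each weight is rounded independently in a single pass, the cost is linear in the number of parameters, which is the scalability property the quote highlights. The grid choice and function names are illustrative assumptions.

```python
import numpy as np

def rtn_quantize(weights, num_bits=4):
    """Round-to-nearest (RTN) quantization onto a per-column asymmetric
    min-max grid. Every weight is rounded independently in one pass, so the
    cost grows linearly with the parameter count. Illustrative sketch only."""
    levels = 2 ** num_bits - 1
    lo = weights.min(axis=0, keepdims=True)
    hi = weights.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(weights / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale  # dequantized values on the RTN grid

w = np.random.randn(1024, 8).astype(np.float32)
w_rtn = rtn_quantize(w, num_bits=4)
print("mean abs error:", np.abs(w - w_rtn).mean())
```

More accurate methods such as GPTQ, AdaRound, or BRECQ adjust weights jointly rather than rounding each one independently, which is why the following quote notes they are far slower at this scale.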
“…Our primary baseline, denoted by RTN, consists of rounding all weights to the nearest quantized value on the same grid that is also used for GPTQ. This is currently the method of choice in all works on quantization of very large language models [5,34,24]: its runtime scales well to networks with many billions of parameters since it simply performs direct weight rounding in a single pass. As we will also discuss in detail, more accurate methods, such as AdaRound [20] or BRECQ [17], are currently far too slow for models with many billions of parameters, the main focus of this work.…”
Section: The GPTQ Algorithm (mentioning)
confidence: 99%