Radio: Rate–Distortion Optimization for Large Language Model Compression
Sean I. Young 1,2
1 Martinos Center, Harvard Medical School, Boston, MA, USA. 2 Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA, USA. Correspondence to: Sean I. Young <siyoung@mit.edu>.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
Abstract
In recent years, the compression of large language
models (LLMs) has emerged as a key problem in
facilitating LLM deployment on resource-limited
devices, reducing compute costs, and mitigating
the environmental footprint due to large-scale AI
infrastructure. Here, we establish the foundations
of LLM quantization from a rate–distortion theory
perspective and propose a quantization technique
based on simple rate–distortion optimization. Our
technique scales to models containing hundreds of
billions of weight parameters and offers users the
flexibility to compress models, post-training, to a
model size or accuracy specified by the user.
1. Introduction
Large Language Models (LLMs) have become a universal
framework for solving a wide range of problems in natural
language processing, ranging from text summarization and
translation to conversational AI. While LLMs have already
surpassed traditional methods in many of these tasks, they
involve tens to hundreds of billions of weight parameters,
rendering their deployment onto devices with limited
resources challenging—model weights and activations far
exceed the available on-chip memory so that weights need
to be loaded from the off-chip (global) memory throughout
inference, rendering LLM inference memory-bound (Yuan
et al., 2024). This greatly hinders the applicability of LLMs
particularly in time-sensitive applications and exacerbates
the environmental footprint of large-scale AI infrastructure
required by LLMs.
One way to reduce the memory requirements of LLMs for
inference is by simplifying the model representation post-
training. Quantization of the model weights and activation
has proven to be particularly apt at compressing models to
low bit depths or even arbitrary model sizes (Dettmers et
al., 2022; Yao et al., 2022; Frantar et al., 2022; Frantar &
Alistarh, 2022; Kim et al., 2024; Shao et al., 2024; Lee et
al., 2024; Guan et al., 2024). Using state-of-the-art model
quantization techniques, it is possible to compress 10–100
billion-parameter LLMs to 3–4 bits per weight on average
and incur only a negligible loss in model accuracy (Chee et
al., 2024; Frantar et al., 2022; Lin et al., 2024). This enables
LLM inference on a single consumer-grade GPU.
Despite the practical advances made in model quantization
methods in recent years, rate–distortion theoretic aspects of
LLM quantization are seldom studied in earlier works. By
far, the most extensively studied and extended framework
for LLM quantization is Optimal Brain Compression and
GPTQ of Frantar et al. (2023), itself an adaptation of the
classic Optimal Brain Surgeon (OBS) algorithm (Hassibi &
Stork, 1992). Since OBS operates outside the framework
of rate and distortion—which serves as the basis of many
related compression problems—a rate–distortion theoretic
characterization and exposition of LLM compression can
significantly enhance our understanding of the problem at
hand and guide the design of LLM compression methods.
In this paper, we tackle the problem of LLM compression
using a ratedistortion framework. We begin by analyzing
how a model’s weights should be quantized to maximize
quantized model accuracy at a given average bit depth (bit
rate). After this, we propose a stochastic gradient descent-
type method to solve this optimization problem exactly and
efficiently, post-training—in minutes for billion-parameter
models and in a few hours for 10–100-billion-parameter
models. Compared with OPTQ and its extensions (Frantar
et al., 2022; Frantar & Alistarh, 2022; Huang et al., 2024;
Lee et al., 2024; van Baalen et al., 2024), in which weights
must be fine-tuned during quantization, our rate–distortion
framework more simply determines optimal bit depths and
uses an integer rounding heuristic for the actual quantization
once the optimum bit depths have been determined. This renders
our framework also suited for quantizing the intermediate
activations, which can further reduce the memory footprint
of batched inference.
More specifically, our contributions are as follows.
• We formulate a rate–distortion theoretic framework for rate–distortion optimal quantization of LLMs.
• We design a stochastic ascent algorithm for solving the resulting rate–distortion optimization problem.
• We quantize LLMs across model types and sizes and show the rate–distortion behavior of quantized LLMs.
2. Previous Work
Earlier work on neural network model quantization can be
traced back to Vanhoucke et al. (2011), who demonstrated
that 8-bit integer arithmetic is sufficient for neural network
training and inference without incurring a significant loss
of model accuracy. In general, quantization-aware training
(QAT) (Zhou et al., 2017; Jacob et al., 2018; D. Zhang et
al., 2018; Esser et al., 2019; Y. Choi et al., 2017; Wang et
al., 2019) integrates the quantization process into training
by allowing the model to adapt to the reduced bit precision
in weights (Esser et al., 2019; Jacob et al., 2018; D. Zhang
et al., 2018; Zhou et al., 2017) and activations (Y. Choi et
al., 2017; Wang et al., 2019), determining the optimal bit
depth (Wang et al., 2019; D. Zhang et al., 2018) and step
size (Esser et al., 2019) using backpropagation to facilitate
the flow of gradient through to quantization operators. One
shortcoming of QAT methods is that model training needs
to be repeated for different quantized model bit depths and
accuracy, which can render them less suited for quantizing
larger neural network models such as LLMs.
More recent quantization methods for language and vision
models aim to facilitate compression of pre-trained models
for rapid deployment without additional training (Dong et
al., 2019; Chen et al., 2021; Dettmers et al., 2022; Yao et
al., 2022; Frantar et al., 2022; Dettmers et al., 2023; Xiao
et al., 2023; Lin et al., 2024; Kim et al., 2024; Shao et al.,
2024; Lee et al., 2024). These methods quantize weights to
3–4 or 8 bits for integer-arithmetic-only inference (Jacob et
al., 2018) by using mixed bit depth quantization (Wang et
al., 2019; Chen et al., 2021) or by a separate handling of
outlier channels (Zhao et al., 2019) to improve the accuracy
of the quantized model. Loss-aware quantization methods
(Hou & Kwok, 2018; Nahshan et al., 2020; Qu et al., 2020)
seek to minimize the accuracy loss in quantized models by
calibrating quantization and biases on a set of calibration
examples. Data-free quantization (Nagel et al., 2019; Xu et
al., 2020; K. Choi et al., 2021; Qian et al., 2023) attempts
to remove the need for real calibration data by matching the
distribution of weights instead (Nagel et al., 2019) or using
synthetic data in place of real calibration data (K. Choi et
al., 2021).
For the compression of LLMs in particular, an extension to
the Optimal Brain Surgeon (OBS) algorithm (Hassibi &
Stork, 1992), referred to as GPTQ (Frantar et al., 2022), has
been proposed for compressing 10–100 billion parameter
models. More recent extensions (Dettmers et al., 2023; Lee
et al., 2024) to GPTQ incorporate the handling of the more
sensitive model weights by re-scaling them or by retaining
the original weight values similar to (Lin et al., 2024; Xiao
et al., 2023), low-rank decomposition of quantization error
matrices (Shao et al., 2024) as well as orthogonal transform
of weight matrices prior to their quantization (Ashkboos et
al., 2024). While mixed-precision weight quantization is a
promising paradigm for handling weights with different
sensitivity, current mixed-precision approaches (Wang et
al., 2019; Chen et al., 2021; Lee et al., 2024; Dettmers et
al., 2023) assign different bit depths from a limited set of
bit-depth options (e.g., 4 or 16 bits) or only across different
layers. This is due to the combinatorial nature of mixed bit
depth assignment and limits the attainable quantized model
accuracy especially for LLMs with hundreds of billions of
parameters. Appealing to a rate–distortion framework can
not only overcome the combinatorial nature of determining
the optimal bit depths (0, 1, . . . , 8 bits) at a finer level of
granularity (e.g., per channel or per weight group) but also
enhance our understanding of the LLM compression problem
through the lens of rate–distortion theory, which has been
fundamental to understanding and solving the quantization
problem in image, audio and video domains. Extensions to
transform coding and activation quantization are discussed
in the sequel.
3. Quantization Framework
Here, we use the task of next-token prediction in language
modeling as a running example. For our purposes, the end-
to-end mapping of input token embeddings to its predicted
next-token embeddings by a pretrained language model can
be expressed in the most general form as
= () = (,
1
, . . . ,
𝑁
,
1
, . . . ,
𝑁
),
(1)
in which
𝐿×𝐸
denotes a sequence of tokens, each
one residing in some -dimensional embedding space, and
𝐿×𝐸
are embeddings of predicted next tokens. The
th block of weight matrices (
𝑚𝑀 +1
, . . . ,
(𝑚+1)𝑀
) and
the bias vectors (
𝑚𝑀 +1
, . . . ,
(𝑚+1)𝑀
) parametrize the th
transformer block, which refines the embeddings produced
by the (1)th transformer block. LLM frameworks used
in language modeling typically also require an embedding
matrix
0
𝐸×𝑉
and prediction head
𝑁 +1
𝑉 ×𝐸
to
convert between embeddings and tokens from a vocabulary
of some size . In this work, we focus on the compression
of transformer block weights as customarily done in most
model weight quantization work (Frantar et al., 2022; Lee
et al., 2024; Lin et al., 2024).
To get a sense of the number of weight matrices and their
sizes in a typical language model, the 13 billion-parameter
model in the Meta OPT family contains N = 240 weight matrices in blocks of M = 6, with each block comprising 12E^2 weights in an embedding dimension E = 5120. The embedder and prediction head are jointly parameterized by one matrix containing EV weights, where the vocabulary size V = 50272. Each transformer block also contains 9E bias parameters, but due to their relative scarcity, these bias parameters can be communicated losslessly and still have little to no impact on the overall compression performance (Frantar et al., 2022).

Figure 1: Optimum bit depths. Consider two weight matrices whose distortion functions are given by d_1 and d_2, where d_n(B_n) = G_n^2 S_n^2 2^{−2B_n}. For any value of the dual variable V, the optimal bit depths B_1 and B_2 are found where the derivative of d_1 (resp. d_2) equals −V (left). These points correspond to the intersections between V and −d_n′ = (2 ln 2) d_n (right). Integerized bit depths lie on −d_n′.
Notionally, the elements of Θ_n are continuously valued, so they require quantization for efficient communication and storage. Compared with the vector quantization techniques of (Egiazarian et al., 2024; Gong et al., 2015; van Baalen et al., 2024), scalar quantization (Frantar et al., 2022; Lin et al., 2024) simplifies decoding and even enables operations directly on quantization indices, which obviates the need for a separate dequantization process. If mid-rise uniform scalar quantization is used, dequantization of a weight θ at a bit depth of B bits and a step size S can be expressed as

\theta^{q}(B, S) = S \left( \operatorname{clip}\!\left( \lfloor \theta / S \rfloor,\, -2^{B-1},\, 2^{B-1} - 1 \right) + 2^{-1} \right)   (2)

for B = 0, 1, 2, . . . , with θ^q = 0 when B = 0 for notational convenience. The problem of compressing a model then boils down to determining the optimal bit depth B and the quantization step S for each group of model weights. It is impractical, of course, to determine a separate (B, S) for each weight in the model since the cost of signaling the choice of (B, S) for each one would greatly exceed the bit savings derived from quantization. Typically, a (B, S) pair is used to quantize a small group of weights (e.g., an entire matrix or rows or columns thereof), in which case the cost of signaling (B, S) is borne by a group of quantized weight parameters as a negligible per-weight overhead.
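As a concrete reference, the following minimal NumPy sketch (our own illustration, not the released implementation) applies the mid-rise quantizer of (2) to a group of weights sharing a single (B, S) pair; the B = 0 convention maps the whole group to zero, and the full-range step-size choice in the example is the naive RTN-style one discussed in Section 3.2.

import numpy as np

def quantize_midrise(theta, B, S):
    """Quantize weights with a 2^B-level mid-rise uniform quantizer of step S;
    returns integer indices (our reading of eq. (2))."""
    if B == 0:
        return np.zeros_like(theta, dtype=np.int32)  # 0-bit group: weights pruned to zero
    idx = np.floor(theta / S).astype(np.int32)
    return np.clip(idx, -2 ** (B - 1), 2 ** (B - 1) - 1)

def dequantize_midrise(idx, B, S):
    """Reconstruct weights from indices: S * (index + 1/2), the mid-rise levels."""
    if B == 0:
        return np.zeros_like(idx, dtype=np.float32)
    return S * (idx.astype(np.float32) + 0.5)

# Example: 3-bit quantization of a small weight group.
rng = np.random.default_rng(0)
theta = rng.standard_normal(8).astype(np.float32)
S = (theta.max() - theta.min()) / 2 ** 3            # naive full-range step size (RTN-style)
theta_q = dequantize_midrise(quantize_midrise(theta, 3, S), 3, S)
print(np.abs(theta - theta_q).max())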
3.1. Bit Depth Assignment
Suppose we want to compress f by quantizing each matrix Θ_n, containing P_n elements, according to its own bit depth B_n and step size S_n(B_n). Generally speaking, weights that are more sensitive to output distortion should be allocated more bits to balance the scales—while the total number of bits is kept under some model bit budget. We can formalize this notion by expressing the quantization problem at hand as a constrained least-squares problem:

\min_{\{B_n\}} \; D(\{B_n\}) = \sum_{\mathbf{X}} \left\| f\!\left(\mathbf{X}, \{\Theta_n^{q}(B_n)\}_{n=1}^{N}\right) - f(\mathbf{X}) \right\|^2 \quad \text{s.t.} \quad R(\{B_n\}) = \sum_{n=1}^{N} P_n B_n - R \sum_{n=1}^{N} P_n = 0,   (3)

in which R is a user-specified average model bit depth (or bit rate) and Θ_n^q(B_n) = Θ_n^q(B_n, S_n(B_n)) for brevity. This is a problem similar to optimal resource allocation, where the objective is to maximize some utility (minimizing output distortion in this context) by optimally spending down a given budget (the total number of bits). In this section and the next, we provide insights into problem (3) and discuss its optimization and solution; see Algorithm 1.
To apply the machinery of numerical optimization, we can relax the discrete constraint on the bit depths B_1, . . . , B_N of (3) while solving the problem and round off the solution B_1, . . . , B_N to the nearest integers after we have obtained them. Expressing the Lagrangian of (3) as L({B_n}, V) = D({B_n}) + V R({B_n}), where V is the dual variable for the equality constraint of (3), we set equal to 0 the partials of L with respect to B_1, . . . , B_N and V. This yields the first-order, rate–distortion optimality conditions

\frac{1}{P_n} \frac{\partial D(\{B_n\})}{\partial B_n} = -V \;\; \text{for all } n, \qquad R(\{B_n\}) = 0,   (4)

so problem (3) can be solved by alternately updating the bit depths B_1, . . . , B_N (primal variables) and the trade-off V (dual variable) until all optimality conditions are met. In other words, the optimality conditions are reached once the marginal decrease in the output distortion from assigning an infinitesimal bit is equal across layers at −V and once we have assigned exactly R bits per weight on average.
Since the quantization function (2) is constant a.e., a naive computation of the partial derivatives of D with respect to B_1, . . . , B_N using the chain rule of differentiation does not provide a useful direction for descent. A classic result from rate–distortion theory (Gersho & Gray, 1991) is that, for any random variable with finite variance, its quantization error decreases by half with every additional bit at a sufficiently high bit depth. Specializing this result to our problem, one can verify that (Appendix B)

-\frac{1}{2 \ln 2} \frac{\partial D(\{B_n\})}{\partial B_n} \approx \sum_{\mathbf{X}} \left\| \frac{\partial f(\mathbf{X}, \{\Theta_n^{q}(B_n)\}_{n=1}^{N})}{\partial \Theta_n} \odot \left( \Theta_n - \Theta_n^{q}(B_n) \right) \right\|^2 \approx P_n \kappa_n G_n^2 S_n^2\, 2^{-2 B_n} =: P_n d_n(B_n),   (5)
Algorithm 1. Radio: Rate–Distortion Optimization for LLM Compression

Input: f(·, Θ_1, . . . , Θ_N) (model), {X} (calibration set), R (target bit rate), B_max = 8 (max bit depth)
Output: B_1, . . . , B_N (bit depths), S_1, . . . , S_N (weight scales), µ_1, . . . , µ_N (weight means)
Initialize: U ← pca_basis({X}) ∈ R^{E×E′}, V ← 10^{−6},
    B_n ← 0, G_n^2 ← 0, µ_n ← mean(Θ_n), S_n ← std(Θ_n), Θ_n^q ← Θ_n, b_n^q ← b_n, X̄_n ← 0 for all n
for iter in 1, . . . , max_iter do
    for X in minibatch do
        Z, X_1, . . . , X_N ← f(X, Θ_1^q, . . . , Θ_N^q, b_1^q, . . . , b_N^q)
        X̄_n ← (1 − α) X̄_n + (α/L) 1^T X_n for all n
        Γ_1, . . . , Γ_N ← autograd(S^T Z U, Θ_1^q, . . . , Θ_N^q)
        G_n^2 ← (1 − α) G_n^2 + (α/P_n) trace(Γ_n^T Γ_n) for all n
    for _ in 1, . . . , 10 do
        B_n ← clamp((1/2) log2(2 ln 2 · G_n^2 S_n^2 / V), 0, B_max) for all n
        V ← V + β (Σ_n P_n B_n − R Σ_n P_n)
    Θ_n^q ← compand_quantize(Θ_n, B_n, S_n, µ_n), b_n^q ← b_n + (Θ_n − Θ_n^q) X̄_n for all n
in which G_n^2 and S_n^2 represent the variances of the elements of ∂f(X, Θ_1^q, . . . , Θ_N^q)/∂Θ_n and of Θ_n^q, respectively, and κ_n is a quantization coefficient that depends only on the type of weight distribution, with κ_n = 1.42 for Gaussian weights, 0.72 for Laplace, etc. (Gersho & Gray, 1991). Assuming weights are similarly distributed across layers so that κ_1 = · · · = κ_N, the factors κ_n and the constant 1/(2 ln 2) can be removed from the above expression without affecting the solution of (3).
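Concretely, substituting the approximation (5) into the stationarity condition in (4) gives the closed-form primal update that appears in (6) and in Algorithm 1 (shown here with the distribution factor κ_n retained):

\frac{1}{P_n}\frac{\partial D(\{B_n\})}{\partial B_n} \approx -(2\ln 2)\,\kappa_n G_n^2 S_n^2\, 2^{-2B_n} = -V
\;\Longrightarrow\; 2^{-2B_n} = \frac{V}{(2\ln 2)\,\kappa_n G_n^2 S_n^2}
\;\Longrightarrow\; B_n = \tfrac{1}{2}\log_2\!\frac{(2\ln 2)\,\kappa_n G_n^2 S_n^2}{V},

which is clamped to [0, B_max] in practice, with κ_n dropped when it is identical across layers.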
Coupled with the above closed-form expression (5) for the partial derivatives, the optimality conditions (4) naturally lend themselves to dual ascent methods for solving (3). The idea underlying dual ascent (Boyd et al., 2011) is to alternately update the primal variables B_1, . . . , B_N and the dual variable V, with one set held fixed while updating the other. After initializing B_1 = · · · = B_N = 0 and V to some small positive number, and computing G_1^2, . . . , G_N^2, both sets of variables {B_n}, V can be updated iteratively via

B_n \leftarrow \operatorname{clamp}\!\left( \tfrac{1}{2} \log_2\!\left( \frac{(2 \ln 2)\, G_n^2 S_n^2}{V} \right),\, 0,\, B_{\max} = 8 \right), \qquad V \leftarrow V + \beta \left( \sum_{n=1}^{N} P_n B_n - R \sum_{n=1}^{N} P_n \right),   (6)

in which β represents a step size for the dual update. Figure 1 illustrates the optimality conditions for bit depths. With G_n^2 and S_n^2 fixed, dual ascent steps (6) converge within a few iterations (tol = 10^{−6} bit, step size β = 2), after which the obtained B_n are rounded to integers. The non-linear nature of the least-squares objective (3) means that iteration (6) needs to be repeated once the bit depths B_n are updated. Using the updated B_n, the quantized weights Θ_n^q(B_n) are obtained along with the new gradient variances G_n^2, based on which the variables {B_n}, V can be further updated via (6).
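A minimal sketch of this alternating update, assuming the per-group element counts P_n and the variances G_n^2, S_n^2 have already been accumulated; the names follow Algorithm 1, but for numerical robustness in this toy setting we take the dual step on log V rather than the additive step on V written in (6).

import numpy as np

def assign_bit_depths(P, G2, S2, R, B_max=8.0, V=1e-6, beta=2.0, tol=1e-6, max_iter=100):
    """Alternate the primal (bit-depth) and dual (V) updates until the average
    bit depth matches the target rate R to within tol bits."""
    P, G2, S2 = map(np.asarray, (P, G2, S2))
    B = np.zeros_like(G2)
    for _ in range(max_iter):
        # Primal update: closed-form bit depths for the current dual variable V.
        B = np.clip(0.5 * np.log2(2.0 * np.log(2.0) * G2 * S2 / V), 0.0, B_max)
        # Average over/under-shoot of the bit budget, in bits per weight.
        excess = (np.sum(P * B) - R * np.sum(P)) / np.sum(P)
        if abs(excess) < tol:
            break
        # Dual step on log V (the paper's update (6) is additive in V itself).
        V = V * 2.0 ** (beta * excess)
    return np.rint(B).astype(int), V

# Example: three weight groups with unequal sensitivities, 4 bits per weight on average.
B, V = assign_bit_depths(P=[4096, 4096, 16384], G2=[1e-4, 4e-4, 1e-5],
                         S2=[0.01, 0.02, 0.01], R=4.0)
print(B, V)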
Evaluating ∂f(X, {Θ_n^q(B_n)}_{n=1}^N)/∂Θ_n across the entire set of calibration examples at every iteration can be prohibitively expensive due to the dimensions of f(X) ∈ R^{L×E} and the cost of back-propagating its elements through f. We can overcome this difficulty by performing PCA on f(X) along the embedding dimension (of size E) and sub-sampling across the token dimension (of size L), and accumulating the variances of gradients by back-propagating only a batch of examples at every iteration:

G_n^2 \leftarrow G_n^2 + \sum_{\mathbf{X} \sim \text{batch}} \left\| \frac{\partial\, \mathbf{S}^{T} f(\mathbf{X}, \{\Theta_n^{q}(B_n)\}_{n=1}^{N})\, \mathbf{U}}{\partial \Theta_n} \right\|^2,   (7)

in which S^T and U denote the sub-sampling and PCA projection operators, respectively. In practice, we accelerate the variance accumulation further by cycling through PCA coefficients and back-propagating only one coefficient per sample in every minibatch.
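The following PyTorch sketch illustrates this accumulation on a toy linear stack standing in for the transformer (all module names, shapes, and constants here are our own assumptions); it back-propagates one projected, sub-sampled output coefficient per minibatch and keeps an exponential running average of the squared per-layer gradients, in the spirit of (7) and Algorithm 1.

import torch
import torch.nn as nn

E, L = 64, 32
layers = nn.ModuleList([nn.Linear(E, E) for _ in range(4)])   # toy stand-in for transformer blocks

def forward(x):
    for layer in layers:
        x = torch.relu(layer(x))
    return x

proj = torch.linalg.qr(torch.randn(E, E))[0][:, :16]   # stand-in PCA basis U (E x E')
G2 = [0.0] * len(layers)                               # running gradient variances G_n^2
alpha = 0.1

for it in range(8):                                    # a few "minibatches"
    x = torch.randn(L, E)
    out = forward(x)[::4]                              # sub-sample the token dimension
    coeff = it % proj.shape[1]                         # cycle through PCA coefficients
    scalar = (out @ proj[:, coeff]).sum()              # one projected output coefficient
    grads = torch.autograd.grad(scalar, [l.weight for l in layers])
    G2 = [(1 - alpha) * g2 + alpha * g.pow(2).mean().item()
          for g2, g in zip(G2, grads)]

print([round(g2, 6) for g2 in G2])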
3.2. Quantization Step Sizes
Suppose now that the weight matrices Θ_1, . . . , Θ_N need to be assigned bit depths B_1, . . . , B_N (which are not necessarily optimal yet). We now investigate how the quantization step size S_n should be decided given bit depth B_n. In the classic round-to-nearest scheme (RTN, Figure 2, left), S_n is often chosen so that the quantizer's 2^{B_n} steps just cover the full range of original weight values, and S_n is halved as B_n is increased by one. These criteria optimize step sizes when the weights are uniformly distributed and if the objective is to minimize distortion in the quantized weights.

For LLMs, the elements of a weight matrix can typically exhibit a light-tailed distribution p_θ (Gauss, Laplace, etc.) (Zhao et al., 2019), and partitioning the weight range into 2^{B_n} equal steps is sub-optimal, especially at low bit depths (Cover & Thomas, 2006; Gersho & Gray, 1991). A simple alternative to non-uniform quantization using the Lloyd–Max algorithm (Lloyd, 1982; Max, 1960)—which can be computationally expensive—is companded quantization (Gray & Neuhoff, 1998), in which a sigmoid transform is applied to θ prior to uniform quantization to achieve finer quantization in regions of high p_θ and coarser quantization in regions of low p_θ; see Figure 2 (right). When the weights are Laplace-distributed with mean µ and variance S^2, an asymptotically optimal choice of sigmoid function can be shown to be (Appendix C)

\sigma(\theta, \mu, S) = \frac{1 + \operatorname{sgn}(\theta - \mu)}{2} - \frac{\operatorname{sgn}(\theta - \mu)}{2} \exp\!\left( -\frac{\sqrt{2}\, |\theta - \mu|}{3 S} \right),   (8)

that is, the normalized cubic root of a Laplace cumulative distribution function with mean µ and variance S^2. The companded weights θ_σ = σ(θ, µ, S) can then be quantized uniformly in the range (0, 1) and communicated along with the bit depth B, scale S, and mean µ for dequantization using lookup tables. In practice, (µ, S) can be treated as hyper-parameters and fine-tuned efficiently on coarse 1D grids in post-processing (Young et al., 2021) after Algorithm 1 has completed its course.
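A small NumPy sketch of companded quantization under our reading of (8): weights are mapped through the Laplace compander to (0, 1), quantized uniformly, and mapped back through the inverse compander (the paper instead dequantizes via lookup tables, and the helper names here are illustrative).

import numpy as np

def laplace_compand(theta, mu, S):
    """Companding sigmoid of eq. (8) as reconstructed here: maps R -> (0, 1) for a
    Laplace density with mean mu and variance S**2."""
    z = theta - mu
    return (1 + np.sign(z)) / 2 - (np.sign(z) / 2) * np.exp(-np.sqrt(2) * np.abs(z) / (3 * S))

def compand_quantize(theta, B, mu, S):
    """Quantize companded weights uniformly on (0, 1) with 2**B levels and map the
    reconstruction back through the inverse compander."""
    u = laplace_compand(theta, mu, S)
    levels = 2 ** B
    u_q = (np.clip(np.floor(u * levels), 0, levels - 1) + 0.5) / levels
    # Inverse of the compander above, applied branch-wise around u = 1/2.
    z = np.where(u_q >= 0.5,
                 -3 * S / np.sqrt(2) * np.log(2 * (1 - u_q)),
                  3 * S / np.sqrt(2) * np.log(2 * u_q))
    return mu + z

rng = np.random.default_rng(0)
theta = rng.laplace(loc=0.01, scale=0.02 / np.sqrt(2), size=10000)  # variance 0.02**2
theta_q = compand_quantize(theta, B=4, mu=theta.mean(), S=theta.std())
print(np.mean((theta - theta_q) ** 2))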
Figure 2: Companding quantization. Illustrated here for a 4-bit quantizer (16 quantization levels) on Gaussian weights with zero mean and unit variance. Uniform quantization of the entire range of weight values (left) leads to unduly large quantization bins (hence quantization errors) for probable weights. Companding (a sigmoid transformation) of weights to the range (0, 1) prior to quantizing reduces quantization errors for more probable weights (right), which reduces output distortion.

Quantization invariably causes deterministic differences to arise between the original (non-quantized) weights Θ_n and the quantized weights Θ_n^q. While these errors are customarily modeled as zero-mean noise in theoretical analyses, they are seldom zero-mean empirically, leading to a systematic
bias in the model output and reduces prediction accuracy significantly. To compensate for the non-zero mean of the quantization errors, one can update the bias vectors of the model as b_n^q ← b_n + (Θ_n − Θ_n^q) X̄_n every time the weights Θ_n are quantized. Here, X̄_n is a vector of running means of the inputs to the nth layer, which is accumulated during the forward pass in a manner analogous to the accumulation of G_n^2 during the backward pass. The corrected biases b_n^q are then used whenever the corresponding quantized matrices Θ_n^q are used, both during gradient variance accumulation at compression time and during inference at test time.
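A brief NumPy illustration of this bias correction under the layer convention y = Θx + b (our notation); it checks that the corrected bias removes the systematic shift in the mean output.

import numpy as np

def correct_bias(b, theta, theta_q, x_mean):
    """Choose b_q so that the quantized layer matches the original in the mean:
    theta_q @ x_mean + b_q = theta @ x_mean + b.
    Shapes assumed here: theta (out x in), x_mean (in,), b (out,)."""
    return b + (theta - theta_q) @ x_mean

rng = np.random.default_rng(0)
theta = rng.standard_normal((8, 16))
theta_q = theta + 0.02 * rng.standard_normal((8, 16))   # stand-in quantization error
b = rng.standard_normal(8)
x = rng.standard_normal((1000, 16)) + 0.5               # inputs with non-zero mean
b_q = correct_bias(b, theta, theta_q, x.mean(axis=0))
orig = x @ theta.T + b
quant = x @ theta_q.T + b_q
print(np.abs(orig.mean(axis=0) - quant.mean(axis=0)).max())   # ~0 up to float error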
3.3. Grouping Weights
Rather than quantize at the granularity of a whole weight matrix, we can split each matrix into a collection of row or column matrices, assigning an optimal bit depth and step size to each group of weights. In this case, the total number N of matrices in (3) can be reinterpreted as the total number of weight groups collected across the model, and we similarly reinterpret B_n, S_n and P_n as the bit depth, step size and number of elements of the nth weight group. Quantizing at the granularity of these weight groups does not increase the complexity of variance accumulation, as the same squared gradients computed via back-propagation can be averaged per weight group to produce the corresponding per-group gradient variances. To analyze the rate–distortion gain, we assume that each weight group is a column of a matrix.

For a weight matrix with gradient and weight variances G^2 and S^2, whose per-column variances are G_1^2, . . . , G_N^2 and S_1^2, . . . , S_N^2, respectively, the theoretical gain (average bit-depth saving) from weight grouping can be written as

\Delta B_{\text{group}} = \frac{1}{2} \log_2\!\left( G^2 S^2 \right) - \frac{1}{2N} \sum_{n=1}^{N} \log_2\!\left( G_n^2 S_n^2 \right),   (9)

a non-negative quantity owing to Jensen's inequality. This quantity represents the bit-rate (average bit-depth) savings that can be achieved when the nth column is assigned B_n = (1/2) log2(G_n^2 S_n^2) bits plus a common baseline, compared with assigning a uniform bit depth B_n = (1/2) log2(G^2 S^2) bits plus the same baseline across all columns, i.e., when an identical distribution is assumed across the columns. Figure 3 (left) plots the per-matrix bit-depth savings derived by grouping the (Q, K, V, and O) projection matrices of the OPT-125m model by its rows or columns. The sorted per-channel breakdown of the savings is also shown for the V-projection matrix of block 0.
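A small sketch of the gain in (9), using the means of the per-column variances as stand-ins for the whole-matrix G^2 and S^2 (a simplification we make for illustration):

import numpy as np

def grouping_gain_bits(G2_cols, S2_cols):
    """Average bit-depth saving of eq. (9) from assigning per-column bit depths
    rather than one bit depth for the whole matrix."""
    G2_cols, S2_cols = np.asarray(G2_cols), np.asarray(S2_cols)
    whole = 0.5 * np.log2(G2_cols.mean() * S2_cols.mean())
    per_col = 0.5 * np.mean(np.log2(G2_cols * S2_cols))
    return whole - per_col        # non-negative by Jensen's inequality

# Example: per-column sensitivities spread over two orders of magnitude.
rng = np.random.default_rng(0)
G2 = 10.0 ** rng.uniform(-6.0, -4.0, size=768)
S2 = 10.0 ** rng.uniform(-4.0, -3.0, size=768)
print(grouping_gain_bits(G2, S2))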
In addition to the grouping of matrices into columns, one may want to sub-divide each column into a fixed number of sub-groups of weight rows to exploit the row bit savings as well. To sub-group the columns of a weight matrix Θ ∈ R^{N×N}, one can split its rows into K similarly sized groups based on their total row variances G_1^2 S_1^2, . . . , G_N^2 S_N^2. By applying the same grouping to all columns of a matrix, we can signal the grouping index for each row using log2 K bits—a negligible per-weight overhead for the typical number of columns in a large weight matrix and the number of groups K used in practice. We illustrate partitioning and subdivision in Figure 3. In Section 4, we show the accuracy of models quantized with different numbers of row weight groups to demonstrate that the grouping mechanism used in AWQ and GPTQ also improves rate–distortion characteristics.
4. Quantization Experiments
To study the rate–distortion behavior of a typical quantized
LLM, we apply Algorithm 1 to the quantization of the Meta
Open Pretrained Transformer (OPT) (S. Zhang et al., 2022)
and Llama-2 (Touvron et al., 2023) families of language
models (obtained from the HuggingFace Hub), comparing
the performance of the proposed quantization method with
baselines on next token prediction and question answering
tasks. For calibration data, we source 128 examples from
the training split of the C4 dataset (Raffel et al., 2020). We
test on the test splits of WikiText2 (Merity et al., 2022) and
C4 for next token prediction and those of GSM8K (Cobbe
et al., 2021), ARC (Clark et al., 2018), HellaSwag (Zellers
et al., 2019), PIQA (Bisk et al., 2019), and WinoGrande
(Sakaguchi et al., 2021) for question-answering tasks.
Next Token Prediction. As our first set of experiments, we
quantize the Meta OPT and Llama 2 models to 3 and 4 bits
on average and measure the performance of the quantized
models via perplexity, a stringent accuracy metric. We use
a combined row–column group size of 512 for OPT (768 for 125M and 66B) and 256 for Llama 2 models, a batch size of 16, and 17 tokens from each sequence of length 2048, and optimize for a maximum of 64 iterations. Table
1 lists the perplexity (PPL) of our quantized models on the
test split of WikiText2. We select the final quantized model
based on WikiText2 (best validation) but selecting the last
quantized model produces similar test accuracy (within 1%
of the unquantized model perplexity). For comparison, we
include the perplexities of the same models quantized using
round-to-nearest, GPTQ (Frantar et al., 2022), OWQ (Lee
et al., 2024), AWQ (Lin et al., 2024) as well as QuIP (Chee
Figure 3: Bit saving from grouping for OPT-125m. Saving is derived by splitting each weight matrix into a collection of row or column matrices and assigning each sub-matrix its own bit depth. Savings differ across the (Q, K, V and O) projection matrices of the model's 12 transformer blocks (left). Per-row bit savings (right—block 3, O-proj) can dip below zero but are always positive on average owing to Jensen's inequality.
[Plots: per-matrix bit savings versus transformer block index for row and column grouping (left); sorted per-column bit savings for the V projection of block 0, column partition versus average (right).]
et al., 2024) based on the code provided by the respective
authors; see Appendix D for details. Relative to the next
best performing methods, the proposed method provides a
perplexity reduction of up to 4.55 for the 3-bit OPT-125M
model but minor perplexity gains (0.00–0.01) are observed
for the 3-bit OPT-66B and Llama 2 70B models. In this
comparison, AWQ uses a group size of 128, incurring 2–4
times as many overhead bits as the proposed method, and
OWQ, by its nature, operates at average bit depths that are
0.01–0.05 bits per weight higher on average than the proposed method.
Hyperparameters/Ablations. To analyze the influence of
Radio hyperparameters on the accuracy of the quantized
models, we quantize the OPT-1.3B and -13B models over
a range of minibatch sizes and token counts (optimization
hyperparameters) as well as the group size (quantization
hyperparameter), with each hyperparameter varied while
keeping the others fixed at their optimized values. (The
optimal hyperparameter values are batch size: 16, token
count: 17, and group size: 512.) The perplexity of the
quantized models is then measured on the C4 test data.
Table 2 (a–b) demonstrates that Radio is largely insensitive
to the values of optimization hyperparameters over a wide
range. From Table 2 (c), we see that smaller group sizes
generally improve the performance of the quantized
models at lower average bit depths, but this also leads to
Figure 4: Perplexity across optimization iterations. Calibrated
on C4 (train) using a batch size of 16. Row groups of size 512
used. Perplexity decreases rapidly across the first thirty iterations,
after which iterations can be early terminated. Note monotonicity
for C4 (test), whose distribution is more like the calibration data.
[Plots: test perplexity versus iteration for OPT-2.7B (4 bits) and OPT-30B (3 bits), showing C4 and WikiText2 (WT2) curves together with their full-precision (Full) references.]
Table 1: WikiText2 perplexity (test). We quantize the Meta OPT and Llama 2 families of LLMs to 3–4 bits per weight on average via the proposed quantization method (Radio), reporting the perplexity of each quantized model on WikiText2 (test). For comparison, we also include the perplexities of models quantized using RTN, GPTQ, QuIP, OWQ, AWQ and SqueezeLLM.

| WikiText2 PPL (↓) | OPT 125M | OPT 350M | OPT 1.3B | OPT 2.7B | OPT 6.7B | OPT 13B | OPT 30B | OPT 66B | Llama-2 7B | Llama-2 13B | Llama-2 70B |
| Full Precision (FP16) | 27.65 | 22.00 | 14.63 | 12.47 | 10.86 | 10.13 | 9.56 | 9.34 | 5.47 | 4.88 | 3.32 |
| 4 bits: Round-to-nearest (RTN) | 37.28 | 25.94 | 48.17 | 16.92 | 12.10 | 11.32 | 10.98 | 111.36 | 5.73 | 4.98 | 3.46 |
| 4 bits: GPTQ (Frantar et al., 2022) | 32.05 | 23.87 | 15.47 | 12.83 | 11.14 | 10.29 | 9.57 | 9.34 | 6.07 | 5.20 | 3.59 |
| 4 bits: GPTQ/256 (Frantar et al., 2022) | 30.53 | 23.83 | 14.91 | 12.52 | 11.02 | 10.22 | 9.60 | 9.46 | 5.70 | 5.02 | 3.44 |
| 4 bits: QuIP (Chee et al., 2024) | 35.93 | 23.15 | 15.96 | 12.67 | 11.10 | 10.33 | 9.60 | 9.40 | 5.69 | 5.06 | 3.46 |
| 4 bits: OWQ (4.01 bits) (Lee et al., 2024) | 29.47 | 23.19 | 15.01 | 12.39 | 10.87 | 10.26 | 9.50 | 9.25 | 5.63 | 5.01 | 3.43 |
| 4 bits: AWQ/128 (Lin et al., 2024) | 29.11 | — | 14.95 | 12.74 | 10.93 | 10.22 | 9.59 | 9.39 | 5.60 | 4.97 | 3.41 |
| 4 bits: OmniQuant/128 (Shao et al., 2024) | 28.86 | — | 14.88 | 12.65 | 10.96 | 10.20 | 9.62 | 9.37 | 5.58 | 4.95 | 3.40 |
| 4 bits: SqueezeLLM (Kim et al., 2024) | — | — | 14.94 | 12.80 | 11.03 | 10.24 | 9.65 | — | 5.62 | 4.99 | 3.41 |
| 4 bits: Radio (4.0000 bits) (Ours) | 27.23 | 22.89 | 14.20 | 12.12 | 10.52 | 10.08 | 9.45 | 9.13 | 5.57 | 4.97 | 3.40 |
| 3 bits: RTN | 1.3e3 | 64.57 | 119.47 | 298.00 | 23.54 | 46.04 | 18.80 | 6.1e3 | 6.66 | 5.52 | 3.98 |
| 3 bits: GPTQ | 53.43 | 32.28 | 20.90 | 16.55 | 12.88 | 11.58 | 10.29 | 9.90 | 9.23 | 6.69 | 3.87 |
| 3 bits: GPTQ/256 | 41.22 | 29.96 | 16.98 | 13.94 | 11.39 | 10.41 | 9.81 | 11.13 | 6.75 | 5.59 | 4.00 |
| 3 bits: QuIP (Chee et al., 2024) | 34.43 | 26.02 | 17.33 | 13.84 | 12.35 | 10.57 | 9.92 | 9.46 | 6.36 | 5.50 | 3.86 |
| 3 bits: OWQ (3.01 bits) (Lee et al., 2024) | 35.26 | 26.59 | 16.40 | 13.21 | 11.21 | 11.48 | 9.59 | 9.28 | 6.21 | 5.36 | 3.77 |
| 3 bits: AWQ/128 | 36.77 | — | 16.32 | 13.54 | 11.41 | 10.67 | 9.85 | 9.63 | 6.24 | 5.32 | 3.74 |
| 3 bits: OmniQuant/128 (Shao et al., 2024) | 32.25 | — | 15.71 | 13.18 | 11.27 | 10.47 | 9.79 | 9.53 | 6.03 | 5.28 | 3.78 |
| 3 bits: SqueezeLLM (Kim et al., 2024) | — | — | 16.30 | 13.85 | 11.70 | 11.76 | 10.17 | — | 6.18 | 5.36 | 3.77 |
| 3 bits: Radio (3.0000 bits) (Ours) | 30.71 | 25.94 | 14.83 | 12.42 | 11.07 | 10.28 | 9.56 | 9.24 | 6.04 | 5.25 | 3.72 |
Table 2: Effect of hyperparameters on quantized model accuracy. Quantized model accuracy is relatively insensitive to the minibatch size (a) and number of tokens per sequence (b) used for the optimization. Smaller groups can improve quantized model accuracy at low average bit depths (c).

(a) Minibatch size and PPL
| C4 PPL (↓) | OPT-1.3B (4 bits) | OPT-13B (4 bits) | OPT-1.3B (3 bits) | OPT-13B (3 bits) |
| FP16 | 16.07 | 12.06 | 16.07 | 12.06 |
| Batch size 2 | 16.24 | 12.12 | 16.94 | 12.36 |
| Batch size 4 | 16.24 | 12.12 | 16.94 | 12.35 |
| Batch size 8 | 16.25 | 12.11 | 16.90 | 12.34 |
| Batch size 16 | 16.22 | 12.11 | 16.86 | 12.32 |
| Batch size 32 | 16.24 | 12.12 | 16.88 | 12.36 |

(b) Number of tokens and PPL
| C4 PPL (↓) | OPT-1.3B (4 bits) | OPT-13B (4 bits) | OPT-1.3B (3 bits) | OPT-13B (3 bits) |
| FP16 | 16.07 | 12.06 | 16.07 | 12.06 |
| 3 tokens | 16.40 | 12.29 | 17.05 | 12.47 |
| 5 tokens | 16.28 | 12.18 | 16.93 | 12.37 |
| 9 tokens | 16.24 | 12.12 | 16.91 | 12.35 |
| 17 tokens | 16.22 | 12.11 | 16.86 | 12.32 |
| 33 tokens | 16.21 | 12.10 | 16.87 | 12.34 |

(c) Group size and PPL
| C4 PPL (↓) | OPT-1.3B (4 bits) | OPT-13B (4 bits) | OPT-1.3B (3 bits) | OPT-13B (3 bits) |
| FP16 | 16.07 | 12.06 | 16.07 | 12.06 |
| Group size 64 | 16.16 | 12.10 | 16.62 | 12.26 |
| Group size 128 | 16.17 | 12.10 | 16.70 | 12.29 |
| Group size 256 | 16.20 | 12.10 | 16.77 | 12.32 |
| Group size 512 | 16.22 | 12.11 | 16.86 | 12.32 |
| Group size 1024 | 16.23 | 12.11 | 16.99 | 12.42 |
higher overheads (discussed later). Figure 4 plots quantized
model accuracy across optimization iterations in the case
where baseline hyperparameter values are used. It appears
that about 20 iterations are needed for quantization
parameters (bit depth decisions) to reach their optima.
Table 3 (a) shows ablations of our quantized OPT models
by starting with RTN and adding different components
(Jeon et al., 2023). See Table 5 for C4 perplexity results.
Pruning Due to Quantization. Our method quantizes very
low-variance weights of weight matrices to zero and effects
a form of weight pruning, which has been demonstrated to
improve test accuracy (Hassibi & Stork, 1992). Table 3 (b)
lists the percentages of zero-quantized weights in the OPT-
1.3B and 13B models quantized to 3 and 4 bits per weight
on average. We observe that using smaller grouping sizes
increases the number of pruned weights since this enables
lower-variance weights in each column to be grouped
together and quantized to zero. However, smaller groups
lead to higher overheads so that small improvements in
generalization due to pruning come at the cost of signaling
the overhead bits. Table 3 (c) lists the number of overhead
bits (group indices and FP16 encodings of the location and
scale parameters of each weight group) as a percentage of
the total quantized weight bits. These overheads are in line
with those of other algorithms which must similarly signal
zero points and step sizes of the quantization grid (Lee et
al., 2024).
2.x-bit Llama-2. We study the accuracy of Llama 2 models
quantized to 2.x bits using Radio and OWQ, both of which
are capable of quantizing models to fractional average bit
depths. To allow a more comprehensive study, we compare
against OWQ with no grouping, as well as with group sizes
of 128 and 256. We can see from Table 4 (a) that Radio-
quantized Llama-2 models are considerably more accurate
at these bit depths than their OWQ counterparts. This is
expected since Radio assigns bit depths from the range
(0, B_max) commensurately with the gradient variances, whereas
OWQ opts to preserve the most sensitive weights in FP16
and quantize the rest to 2 bits (Lee et al., 2024). In terms of
execution time, Radio (for 64 iterations) and OWQ/GPTQ
require 47 minutes and 18 minutes, respectively (excluding
testing), to quantize the 7B model on an Nvidia A100.
Downstream Tasks (Common Sense QA, GSM8K). To
show the impact of model quantization on downstream
tasks, we list in Table 4 (b–c) the accuracy of our quantized
Table 4: 2.x-bit quantization and downstream tasks. Quantized to 2.x bits per weight, Radio reduces perplexity considerably compared with OWQ models quantized to the same average bit depth (a). Quantized model accuracy is measured by performance on tasks such as GSM8K (b). A group size of 256 is used.

(a) Perplexity of 2.1–2.8-bit Llama 2 7B models
| WikiText2 PPL (↓), Llama 2 7B | 2.1 bits | 2.2 bits | 2.4 bits | 2.6 bits | 2.8 bits |
| FP16 (Full Precision) | 5.47 | 5.47 | 5.47 | 5.47 | 5.47 |
| OWQ (Lee et al., 2024) | 39.56 | 11.25 | 10.79 | 10.43 | 10.24 |
| OWQ/256 | 10.34 | 10.01 | 9.98 | 9.50 | 9.26 |
| OWQ/128 | 10.01 | 9.66 | 9.42 | 9.38 | 9.14 |
| Radio/256 (Ours) | 9.47 | 8.39 | 7.05 | 6.56 | 6.21 |

(b) Scores for 3-bit Llama-2 models on GSM8K and QA (average)
| Score (%) | GSM8K 7B | GSM8K 13B | GSM8K 70B | Average QA 7B | Average QA 13B | Average QA 70B |
| FP16 (Full Precision) | 14.10 | 23.43 | 53.90 | 64.83 | 67.82 | 72.36 |
| RTN (Round-To-Nearest) | 1.82 | 1.67 | 6.14 | 39.32 | 52.15 | 58.22 |
| GPTQ/256 (Frantar et al., 2022) | 6.60 | 14.48 | 46.47 | 61.40 | 64.94 | 70.58 |
| AWQ/256 (Lin et al., 2024) | 6.97 | 16.76 | 48.07 | 62.48 | 65.95 | 71.29 |
| Radio/256 (3.0000 bits) (Ours) | 7.81 | 18.20 | 49.81 | 62.82 | 66.37 | 71.87 |

(c) Scores for 3-bit Llama-2 models on common sense QA (ARC-Challenge, ARC-Easy, HellaSwag, PIQA, WinoGrande)
| Score (%) | ARC-C 7B | ARC-C 13B | ARC-C 70B | ARC-E 7B | ARC-E 13B | ARC-E 70B | HellaSwag 7B | HellaSwag 13B | HellaSwag 70B | PIQA 7B | PIQA 13B | PIQA 70B | WinoGrande 7B | WinoGrande 13B | WinoGrande 70B |
| FP16 (Full Precision) | 43.34 | 48.38 | 54.27 | 76.30 | 79.42 | 82.74 | 57.13 | 60.04 | 64.76 | 78.07 | 79.05 | 82.15 | 69.30 | 72.22 | 77.90 |
| RTN (Round-To-Nearest) | 20.73 | 30.63 | 37.20 | 34.97 | 60.65 | 65.66 | 31.09 | 43.73 | 51.01 | 57.34 | 70.40 | 73.12 | 52.49 | 55.33 | 64.09 |
| GPTQ/256 (Frantar et al., 2022) | 38.23 | 43.34 | 52.13 | 72.26 | 76.64 | 80.85 | 53.02 | 57.65 | 62.60 | 75.63 | 77.48 | 80.52 | 67.88 | 69.61 | 76.80 |
| QuIP/256 (Chee et al., 2024) | 38.74 | 45.39 | 52.30 | 72.77 | 76.77 | 80.68 | 54.08 | — | 63.04 | 77.09 | 78.78 | 81.12 | 66.85 | 70.40 | 76.72 |
| AWQ/256 (Lin et al., 2024) | 41.13 | 45.05 | 53.24 | 73.36 | 77.95 | 81.69 | 54.06 | 57.83 | 63.64 | 75.84 | 77.26 | 81.66 | 68.03 | 71.67 | 76.24 |
| Radio/256 (3.0000 bits) (Ours) | 41.21 | 45.73 | 53.84 | 72.60 | 77.95 | 82.32 | 53.95 | 58.55 | 63.86 | 77.20 | 78.51 | 81.88 | 69.14 | 71.11 | 77.43 |
Table 3: Ablations and pruning effects of quantization. A small fraction of weights is quantized to zero and pruned away due to low variance, with smaller groups increasing the degree of pruning (b). Quantization incurs overhead bits for signaling group indices and the location and scale parameters of groups (c).

(a) Ablations (mixed precision and step sizes)
| C4 PPL (↓) | OPT-1.3B (4 bits) | OPT-13B (4 bits) | OPT-1.3B (3 bits) | OPT-13B (3 bits) |
| RTN (Round-To-Nearest) | 24.51 | 13.36 | 4.2e3 | 3.2e3 |
| + MMSE Step Sizes | 16.98 | 12.26 | 21.64 | 13.34 |
| + Mixed Precision Depths | 16.29 | 12.12 | 18.48 | 12.73 |
| + Companding | 16.22 | 12.11 | 16.86 | 12.32 |
| = Radio (Ours) | 16.22 | 12.11 | 16.86 | 12.32 |

(b) Pruned weights (%), Meta OPT (4 bits)
| Group size | 350M | 1.3B | 13B |
| 64 | 0.57 | 2.13 | 2.18 |
| 128 | 0.61 | 2.19 | 2.31 |
| 256 | 0.67 | 2.10 | 2.16 |
| 512 | 0.68 | 2.07 | 2.00 |
| 1024 | 0.68 | 2.08 | 1.92 |

(c) Overhead bits (%), Meta OPT (4 bits)
| Group size | 350M | 1.3B | 13B | 30B |
| 64 | 10.33 | 10.30 | 10.28 | 13.70 |
| 128 | 5.18 | 5.16 | 5.15 | 6.86 |
| 256 | 2.60 | 2.59 | 2.58 | 3.44 |
| 512 | 1.30 | 1.30 | 1.30 | 1.72 |
| 1024 | 0.64 | 0.65 | 0.65 | 0.86 |
Llama-2 models on the ARC-C, -E (Clark et al., 2018),
HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2019)
and Winogrande (Sakaguchi et al., 2021) common sense
question answering, and GSM8K (Cobbe et al., 2021) math
problem solving tasks. We set our group size and the group
size of GPTQ and AWQ to 256. We observe that Radio
produces slightly higher scores than the GPTQ and AWQ
quantized 3-bit models while RTN leads to severely
diminished scores despite having similar perplexity scores
as Radio on WikiText2 (Table 1). We include example
responses to GSM8K questions produced by different 3-bit
quantized Llama-2-70B models in Appendix E.
Timing results. To show the running time behavior of the
proposed quantization method, Table 6 lists running times
of Radio on Llama 2 models of different sizes, measured
on Nvidia A100. It can be seen that Radio’s running time
grows close to linearly with model size. Table 7 lists the
acceleration achieved by our custom quantized matrix-
vector multiply kernel, with acceleration ranging between
1.4× and 3.3× depending on the embedding dimensionality.
5. Discussion
Formulating weight quantization as a convex optimization
problem as we have done here yields two benefits. First, it
explicates the objective we seek to optimize (minimizing
output distortion in this case) and sets us on a path to solve
the right problem using modern automatic differentiation
tools, e.g., PyTorch's autograd library. Second, it enables us to interpret many earlier Hessian-based methods (Frantar et al., 2022; Lee et al., 2024; Dong et al., 2019; Chen et al., 2021) as heuristics for approximate optimization of the true underlying quantization objective. Note that (4) is a non-linear system of equations in the bit-depth variables, so that any non-iterative solution is necessarily only an approximate one if one's goal is to optimize an objective like (3). Recent
high-performance model quantization techniques (Chee et
al., 2024; Frantar et al., 2022; Frantar & Alistarh, 2022; Lee
et al., 2024) ultimately trace their lineage back to the
classic Optimal Brain Surgeon (OBS) algorithm (Hassibi
& Stork, 1992), which is a convex formulation for weight
pruning, not quantization (Appendix F). As a result, these
methods inherit the need for fine-tuning weights as part of
the quantization process, making them less suitable for the
quantization of activations at inference time, where fine-
tuning would lead to unacceptable delays in the inference
pipeline.
Table 5: C4 perplexity (validation). We quantize the Meta OPT and Llama 2 families of LLMs to 3–4 bits per weight on average using the proposed quantization method, reporting the perplexity of each quantized model on the C4 dataset. For comparison, we also include the perplexities of models quantized using RTN, GPTQ, QuIP, OWQ, and AWQ.

| C4 PPL (↓) | OPT 125M | OPT 350M | OPT 1.3B | OPT 2.7B | OPT 6.7B | OPT 13B | OPT 30B | OPT 66B | Llama-2 7B | Llama-2 13B | Llama-2 70B |
| Full Precision (FP16) | 26.56 | 22.59 | 16.07 | 14.34 | 12.71 | 12.06 | 11.44 | 10.99 | 6.97 | 6.46 | 5.52 |
| 4 bits: Round-to-nearest (RTN) | 33.91 | 16.21 | 24.51 | 18.43 | 14.36 | 13.36 | 13.46 | 283.31 | 7.86 | 7.16 | 6.01 |
| 4 bits: GPTQ (Frantar et al., 2022) | 29.42 | 24.14 | 16.73 | 14.85 | 12.99 | 12.24 | 11.56 | 11.08 | 7.86 | 7.06 | 5.90 |
| 4 bits: GPTQ/256 (Frantar et al., 2022) | 28.36 | 24.18 | 16.47 | 14.64 | 12.88 | 12.15 | 11.50 | 11.12 | 7.58 | 6.88 | 5.79 |
| 4 bits: QuIP (Chee et al., 2024) | 27.85 | 23.39 | 17.20 | 14.58 | 12.87 | 12.17 | 11.51 | 11.03 | 7.57 | 6.91 | 5.80 |
| 4 bits: OWQ (4.01 bits) (Lee et al., 2024) | 27.93 | 23.37 | 16.49 | 14.60 | 12.83 | 12.17 | 11.49 | 11.02 | 7.59 | 6.94 | 5.81 |
| 4 bits: AWQ/128 (Lin et al., 2024) | 27.79 | — | 16.42 | 14.58 | 12.84 | 12.15 | 11.50 | 11.04 | 7.44 | 6.84 | 5.77 |
| 4 bits: OmniQuant/128 (Shao et al., 2024) | 27.59 | — | 16.38 | 14.56 | 12.82 | 12.16 | 11.50 | 11.04 | 7.12 | 6.56 | 5.58 |
| 4 bits: SqueezeLLM (Kim et al., 2024) | — | — | 16.36 | 14.55 | 12.82 | 12.15 | 11.50 | — | 7.12 | 6.57 | 5.58 |
| 4 bits: Radio (4.0000 bits) (Ours) | 27.27 | 23.20 | 16.24 | 14.44 | 12.79 | 12.11 | 11.48 | 11.01 | 7.40 | 6.83 | 5.76 |
| 3 bits: RTN | 839.97 | 55.96 | 4.2e3 | 1.1e4 | 4.4e3 | 3.2e3 | 1.1e3 | 3.5e3 | 521.22 | 14.01 | 11.06 |
| 3 bits: GPTQ | 42.64 | 29.90 | 20.46 | 17.48 | 14.56 | 13.16 | 12.14 | 11.53 | 11.44 | 8.98 | 7.12 |
| 3 bits: GPTQ/256 (Frantar et al., 2022) | 35.00 | 28.84 | 18.07 | 15.84 | 13.50 | 12.57 | 11.78 | 12.29 | 8.92 | 7.65 | 6.21 |
| 3 bits: QuIP (Chee et al., 2024) | 31.37 | 25.58 | 18.15 | 15.92 | 13.66 | 12.40 | 11.67 | 11.16 | 8.48 | 7.49 | 6.10 |
| 3 bits: OWQ (3.01 bits) (Lee et al., 2024) | 31.28 | 26.40 | 17.69 | 15.36 | 13.23 | 13.29 | 11.69 | 11.17 | 8.59 | 7.65 | 6.16 |
| 3 bits: AWQ/128 | 32.91 | — | 17.81 | 15.49 | 13.34 | 12.55 | 11.75 | 11.26 | 8.30 | 7.31 | 6.04 |
| 3 bits: OmniQuant/128 (Shao et al., 2024) | 31.30 | — | 17.46 | 15.33 | 13.28 | 12.50 | 11.73 | 11.22 | 7.75 | 6.98 | 5.85 |
| 3 bits: SqueezeLLM (Kim et al., 2024) | — | — | 17.19 | 15.62 | 13.41 | 13.55 | 11.85 | — | 7.72 | 6.97 | 5.73 |
| 3 bits: Radio (3.0000 bits) (Ours) | 30.05 | 26.20 | 16.88 | 14.91 | 13.14 | 12.35 | 11.62 | 11.19 | 8.04 | 7.22 | 5.99 |
Table 6: Radio running times. We quantize the Meta Llama 2 family of LLMs to ~3 bits per weight on average and measure the running time of the proposed method. We also include the running times of GPTQ, QuIP, OWQ, AWQ, and SqueezeLLM. Runtimes use 128 examples from C4 (sequence length 2048).

| Quantization runtime | Llama-2 7B | Llama-2 13B | Llama-2 70B |
| Full Precision (FP16) | 0m | 0m | 0m |
| 3 bits: RTN | <1m | <1m | 5m |
| 3 bits: GPTQ/256 (Frantar et al., 2022) | 10m | 18m | 90m |
| 3 bits: QuIP (Chee et al., 2024) | 36m | 80m | 9h |
| 3 bits: OWQ (3.01 bits) (Lee et al., 2024) | 28m | 50m | 4h |
| 3 bits: AWQ/128 | 7m | 17m | 87m |
| 3 bits: OmniQuant/128 (Shao et al., 2024) | 56m | 132m | 15h |
| 3 bits: SqueezeLLM (Kim et al., 2024) | 11m | 23m | 92m |
| 3 bits: Radio (3.0000 bits) (Ours) | 47m | 97m | 11h |
Our experimental results demonstrate that a more accurate characterization of the LLM quantization problem can lead
to better compression outcomes. While the smaller OPT-
125M model is too limited for practical usefulness in most
situations, its relative incompressibility helps contrast the
performance of the different weight quantization methods
themselves (Table 1). With larger models like OPT-66B
and Llama 2-66B, most quantization techniques (including
RTN) perform similarly, suggesting that larger language
models are more compressible generally. At first glance, a
RTN may seem sufficient for quantizing larger models, but
upon a closer look, RTN-quantized models lead to severely
reduced accuracy on downstream tasks such as GSM8K
(Table 4 (a)), which highlights the importance of validating
the accuracy of quantized models across multiple tasks and
datasets (Jaiswal et al., 2024). Interestingly, increasing the
number of calibration examples (from 128 to 1024) did not
noticeably affect the quantized model’s perplexity on C4
0.01), which agrees with findings from previous works; see
(Kim et al., 2024; Hubara et al., 2021). We will discuss the
quantization of activations and more advanced methods for
model quantization in the sequel.
Finally, our CUDA matmul kernel (Appendix A) provides
acceleration for matrix-vector multiplies by dequantizing
mixed precision weights to a floating-point representation
(FP16) dynamically then multiplying them by activations
of the same representation. For the 12288 × 49152 weight
matrix of Meta’s OPT-175B quantized to 3 bits per weight
on average, our custom CUDA kernel leads to a 3.8x speed
up over the FP16 matrix-vector multiply performed using
the default cuBLAS matmul on Nvidia A6000. Accelerated
matrix-vector multiplies and the low complexity of our quantization approach allow us to apply Radio also to
the quantization of batched activations, where quantization
efficiency becomes paramount.
6. Conclusion
Here, we showed that a rate–distortion framework can lead to better LLM quantization outcomes. Despite numerous advances in methods for LLM compression, there has not been an extensive study of the rate–distortion-theoretic aspects of model quantization and of the optimization techniques required to solve the resulting optimization problem. This work fills the gap in the current literature by introducing the rate–distortion framework and a stochastic numerical solver for the rate–distortion optimization problem.
Reproducibility Statement
To ensure the reproducibility of results in this work, we
make our PyTorch Radio program available on our GitHub
project website, where readers can also ask questions about
this work. Appendix A lists our CUDA kernel. Appendices
B–C provide derivations for our main theoretical results
and Appendix D additionally details the PyTorch code and
command line options used to obtain the results of GPTQ
(Frantar et al., 2022), OWQ (Lee et al., 2024), and AWQ
(Lin et al., 2024).
Impact Statement
This paper aims to advance the theory of rate and distortion
for LLM compression. There are many potential societal
consequences of our work, none of which we feel needs to be
specifically highlighted here.
References
Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian
Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan
Alistarh, Torsten Hoefler, and James Hensman. QuaRot:
Outlier-Free 4-Bit Inference in Rotated LLMs. In Proc.
NeurIPS, 2024.
Mart van Baalen, Andrey Kuzmin, Markus Nagel et al.
GPTVQ: The blessing of dimensionality for LLM
quantization.
https://doi.org/10.48550/arXiv.2402.15319, 2024.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and
Jonathan Eckstein. Distributed optimization and
statistical learning via the alternating direction method
of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher
De Sa. QuIP: 2-bit quantization of large language
models with guarantees. In Proc. NeurIPS, 2023.
Weihan Chen, Peisong Wang, and Jian Cheng. Towards
mixed-precision quantization of neural networks via
constrained optimization. In Proc ICCV, 2021.
Kanghyun Choi, Deokki Hong, Noseong Park, Youngsok
Kim, and Jinho Lee. Qimera: Data-free quantization
with synthetic boundary supporting samples. In Proc.
NeurIPS, 2021.
Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee.
Towards the limits of network quantization. In Proc.
ICLR, 2017.
Table 7: Acceleration due to quantized matrix–vector multiplies relative to multiplication in FP16. A 3-bit weight matrix of dimension N × M multiplies a vector of length M to produce a vector of length N (denoted M → N).

| Model (Embedding), 3 bits | E → E | E → 4E | 4E → E | Overall |
| OPT-1.3B (E = 1024) | 0.9 | 2.1 | 2.7 | 1.4 |
| OPT-1.3B (E = 2048) | 2.7 | 2.3 | 2.5 | 2.7 |
| OPT-6.7B (E = 4096) | 2.4 | 3.1 | 3.1 | 2.8 |
| OPT-30B (E = 7168) | 3.2 | 3.2 | 3.1 | 3.2 |
| OPT-66B (E = 9216) | 3.2 | 3.1 | 3.2 | 3.2 |
| OPT-175B (E = 12288) | 3.2 | 3.2 | 3.8 | 3.3 |

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. Think you have Solved Question Answering?
Try ARC, the AI2 Reasoning Challenge.
http://arxiv.org/abs/1803.05457, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian et al.
Training verifiers to solve math word problems.
http://arxiv.org/abs/2110.14168, 2021.
Thomas M. Cover, and Joy A. Thomas. Elements of
Information Theory (Wiley Series in
Telecommunications and Signal Processing). Wiley-
Interscience, USA 2006.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke
Zettlemoyer. LLM.int8(): 8-bit matrix multiplication
for transformers at scale. In Proc. NeurIPS, 2022.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian et al.
SpQR: A Sparse-Quantized Representation for near-
lossless LLM weight compression.
http://arxiv.org/abs/2306.03078, 2023.
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W.
Mahoney, and Kurt Keutzer. HAWQ: Hessian AWare
Quantization of neural networks with mixed precision.
In Proc. ICCV, 2019.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev,
Elias Frantar, Artem Babenko, and Dan Alistarh.
Extreme compression of large language models via
additive quantization. In Proc. ICML, 2024.
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani,
Rathinakumar Appuswamy, and Dharmendra S. Modha.
Learned step size quantization. In Proc. ICLR, 2019.
Elias Frantar, and Dan Alistarh. Optimal Brain
Compression: A framework for accurate post-training
quantization and pruning. In Proc. NeurIPS, 2022.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan
Alistarh. OPTQ: Accurate quantization for generative
pre-trained transformers. In Proc. ICLR, 2022.
Allen Gersho, and Robert M. Gray. Vector Quantization
and Signal Compression. Kluwer, Norwell, MA, USA
1991.
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev.
Compressing deep convolutional networks using vector
quantization. In Proc. ICLR, 2015.
R.M. Gray, and D.L. Neuhoff. Quantization. IEEE Trans.
Inf. Theory, 44(6):2325–2383, 1998.
Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai
Wong, and Hao Yu. APTQ: Attention-aware post-
training mixed-precision quantization for large
language models. In Proc. DAC, 2024.
Babak Hassibi, and David Stork. Second order derivatives
for network pruning: Optimal Brain Surgeon. In Proc.
NIPS, 1992.
Lu Hou, and James T. Kwok. Loss-aware weight
quantization of deep networks. In Proc. ICLR, 2018.
Wei Huang, Haotong Qin, Yangdong Liu et al. SliM-LLM:
Salience-driven mixed-precision quantization for large
language models, https://arxiv.org/abs/2405.14917v1,
2024.
Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and
Daniel Soudry. Accurate post training quantization with
small calibration sets. In Proc. ICML, 2021.
Benoit Jacob, Skirmantas Kligys, Bo Chen et al.
Quantization and training of neural networks for
efficient integer-arithmetic-only inference. In Proc.
CVPR, 2018.
Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang,
Zhangyang Wang, Yinfei Yang. Compressing LLMs: the
truth is rarely pure and never simple. In Proc. ICLR, 2024.
Yongkweon Jeon, Chungman Lee, Kyungphil Park, and Ho-young Kim. A Frustratingly Easy Post-Training Quantization Scheme for LLMs. In Proc. EMNLP, 2023.
Sehoon Kim et al. SqueezeLLM: Dense-and-sparse
quantization. In Proc. ICML, 2024.
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim,
and Eunhyeok Park. OWQ: Outlier-aware weight
quantization for efficient fine-tuning and inference of
large language models. In Proc. AAAI, 2024.
Ji Lin, Jiaming Tang, Haotian Tang et al. AWQ: Activation-
aware Weight Quantization for on-device LLM
compression and acceleration. In Proc. MLSys, 2024.
S. Lloyd. Least squares quantization in PCM. IEEE Trans.
Inf. Theory, 28(2):129–137, 1982.
J. Max. Quantizing for minimum distortion. IRE Trans. Inf.
Theory, 6(1):7–12, 1960.
Stephen Merity, Caiming Xiong, James Bradbury, and
Richard Socher. Pointer sentinel mixture models. In
Proc. ICLR, 2022.
Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and
Max Welling. Data-free quantization through weight
equalization and bias correction. In Proc. CVPR, 2019.
Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M. Bronstein, and Avi Mendelson. Loss aware post-training quantization. Mach. Learn., 110:3245–3262, 2020.
Jorge Nocedal, and Stephen J. Wright. Numerical
Optimization. Springer, New York, NY, USA 2009.
Biao Qian, Yang Wang, Richang Hong, and Meng Wang.
Adaptive data-free quantization. In Proc. CVPR, 2023.
Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele.
Adaptive loss-aware quantization for multi-bit networks.
In Proc. CVPR, 2020.
Colin Raffel, Noam Shazeer, Adam Roberts et al.
Exploring the limits of transfer learning with a unified
text-to-text transformer. J. Mach. Learn. Res.,
21(140):1–67, 2020.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula,
and Yejin Choi. WinoGrande: an adversarial Winograd
schema challenge at scale. Commun. ACM, 64(9):99–106, 2021.
Wenqi Shao et al. OmniQuant: Omnidirectionally
calibrated quantization for large language models. In
Proc. ICLR, 2024.
Hugo Touvron, Louis Martin, Kevin Stone et al. Llama 2:
Open foundation and fine-tuned chat models.
http://arxiv.org/abs/2307.09288, 2023.
Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao.
Improving the speed of neural networks on CPUs. In
Proc. NIPS Workshops, 2011.
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han.
HAQ: Hardware-aware automated quantization with
mixed precision. In Proc. CVPR, 2019.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien
Demouth, and Song Han. SmoothQuant: Accurate and
efficient post-training quantization for large language
models. In Proc. ICML, 2023.
Shoukai Xu, Haokun Li, Bohan Zhuang, Jing Liu, Jiezhang
Cao, Chuangrun Liang, and Mingkui Tan. Generative
low-bitwidth data free quantization. In Proc. ECCV,
2020.
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang,
Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant:
Efficient and affordable post-training quantization for
large-scale transformers. In Proc. NeurIPS, 2022.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical
commonsense in natural language.
http://arxiv.org/abs/1911.11641, 2019.
Sean I. Young, Wang Zhe, David Taubman, and Bernd
Girod. Transform quantization for CNN compression.
IEEE Trans. Pattern Anal. Mach. Intell., 44(9):5700–5714, 2019.
Zhihang Yuan, Yuzhang Shang, Yang Zhou et al. LLM
inference unveiled: Survey and roofline model insights.
https://arxiv.org/abs/2402.16363, 2024.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi,
and Yejin Choi. HellaSwag: Can a Machine Really
Finish Your Sentence? http://arxiv.org/abs/1905.07830,
2019.
Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and
Gang Hua. LQ-Nets: Learned quantization for highly
accurate and compact deep neural networks. In Proc.
ECCV, 2018.
Susan Zhang, Stephen Roller, Naman Goyal et al. OPT:
Open Pre-trained Transformer Language Models.
http://arxiv.org/abs/2205.01068, 2022.
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and
Zhiru Zhang. Improving neural network quantization
without retraining using outlier channel splitting. In
Proc. ICML, 2019.
Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong
Chen. Incremental Network Quantization: towards
lossless CNNs with low-precision weights. In Proc.
ICLR, 2017.
A. Radio Kernel for Matrix-Vector Multiplication
For completeness, we provide here a reference implementation of multiplication between a mixed-precision quantized matrix and a full-precision vector. Here, we assign a single bit depth to each group of 4 rows, leading to, e.g., 12288 different bit depths in the case of the 49152 × 12288 weight matrix (MLP layer) of the OPT-175B model. Consider a thread block size of 256 × 256. Each 256 × 256 block, in turn, entails 1 × 256 threads, with each thread dequantizing a 256 × 1 column of weights and multiplying it with the matching 1 × 256 segment of the input vector. The bit depth changes every 4 rows, but every thread goes through the same bit-depth changes in the course of the multiplication, allowing divergence-free operation (and uniform memory accesses).
__constant__ float lutable[256] = { DEQUANT }; // dequantized values defined in macros.h

template <typename scalar_t>
__global__ void VecQuant3MatMulKernel(
    const scalar_t* __restrict__ vec,
    const int* __restrict__ mat,
    scalar_t* __restrict__ mul,
    const uint8_t* __restrict__ depths,
    const scalar_t* __restrict__ scales,
    const int* __restrict__ i_s,
    const uint8_t* __restrict__ shifts,
    int height,
    int width) {
  int row = BLOCKHEIGHT * blockIdx.x;
  int col = BLOCKWIDTH * blockIdx.y + threadIdx.x;

  // stage the input-vector segment (with per-group scales folded in) and the
  // dequantization tables in shared memory
  __shared__ scalar_t blockvec[BLOCKWIDTH];
  __shared__ scalar_t lut[BLOCKWIDTH];
  blockvec[threadIdx.x] =
      scales[threadIdx.x / 4] * vec[(row / BLOCKHEIGHT) * BLOCKWIDTH + threadIdx.x];
  lut[threadIdx.x] = lutable[threadIdx.x];
  __syncthreads();

  scalar_t res = 0;
  int i = i_s[blockIdx.x * gridDim.y + blockIdx.y] + threadIdx.x; // start of this thread's packed column
  // int i = width * row + col;
  int shift = shifts[blockIdx.x * gridDim.y + blockIdx.y]; // initial bit offset within the packed word
  uint64_t tmp_curr; // 64-bit sliding window of packed weight bits
  uint32_t tmp_read;
  uint32_t depth_;
  int j = 0, k = 0;

  tmp_read = reinterpret_cast<const uint32_t*>(mat)[i];
  tmp_curr = static_cast<uint64_t>(tmp_read) << 32;
  shift += 32;
  i += width;

  while (k < BLOCKWIDTH) {
    depth_ = reinterpret_cast<const uint32_t*>(depths)[j];
    int depth, bmask;
    uint32_t index;
    scalar_t szero, *table;
    for (int d = 0; d < 32; d += 8) { // for each of the 4 depth groups (represented in 8 bits)
      depth = (depth_ >> (d + 0)) & 7;                                      // bits 0-2: bit depth of this group of 4 rows
      bmask = (1 << depth) - 1;
      szero = (static_cast<int>((depth_ >> (d + 3)) & 31) - 16) * 0.03125f; // bits 3-7: zero offset in steps of 1/32
      table = reinterpret_cast<scalar_t*>(lut + (1 << depth));              // table for this depth (assumed to start at offset 1 << depth)
      if (shift + 4 * depth > 64) { // will run out of bits, read more
        tmp_read = reinterpret_cast<const uint32_t*>(mat)[i];
        tmp_curr = static_cast<uint64_t>(tmp_read) << 32 | static_cast<uint64_t>(tmp_curr) >> 32;
        shift -= 32;
        i += width;
      }
      index = (static_cast<uint32_t>(tmp_curr >> shift) & bmask);
      res += blockvec[k + 0] * (szero + table[index]);
      shift += depth;
      index = (static_cast<uint32_t>(tmp_curr >> shift) & bmask);
      res += blockvec[k + 1] * (szero + table[index]);
      shift += depth;
      index = (static_cast<uint32_t>(tmp_curr >> shift) & bmask);
      res += blockvec[k + 2] * (szero + table[index]);
      shift += depth;
      index = (static_cast<uint32_t>(tmp_curr >> shift) & bmask);
      res += blockvec[k + 3] * (szero + table[index]);
      shift += depth;
      k += 4;
    }
    j += 1;
  }
  atomicAdd(&mul[col], res);
}
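As a usage sketch (ours, not part of the paper's released code), the kernel above could be launched from the host as follows, assuming it is compiled in the same .cu file, that BLOCKWIDTH (= 256) and BLOCKHEIGHT are the same macros used when packing the matrix, that height is the packed matrix height in 32-bit words and divides evenly by BLOCKHEIGHT, and that all device buffers (input vector, packed matrix, per-group depths and scales, and per-tile start indices i_s and bit shifts) have already been prepared in the layout the kernel expects:

template <typename scalar_t>
void LaunchVecQuantMatMul(const scalar_t* d_vec, const int* d_mat, scalar_t* d_mul,
                          const uint8_t* d_depths, const scalar_t* d_scales,
                          const int* d_i_s, const uint8_t* d_shifts,
                          int height, int width) {
  // one 256-thread block per BLOCKHEIGHT x BLOCKWIDTH tile of the packed matrix
  dim3 threads(BLOCKWIDTH);
  dim3 blocks(height / BLOCKHEIGHT, width / BLOCKWIDTH);
  VecQuant3MatMulKernel<scalar_t><<<blocks, threads>>>(
      d_vec, d_mat, d_mul, d_depths, d_scales, d_i_s, d_shifts, height, width);
}

The output buffer d_mul must be zero-initialized before the launch, since each block accumulates its partial dot products into it with atomicAdd.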
B. Derivation of Equation (5)
To derive our main equation (5), we appeal to a linearized relationship between the model weights and output, as well as standard results from rate–distortion theory (Gersho & Gray, 1991) that relate the quantization error of a random source to the output distortion at a high bit depth, where the linearized model relationship is a good approximation. Let us start with our quantization objective

D(B_1, B_2, \ldots, B_N) = \mathrm{E}_{\mathbf{X}} \bigl\| f\bigl(\mathbf{X}, \mathbf{w}_1^q(B_1), \mathbf{w}_2^q(B_2), \ldots, \mathbf{w}_N^q(B_N)\bigr) - f(\mathbf{X}) \bigr\|_F^2,   (10)

in which $f(\mathbf{X}) = f(\mathbf{X}, \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N)$ denotes the output of the unquantized model given input $\mathbf{X}$. We can write the residual and Jacobian of $f$ at $(\mathbf{X}, \mathbf{w}_1^q, \mathbf{w}_2^q, \ldots, \mathbf{w}_N^q)$ as

\mathbf{r}(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) = (r_1, \ldots, r_M)(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) = f(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) - f(\mathbf{X}),
\mathbf{J}(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) = \Bigl[ \tfrac{\partial f(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)}{\partial \mathbf{w}_1}, \tfrac{\partial f(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)}{\partial \mathbf{w}_2}, \ldots, \tfrac{\partial f(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)}{\partial \mathbf{w}_N} \Bigr],   (11)

and proceed to write the gradient and Hessian of the objective (10) in terms of the $\mathbf{r}$ and $\mathbf{J}$ above as

\nabla D(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) = (\mathbf{J}^T \mathbf{r})(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q),
\nabla^2 D(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) = (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) + \underbrace{\textstyle\sum_{m=1}^{M} (r_m \nabla^2 r_m)(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)}_{\approx\, 0},   (12)

in which the second term of $\nabla^2 D$ is approximately zero, either because the residuals $r_m$ are relatively small or because they are close to affine in $(\mathbf{w}_1^q, \mathbf{w}_2^q, \ldots, \mathbf{w}_N^q)$ so that the $\nabla^2 r_m$ are relatively small, which is the case in the vicinity of the solution.

Using (12), we can now express the local quadratic approximation of (10) about $(\mathbf{w}_1, \ldots, \mathbf{w}_N)$ as

D(B_1, \ldots, B_N) \overset{(\mathrm{a})}{\approx} \mathrm{E}_{\mathbf{X}} \bigl[\mathbf{w}_1 - \mathbf{w}_1^q(B_1), \ldots, \mathbf{w}_N - \mathbf{w}_N^q(B_N)\bigr] \, (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q) \, \bigl[\mathbf{w}_1 - \mathbf{w}_1^q(B_1), \ldots, \mathbf{w}_N - \mathbf{w}_N^q(B_N)\bigr]^T
\qquad\qquad + \underbrace{\mathrm{E}_{\mathbf{X}} \bigl[\mathbf{w}_1 - \mathbf{w}_1^q(B_1), \ldots, \mathbf{w}_N - \mathbf{w}_N^q(B_N)\bigr] \, (\mathbf{J}^T \mathbf{r})(\mathbf{X}, \mathbf{w}_1^q, \mathbf{w}_2^q, \ldots, \mathbf{w}_N^q)}_{=\,0}
\overset{(\mathrm{b})}{=} \sum_{n=1}^{N} \mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})_{nn}(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)\, \epsilon_n^2(B_n) \overset{(\mathrm{c})}{=} \sum_{n=1}^{N} g_n^2\, \gamma_n \sigma_n^2\, 2^{-2B_n} = \sum_{n=1}^{N} d_n(B_n),   (13)

in which the zero expectation of the linear term in (a) follows from the zero means of the quantization errors $\mathbf{w}_1 - \mathbf{w}_1^q, \ldots, \mathbf{w}_N - \mathbf{w}_N^q$, (b) follows from the uncorrelatedness of these errors, and (c) follows from our definition of the gradient variance $g_n^2 = \mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})_{nn}(\mathbf{X}, \mathbf{w}_1^q, \ldots, \mathbf{w}_N^q)$ together with the result from rate–distortion theory (Gersho & Gray, 1991) that relates the variance of the random quantization error, $\epsilon_n^2(B_n) = \gamma_n \sigma_n^2\, 2^{-2B_n}$, to the variance $\sigma_n^2$ of the random source, the coefficient $\gamma_n$, and the bit depth $B_n$ of quantization. Expression (5) for the partial derivatives of $D$ with respect to $B_n$ follows directly from the properties of the derivative of an exponential.
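Spelling out that last step with the notation of (13), differentiating the per-group distortion $d_n(B_n)$ with respect to $B_n$ gives

\frac{\partial D}{\partial B_n} = \frac{\partial d_n}{\partial B_n} = \frac{\partial}{\partial B_n} \bigl( g_n^2\, \gamma_n \sigma_n^2\, 2^{-2B_n} \bigr) = -(2 \ln 2)\, g_n^2\, \gamma_n \sigma_n^2\, 2^{-2B_n} = -(2 \ln 2)\, d_n(B_n),

so each partial derivative is a negative constant multiple of the corresponding distortion term.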
Since (10) is a non-linear least-squares objective and its gradient depends on the gradient variances $g_1^2, g_2^2, \ldots, g_N^2$, its minimization requires an iterative update of $\mathbf{w}_1^q, \mathbf{w}_2^q, \ldots, \mathbf{w}_N^q$ via the choice of $B_1, B_2, \ldots, B_N$ and a re-evaluation of the gradient variances $g_1^2, g_2^2, \ldots, g_N^2$ at $\mathbf{w}_1^q, \mathbf{w}_2^q, \ldots, \mathbf{w}_N^q$. This is similar to the local Hessian evaluated by the Gauss–Newton method (Nocedal & Wright, 2006) every time the descent direction is re-computed. One can think of $g_1^2, g_2^2, \ldots, g_N^2$ as the diagonal elements of a non-diagonal Hessian matrix used in e.g. the Gauss–Newton method, but whose off-diagonal elements vanish in expectation due to multiplication by the uncorrelated quantization errors $\mathbf{w}_1 - \mathbf{w}_1^q, \ldots, \mathbf{w}_N - \mathbf{w}_N^q$.
C. Derivation of Equation (8)
To derive our sigmoid companding function (8), we turn to results from rate–distortion theory that relate the mean square error of the quantization of weights $\theta$ to the density $p_\theta$ of $\theta$ and the density $\lambda(\theta)$ of quantization levels, where $2^B \int_a^b \lambda(\theta)\, d\theta$ expresses the number of quantization levels of a $B$-bit quantizer within any interval $[a, b]$. Writing $C_i$ for the $i$th quantization cell and $\Delta(\theta)$ for the width of the cell containing $\theta$, we can write the mean square error of the quantized weights as

\mathrm{E}\,|\theta - \theta^q|^2 = \sum_{i=1}^{2^B} \mathrm{P}[\theta \in C_i]\, \mathrm{E}\bigl[\,|\theta - \theta^q|^2 \,\big|\, \theta \in C_i\bigr] \overset{(\mathrm{a})}{\approx} \sum_{i=1}^{2^B} \mathrm{P}[\theta \in C_i]\, \frac{|C_i|^2}{12} \overset{(\mathrm{b})}{\approx} \int p_\theta(\theta)\, \frac{\Delta^2(\theta)}{12}\, d\theta \overset{(\mathrm{c})}{=} \frac{2^{-2B}}{12} \int \frac{p_\theta(\theta)}{\lambda^2(\theta)}\, d\theta,   (14)

in which (a) follows from our assumption that the weight distribution is approximately uniform within each quantization cell, (b) follows from an integral approximation of the finite sum, and (c) follows from the relationship $2^{-B} \Delta^{-1}(\theta) = \lambda(\theta)$, all of which hold approximately when $B$ is sufficiently large.

To find the density $\lambda$ of quantization levels that leads to the minimum quantization error when $\theta$ has density $p_\theta$, we use Hölder's inequality, $\int p_\theta^{1/3} \le \bigl(\int p_\theta \lambda^{-2}\bigr)^{1/3} \bigl(\int \lambda\bigr)^{2/3}$. Since $\int \lambda = 1$, we have that $\int p_\theta \lambda^{-2} \ge \bigl(\int p_\theta^{1/3}\bigr)^3$, which sets a lower bound on the last term of (14). This lower bound, and hence the minimum quantization error, is attained iff $p_\theta \lambda^{-2} \propto \lambda$. The optimal density for quantization levels is therefore given by

\lambda(\theta) = p_\theta^{1/3}(\theta)\, \Bigl( \int p_\theta^{1/3}(\vartheta)\, d\vartheta \Bigr)^{-1}.   (15)

Rather than optimize the density $\lambda$ to minimize the quantization error for a given $p_\theta$, we could equivalently transform the weights as $\theta_\sigma = c(\theta)$ via a non-linear $c$, so that uniform quantization applied to $\theta_\sigma \sim p_{\theta_\sigma}$ leads to the same minimum quantization error. The width $\Delta(\theta)$ of the non-uniform quantization cells quantizing $\theta$ relates to the width of the uniform quantization cells of the companded (transformed) weights $\theta_\sigma = c(\theta)$ as

dc(\theta) = c'(\theta)\, d\theta \propto p_\theta^{1/3}(\theta)\, d\theta \;\Longleftrightarrow\; c'(\theta) \propto p_\theta^{1/3}(\theta),   (16)

in which the first proportionality follows from (15). We can find the optimal nonlinear transform $c$ by integrating $p_\theta^{1/3}(\theta)$ and normalizing (for convenience) the range of the integral to $[0, 1]$:

c(\theta) = \frac{\int_{-\infty}^{\theta} p_\theta^{1/3}(\vartheta)\, d\vartheta}{\int_{-\infty}^{\infty} p_\theta^{1/3}(\vartheta)\, d\vartheta}   (17)

(Gersho & Gray, 1991). Finally, we obtain (8) by substituting the expression for the density of a Laplace distribution (parameterized by mean $\mu$ and standard deviation $\sigma$) into $p_\theta^{1/3}$ above. Transform $c$ is asymptotically optimal as $B \to \infty$ in (14).
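As an illustration of that last substitution (a worked sketch using the standard Laplace parameterization with mean $\mu$ and standard deviation $\sigma$, i.e. scale $\sigma/\sqrt{2}$; the exact constants of (8) in the main text should be taken as definitive):

p_\theta(\theta) = \tfrac{1}{\sqrt{2}\,\sigma}\, e^{-\sqrt{2}\,|\theta-\mu|/\sigma} \;\Rightarrow\; p_\theta^{1/3}(\theta) \propto e^{-\sqrt{2}\,|\theta-\mu|/(3\sigma)} \;\Rightarrow\; c(\theta) = \tfrac{1}{2} + \tfrac{1}{2}\,\operatorname{sgn}(\theta-\mu)\bigl(1 - e^{-\sqrt{2}\,|\theta-\mu|/(3\sigma)}\bigr),

a sigmoid-shaped compander that is steepest around the mean $\mu$ and saturates towards 0 and 1 in the tails.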
D. Algorithm Parameters
To aid the reproducibility of the results in Table 1, we document the code we used for all algorithms (RTN, GPTQ, OWQ, and AWQ) along with the command-line arguments.

RTN. We use the OWQ code from https://github.com/xvyaward/owq/tree/03cfc99 in the provided owq conda environment. In the case of e.g. Llama-2-7b-hf quantized to 3 bits, we run python main.py meta-llama/Llama-2-7b-hf c4 --wbits 3 --nearest.

GPTQ. We use the OWQ code from https://github.com/xvyaward/owq/tree/03cfc99 in the provided owq conda environment. In the case of e.g. Llama-2-7b-hf quantized to 3 bits, we run the provided command python main.py meta-llama/Llama-2-7b-hf c4 --wbits 3. For results based on the group size of 256, we run python main.py meta-llama/Llama-2-7b-hf c4 --wbits 3 --groupsize 256.

OWQ. We use the OWQ code from https://github.com/xvyaward/owq/tree/03cfc99 in the provided owq conda environment. In the case of e.g. Llama-2-7b-hf quantized to 3.01 bits, we run the provided command python main.py meta-llama/Llama-2-7b-hf c4 --wbits 3 --target_bit 3.01.

AWQ. We use the AWQ code from https://github.com/mit-han-lab/llm-awq/tree/3665e1a in the provided awq conda environment. In the case of e.g. Llama-2-7b-hf quantized to 3 bits, we run the provided command python -m awq.entry --model_path meta-llama/Llama-2-7b-hf --w_bit 3 --q_group_size 128 --run_awq --tasks wikitext.
E. Output Produced By Different Quantized Models
Table 8 lists the output produced by different quantized Llama-2-70b models in response to questions taken from the GSM8K
dataset. For each question, a prompt is created by prepending five other question-and-target pairs from the dataset to the
question text (known as a 5-shot evaluation). This allows the model to establish a context for the required output and
format. It is interesting to note that severe quantization errors (as in the case of RTN) manifest as non sequiturs and errors
in logic rather than as unintelligible output.
F. Convex Weight Pruning (Hassibi & Stork, 1992)
To facilitate comparison between convex weight quantization (this work) and the convex weight pruning work of Hassibi & Stork (1992), we provide a derivation of Hassibi & Stork's Optimum Brain Surgeon (OBS) algorithm (presented slightly differently), together with our commentary for additional clarification.

For simplicity, let us rewrite model (4) as $f(\mathbf{X}, \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N) = f(\mathbf{X}, \mathbf{w})$, where $\mathbf{w}$ is a vector of all model weights across the different layers of the model. The objective of convex weight pruning is to set some number of the elements of $\mathbf{w}$ to zero while fine-tuning the remaining elements to minimize the difference between the output of the pruned model $f(\mathbf{X}, \mathbf{w}^p)$ and the output of the unpruned model $f(\mathbf{X}, \mathbf{w})$. Writing the pruned weights as $\mathbf{w}^p = \mathbf{w} + \boldsymbol{\delta}^p$, where $\boldsymbol{\delta}^p$ is a vector of updates to be made to the weights $\mathbf{w}$, it is apparent that $\delta_i^p = -w_i$ if the $i$th weight is to be pruned; otherwise $\delta_i^p$ should be chosen to maximally compensate for the effect of the other, pruned weights on the output. Suppose we have decided to prune the $p$th element of $\mathbf{w}$. The updated set of weights $\mathbf{w}^p$ can be found by solving

\underset{\boldsymbol{\delta}^p}{\text{minimize}} \quad d(\boldsymbol{\delta}^p) = \mathrm{E}_{\mathbf{X}} \bigl\| f(\mathbf{X}, \mathbf{w} + \boldsymbol{\delta}^p) - f(\mathbf{X}) \bigr\|_2^2 \approx \mathrm{E}_{\mathbf{X}}\, \boldsymbol{\delta}^{pT} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\, \boldsymbol{\delta}^p
\quad \text{subject to} \quad c(\boldsymbol{\delta}^p) = \mathbf{e}_p^T \boldsymbol{\delta}^p + w_p = 0,   (18)

in which $\mathbf{J}(\mathbf{X}, \mathbf{w})$ represents the Jacobian of $f(\mathbf{X}, \mathbf{w})$ with respect to $\mathbf{w}$, and $\mathbf{e}_p^T$ is an operator that picks out the $p$th element of a vector. The Lagrangian of this problem becomes

L(\boldsymbol{\delta}^p, \nu) = \tfrac{1}{2}\, \mathrm{E}_{\mathbf{X}}\, \boldsymbol{\delta}^{pT} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\, \boldsymbol{\delta}^p + \nu\, (\mathbf{e}_p^T \boldsymbol{\delta}^p + w_p),   (19)

in which $\nu$ represents the dual variable associated with the equality constraint $\mathbf{e}_p^T \boldsymbol{\delta}^p + w_p = 0$.

To solve (18), we differentiate $L$ with respect to $\boldsymbol{\delta}^p$ and $\nu$, and set all obtained derivatives equal to 0 to obtain the first-order optimality conditions $\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\, \boldsymbol{\delta}^p + \nu\, \mathbf{e}_p = \mathbf{0}$ and $\mathbf{e}_p^T \boldsymbol{\delta}^p + w_p = 0$. After some algebraic manipulations, we obtain the optimizing values

\boldsymbol{\delta}^p = -\nu\, \bigl[\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr]^{-1} \mathbf{e}_p, \qquad \nu = w_p\, \Bigl[\bigl(\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr)^{-1}\Bigr]_{pp}^{-1},   (20)

in which the expression for $\nu$ is obtained by substituting the expression for $\boldsymbol{\delta}^p$ above into the second optimality condition $\mathbf{e}_p^T \boldsymbol{\delta}^p + w_p = 0$ and solving for $\nu$. Combining both expressions finally produces an update $\boldsymbol{\delta}^p$ that minimizes the objective in (18):

\boldsymbol{\delta}^p = -\,w_p\, \Bigl[\bigl(\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr)^{-1}\Bigr]_{pp}^{-1} \bigl[\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr]^{-1} \mathbf{e}_p, \qquad d(\boldsymbol{\delta}^p) = \tfrac{1}{2}\, w_p^2\, \Bigl[\bigl(\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr)^{-1}\Bigr]_{pp}^{-1}.   (21)

So far, we assumed that we were given the index $p$ of the weight to prune from $\mathbf{w}$. To actually pick the best weights to prune away, we can compute the pruning loss $d(\boldsymbol{\delta}^i)$ for all indices $i$, picking the index associated with the minimum loss. That is,

p = \operatorname*{argmin}_i\; \tfrac{1}{2}\, w_i^2\, \Bigl[\bigl(\mathrm{E}_{\mathbf{X}} (\mathbf{J}^T \mathbf{J})(\mathbf{X}, \mathbf{w})\bigr)^{-1}\Bigr]_{ii}^{-1},   (22)

after which $\nu$ can be calculated (and consequently $\boldsymbol{\delta}^p$), and further weights to prune can be picked by initializing $\mathbf{w} \leftarrow \mathbf{w}^p$ and repeating the process until some pruning criterion has been met.
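To make the selection-and-update step concrete, here is a minimal host-side sketch (ours, not Hassibi & Stork's code) of a single OBS pruning pass, assuming the inverse matrix $(\mathrm{E}_{\mathbf{X}} \mathbf{J}^T \mathbf{J})^{-1}$ is available as a dense n × n row-major array hinv; it evaluates the loss in (22) for every index, prunes the minimizer, and applies the compensating update (21) in place:

#include <cfloat>
#include <cstddef>

// One OBS pruning step: returns the pruned index p and updates w in place.
// hinv is the n x n inverse of E_X (J^T J), stored row-major.
int ObsPruneStep(float* w, const float* hinv, std::size_t n) {
  std::size_t p = 0;
  float best = FLT_MAX;
  for (std::size_t i = 0; i < n; ++i) {
    float loss = 0.5f * w[i] * w[i] / hinv[i * n + i];  // pruning loss (22)
    if (loss < best) { best = loss; p = i; }
  }
  float scale = w[p] / hinv[p * n + p];
  for (std::size_t i = 0; i < n; ++i)
    w[i] -= scale * hinv[i * n + p];  // compensating update (21)
  w[p] = 0.0f;  // enforce the pruning constraint exactly
  return static_cast<int>(p);
}

A full implementation would also skip indices that have already been pruned and refresh hinv as the weights change; those bookkeeping details are omitted here.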
Table 8: Output to questions from the GSM8K dataset (test) by 3-bit-quantized Llama-2 70B. Scores computed across 1319 answers obtained using the flexible-extract filter. Each block below gives the question, the GSM8K target, and the responses produced by FP16 (Score: 53.90%), RTN (Score: 6.14%), GPTQ (Score: 46.47%), and Ours (Score: 48.60%).

Question: Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?\nAnswer:
Target: If each chicken eats 3 cups of feed per day, then for 20 chickens they would need 3*20=<<3*20=60>>60 cups of feed per day.\nIf she feeds the flock 15 cups of feed in the morning, and 25 cups in the afternoon, then the final meal would require 60-15-25=<<60-15-25=20>>20 cups of chicken feed.\n#### 20
FP16: Wendi feeds each chicken 3 cups of feed per day.\nShe feeds her flock 15 cups of feed in the morning and 25 cups of feed in the afternoon.\nThe total amount of feed she gives her flock is 15 cups + 25 cups = <<15+25=40>>40 cups.\nSince each chicken gets 3 cups of feed per day, the total number of chickens in her flock is 40 cups / 3 cups = <<40/3=13.333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
RTN: 15 cups + 25 cups + 10 cups = <<15+25+10=50>>50 cups of feed.\n#### 50\n
GPTQ: 15 cups in the morning + 25 cups in the afternoon = <<15+25=40>>40 cups of feed\nEach chicken gets 3 cups of feed, so 40 cups of feed / 3 cups per chicken = <<40/3=13.33>>13.33 chickens\nSince there are 20 chickens, the number of cups of feed in the final meal is 20 - 13.33 = <<20-13.33=6.67>>6.67 cups\n#### 6.67\n\n
Ours: 15 cups + 25 cups = <<15+25=40>>40 cups of feed\n20 chickens x 3 cups = <<20*3=60>>60 cups of feed\n60 cups - 40 cups = <<60-40=20>>20 cups of feed\n#### 20\n\n

Question: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?\nAnswer:
Target: It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3
FP16: 2 + 2/2 = <<2+2/2=3>>3 bolts\n#### 3\n\n
RTN: 2 blue bolts + 2 white bolts = <<2+2=4>>4 bolts in total.\n#### 4\n
GPTQ: 2 + 1 = <<2+1=3>>3 bolts\n#### 3\n\n
Ours: 2 + 2/2 = <<2+2/2=3>>3 bolts\n#### 3\n\n

Question: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?\nAnswer:
Target: The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000
FP16: The house is now worth 150% of $80,000, which is $80,000 * 150/100 = <<80000*150/100=120000>>$120,000.\nSo he made $120,000 - $80,000 - $50,000 = <<120000-80000-50000=20000>>$20,000.\n#### 20000\n\n
RTN: Josh bought the house for $80,000 and spent $50,000 on repairs. So he spent $80,000 + $50,000 = $<<80,000+50,000=130,000>>$130,000 on the house.\nSo Josh spent $130,000 on the house.\n#### $130,000\n
GPTQ: Josh bought the house for $80,000 and put in $50,000 in repairs.\nThe value of the house increased by 150%, so the new value is 1.5*$80,000+$50,000 = $<<1.5*80,000+50,000=180,000>>180,000.\nSo Josh made $180,000 - $80,000 - $50,000 = $<<180,000-80,000-50,000=50,000>>50,000 in profit.\n#### 50,000\n\n
Ours: The value of the house increased by 150%, so its now 150% of $80,000, which is $120,000.\nJosh spent $80,000 + $50,000 = <<80000+50000=130000>>$130,000 on the house, so he made $120,000 - $130,000 = <<120000-130000=-10000>>-$10,000.\n#### -10000\n\n

Question: Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?\nAnswer:
Target: The discount price of one glass is 60/100 * 5 = $<<60/100*5=3>>3.\nIf every second glass is cheaper, that means Kylar is going to buy 16 / 2 = <<16/2=8>>8 cheaper glasses.\nSo for the cheaper glasses, Kylar is going to pay 8 * 3 = $<<8*3=24>>24.\nAnd for the regular-priced glasses, Kylar will pay 8 * 5 = $<<8*5=40>>40.\nSo in total Kylar needs to pay 24 + 40 = $<<24+40=64>>64 for the glasses he wants to buy.\n#### 64
FP16: The first glass costs $5.\nThe second glass costs $5 * 60% = $<<5*60%=3>>3.\nThe third glass costs $5.\nThe fourth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe fifth glass costs $5.\nThe sixth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe seventh glass costs $5.\nThe eighth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe ninth glass costs $5.\nThe tenth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe eleventh glass costs $5.\nThe twelfth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe thirteenth glass costs $5.\nThe fourteenth glass costs $5 * 60% = $<<5*60%=3>>3.\nThe fifteenth glass costs $5.\nThe sixteenth glass costs $5 * 60%
RTN: The cost of one glass is $5, and the cost of the second glass is $5 * $<<5*2=10>>10>>10 is the cost of the third glass.\nThe cost of the fourth glass is $5 * $<<5*3=15>>15 is the cost of the fifth glass.\nThe cost of the sixth glass is $5 * $<<5*4=20>>20 is the cost of the seventh glass.\nThe cost of the eighth glass is $5 * $<<5*5=25>>25 is the cost of the ninth glass.\nThe cost of the tenth glass is $5 * $<<5*6=30>>30 is the cost of the eleventh glass.\nThe cost of the twelfth glass is $5 * $<<5*7=35>>35 is the cost of the thirteenth glass.\nThe cost of the fourteenth glass is $5 * $<<5*8=40>>40 is the cost of the fifteenth glass.\nThe cost of the sixteenth glass is $5 * $<<5*9=45>>4
GPTQ: The first glass costs $5.\nThe second glass costs $5 * 60% = $<<5*60%=3>>3.\nThe third glass costs $5 * 60% * 60% = $<<5*60%*60%=1.8>>1.8.\nThe fourth glass costs $5 * 60% * 60% * 60% = $<<5*60%*60%*60%=1.08>>1.08.\nThe fifth glass costs $5 * 60% * 60% * 60% * 60% = $<<5*60%*60%*60%*60%=0.648>>0.648.\nThe sixth glass costs $5 * 60% * 60% * 60% * 60% * 60% = $<<5*60%*60%*60%*60%*60%=0.3888>>0.3888.\nThe seventh glass costs $5 * 6
Ours: The first glass costs $5.\nThe second glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe third glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe fourth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe fifth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe sixth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe seventh glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe eighth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe ninth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe tenth glass costs 60% of $5, which is $<<5*60%=3>>3.\nThe eleventh glass costs 60%

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers’ market?\nAnswer:
Target: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18
FP16: Janet eats 3 eggs for breakfast and bakes 4 muffins, so she uses 3 + 4 = <<3+4=7>>7 eggs.\nShe sells the remainder, so she sells 16 - 7 = <<16-7=9>>9 eggs.\nShe sells each egg for $2, so she makes $2 * 9 = <<2*9=18>>18 dollars.\n#### 18\n\n
RTN: Janet sold 16 eggs per day. She sold three eggs for breakfast every morning, so she sold 16 - 3 = <<16-3=13>>13 eggs for breakfast.\nShe sold four eggs for her friends every day, so she sold 16 - 4 = <<16-4=12>>12 eggs for her friends.\nShe sold twelve eggs at the farmers’ market daily, so she sold twelve eggs at the farmers’ market daily, so she sold twelve eggs at the farmers’ market daily, so she sold twelve eggs at the farmers’ market daily, so she sold twelve eggs at
GPTQ: Janet eats 3 eggs and bakes 4, so she sells 16 - 3 - 4 = <<16-3-4=9>>9 eggs.\nShe sells 9 eggs for $2 each, so she makes $2 * 9 = <<2*9=18>>18 dollars.\n#### 18\n\n
Ours: Janet eats 3 eggs for breakfast and bakes 4 muffins, so she uses 3 + 4 = <<3+4=7>>7 eggs.\nShe sells the remainder at $2 per egg, so she makes $2 * (16 - 7) = <<2*(16-7)=2*9=18>>$18 per day.\n#### 18\n\n