Radio: Rate–Distortion Optimization for LLM Compression
Sean I. Young
Harvard Medical School / MIT
1. Motivation for Radio for LLM Quantization
Three main ingredients in modern LLM compression:
Sensitivity — Some weights more important/sensitive than others.
Grouping — Grouping weights reduces the overall weight entropy.
Non-uniform — Adapt quantization to the long-tailed weight distribution.
These ingredients can all be interpreted through an R–D framework.
Formulate LLM quantization as R–D optimization.
2. Quantization as R–D Optimization
Formulate quantization of model 𝑓 as end-to-end optimization:
$B_n$, $n = 1, \dots, N$ are bit depths (knobs) for quantized weights $\boldsymbol{\Theta}^q_n$
𝑅 is a given target bit rate (average bit depth)
Non-linear constrained least-squares problem
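A plausible form of the program (a sketch: the unweighted average-rate constraint and the squared-error measure over calibration inputs $\mathbf{X}$ are assumptions here):
\[
\min_{B_1,\dots,B_N}\ \bigl\| f(\mathbf{X};\, \boldsymbol{\Theta}^q_1,\dots,\boldsymbol{\Theta}^q_N) - f(\mathbf{X};\, \boldsymbol{\Theta}) \bigr\|_2^2
\quad \text{s.t.} \quad \frac{1}{N}\sum_{n=1}^{N} B_n = R,
\]
with $\boldsymbol{\Theta}^q_n$ the $n$-th weight group quantized to $B_n$ bits.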
Optimality conditions:
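Under a Lagrangian treatment of the rate constraint (a sketch under the same assumptions as above), the conditions take the equal-slope form:
\[
\frac{\partial d_n}{\partial B_n}(B_n^\star) = V \quad \text{for all } n, \qquad \frac{1}{N}\sum_{n=1}^{N} B_n^\star = R,
\]
where $d_n$ is the distortion contributed by weight group $n$: every group operates at the same distortion slope $V$, chosen so the rate budget $R$ is met (the condition restated in Section 3).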
3. R–D Optimality and Sensitivity
Consider two weight groups with R–D curves $d_1(B_1)$ and $d_2(B_2)$.
\[
d_n(B_n) \approx G_n^2\, S_n^2\, 2^{-2B_n}.
\]
$G_n^2$ — gradient variance, $S_n^2$ — weight variance.
The optimal bit depth for each group is where its distortion slope equals 𝑉 (with the common slope 𝑉 chosen so that the rate target 𝑅 is met).
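A minimal sketch of this allocation, assuming the $2^{-2B_n}$ distortion model above and an unweighted average-bit constraint; the closed form follows from equating the slopes $\partial d_n / \partial B_n$ across groups. Rounding and clipping to non-negative integer bit depths are omitted.

import numpy as np

def allocate_bits(G2, S2, R):
    """Equal-slope bit allocation under d_n(B) ~ G_n^2 * S_n^2 * 2^(-2B).

    G2 : per-group gradient variances (assumed positive)
    S2 : per-group weight variances (assumed positive)
    R  : target average bit depth
    Returns real-valued bit depths B_n with mean(B_n) == R; rounding and
    clipping to valid integer bit depths are left to the caller.
    """
    log_gs = 0.5 * (np.log2(G2) + np.log2(S2))   # log2(G_n * S_n)
    return R + log_gs - log_gs.mean()            # equal-slope solution

# Toy usage: more sensitive / higher-variance groups receive more bits.
G2 = np.array([1.0, 4.0, 0.25])
S2 = np.array([1.0, 1.0, 1.0])
print(allocate_bits(G2, S2, R=4.0))              # -> [4. 5. 3.]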
4. Companded Quantization
Uniform quantization assigns large quantization bins even to the most probable weight values.
Compand before uniform quantization for asymptotic optimality:
Companding function for Laplace (light-tailed) distributed weights:
𝑆 is weight variance and 𝜇 is weight mean.
Alternative to Lloyd–Max if weight distribution is parametric.
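A sketch of companded quantization for Laplace-like weights. The compander used here is the standard asymptotically optimal one for a Laplace density (quantization point density proportional to the cube root of the density); treating 𝑆 as the weight variance with Laplace scale $b = \sqrt{S/2}$ is an assumption, not a quote of the companding function above.

import numpy as np

def compand_quantize(theta, B, mu, S):
    """Compand, uniformly quantize on (-1, 1), then expand.

    theta : weight array, B : bit depth, mu : weight mean,
    S     : weight variance (Laplace scale b = sqrt(S / 2), an assumption).
    """
    b = np.sqrt(S / 2.0)                                    # Laplace scale
    x = theta - mu
    # Compress: map the real line onto (-1, 1), spending more resolution
    # near the mode (point density ~ p(theta)^(1/3) for a Laplace density).
    y = np.sign(x) * (1.0 - np.exp(-np.abs(x) / (3.0 * b)))
    # Uniform (midrise) quantization of the companded values on (-1, 1).
    step = 2.0 / (2 ** B)
    yq = (np.floor(y / step) + 0.5) * step
    yq = np.clip(yq, -1.0 + step / 2, 1.0 - step / 2)
    # Expand: invert the compander.
    return mu - 3.0 * b * np.sign(yq) * np.log(1.0 - np.abs(yq))

# Toy usage on synthetic Laplace weights, quantized to 4 bits.
rng = np.random.default_rng(0)
w = rng.laplace(loc=0.0, scale=0.02, size=10_000)
wq = compand_quantize(w, B=4, mu=w.mean(), S=w.var())
print(np.mean((w - wq) ** 2))                               # quantization MSE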
5. Grouping Reduces Variance
Follows directly from Jensen's inequality (sketched below).
Assign bit depth according to reduced variances.
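One way to see this, assuming equal-sized, zero-mean groups and the high-rate model of Section 3 (where distortion at a fixed rate scales with variance): by concavity of $\log_2$,
\[
\frac{1}{K}\sum_{k=1}^{K}\log_2 S_k^2 \;\le\; \log_2\!\Bigl(\frac{1}{K}\sum_{k=1}^{K} S_k^2\Bigr),
\]
so coding $K$ groups with their own variances $S_k^2$ (and optimally allocated bits) incurs no more distortion than coding all weights against the pooled variance, and the gap grows with the spread of the $S_k^2$.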
6. Iterative (Re-) Quantization
Quantization changes sensitivity of weights. Quantize, Re-estimate, Repeat.
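A minimal sketch of the loop. Here uniform_quantize is a simple stand-in for the companded quantizer of Section 4, allocate_bits is the sketch from Section 3, and grad_var_fn is a hypothetical caller-supplied estimator of per-group gradient variances (e.g. via backprop through the current quantized model on calibration data); none of these names come from the released Radio code.

import numpy as np

def uniform_quantize(w, bits):
    """Symmetric uniform quantizer (stand-in for the companded quantizer)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 0.5)
    q = np.round(w / scale)
    return np.clip(q, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def iterative_quantize(weights, grad_var_fn, R, n_iters=3):
    """weights : list of per-group weight arrays, R : target average bit depth."""
    S2 = np.array([w.var() for w in weights])
    quantized = list(weights)                       # start from full precision
    for _ in range(n_iters):
        # 1. Re-estimate sensitivities on the *current* quantized model.
        G2 = grad_var_fn(quantized)
        # 2. Re-solve the bit allocation (rounding may drift slightly from R).
        B = np.clip(np.round(allocate_bits(G2, S2, R)), 2, 8).astype(int)
        # 3. Re-quantize the original weights at the new bit depths.
        quantized = [uniform_quantize(w, b) for w, b in zip(weights, B)]
    return quantized, B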
7. Model Compression Experiments