
Radio: Rate–Distortion Optimization for LLM Compression
Sean I. Young
Harvard Medical School / MIT
1. Motivation for Radio in LLM Quantization
• Three main ingredients in modern LLM compression:
• Sensitivity — Some weights are more important (sensitive) than others.
• Grouping — Grouping weights reduces the overall weight entropy.
• Non-uniform — Adapt quantization to the tailed weight distribution.
• These ingredients can be interpreted through an R–D framework.
• Formulate LLM quantization as R–D optimization.
2. Quantization as R–D Optimization
• Formulate quantization of model $f$ as an end-to-end optimization (sketched after this list):
• $B_n$, $n = 1, \ldots, N$ are bit depths (knobs) for weights $\Theta_n^q$.
• $R$ is a given target bit rate (average bit depth).
• Non-linear constrained least-squares problem
• Optimality conditions: all groups operate at a common distortion slope $-V$, with $V$ chosen to meet the target rate $R$ (see Section 3).
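A minimal sketch of this formulation under assumed notation (the poster's exact weighting of the rate constraint may differ): with $\Theta^q(B)$ denoting the weights quantized at bit depths $B = (B_1, \ldots, B_N)$ and $X$ a calibration input,

\begin{aligned}
\min_{B_1,\ldots,B_N}\;\; & \mathbb{E}\,\bigl\| f(X;\,\Theta) - f(X;\,\Theta^{q}(B)) \bigr\|_2^2 \\
\text{subject to}\;\; & \tfrac{1}{N}\textstyle\sum_{n=1}^{N} B_n = R, \qquad B_n \ge 0,
\end{aligned}

i.e., a non-linear least-squares objective with a linear rate constraint.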
3. R–D Optimality and Sensitivity
• Consider two weight groups with R–D curves $d_1(B_1)$ and $d_2(B_2)$.
• $d_n(B_n) \triangleq G_n^2 S_n^2\, 2^{-2 B_n}$.
• $G_n^2$ — gradient variance, $S_n^2$ — weight variance.
• Optimal bit depth is where the distortion slope equals $-V$ (with $V$ chosen so the target rate $R$ is met).
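Working this out under the quadratic-exponential model above (a standard high-resolution form, assumed here): setting every slope to $-V$,

\frac{\partial d_n}{\partial B_n} = -2\ln 2\; G_n^2 S_n^2\, 2^{-2B_n} = -V
\;\;\Rightarrow\;\;
B_n = \tfrac{1}{2}\log_2\!\frac{2\ln 2\, G_n^2 S_n^2}{V},

and enforcing $\tfrac{1}{N}\sum_n B_n = R$ eliminates $V$:

B_n = R + \tfrac{1}{2}\log_2 \frac{G_n^2 S_n^2}{\bigl(\prod_m G_m^2 S_m^2\bigr)^{1/N}},

so groups with larger gradient or weight variance receive more bits (negative allocations would be clipped to zero in practice).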
4. Companded Quantization
• Uniform quantization leaves overly large quantization bins around the most probable values.
• Compand before uniform quantization for asymptotic optimality:
• Companding function for Laplace (light-tailed) distributed weights, where $S$ is the weight variance and $\mu$ is the weight mean.
• An alternative to Lloyd–Max when the weight distribution is parametric (see the sketch below).
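A runnable sketch of companded quantization for Laplace-distributed weights. It assumes the classical high-resolution compander with point density proportional to $p(w)^{1/3}$; the poster's exact companding function may differ, and all names below are illustrative.

import numpy as np

# Sketch of companded quantization for Laplace-distributed weights (assumed
# compander: point density ~ p(w)**(1/3), giving an exponential compressor).
# mu = weight mean, s = weight standard deviation, B = bit depth.

def laplace_compress(w, mu, s):
    """Map weights into [-1, 1] with point density proportional to p(w)**(1/3)."""
    b = s / np.sqrt(2.0)                      # Laplace scale parameter
    z = (w - mu) / (3.0 * b)                  # factor 3 comes from the 1/3 exponent
    return np.sign(z) * (1.0 - np.exp(-np.abs(z)))

def laplace_expand(y, mu, s):
    """Inverse of laplace_compress."""
    b = s / np.sqrt(2.0)
    z = -np.log(1.0 - np.minimum(np.abs(y), 1.0 - 1e-12))
    return mu + np.sign(y) * 3.0 * b * z

def companded_quantize(w, mu, s, B):
    """Compand, uniformly quantize to 2**B midrise levels on [-1, 1], expand."""
    y = laplace_compress(w, mu, s)
    step = 2.0 / (2 ** B)
    yq = (np.floor(y / step) + 0.5) * step    # uniform midrise quantizer
    yq = np.clip(yq, -1.0 + step / 2, 1.0 - step / 2)
    return laplace_expand(yq, mu, s)

# Example: 4-bit companded quantization of synthetic Laplace weights.
rng = np.random.default_rng(0)
w = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=10_000)
wq = companded_quantize(w, mu=w.mean(), s=w.std(), B=4)
print("quantization MSE:", np.mean((w - wq) ** 2))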
5. Grouping Reduces Variance
• Follows directly from Jensen’s inequality (see the note after this list).
• Assign bit depths according to the reduced per-group variances.
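One assumed reading of the Jensen argument, using the R–D model of Section 3 with equal-size groups: since $\log$ is concave,

\tfrac{1}{N}\sum_{n=1}^{N}\log S_n^2 \;\le\; \log\Bigl(\tfrac{1}{N}\sum_{n=1}^{N} S_n^2\Bigr),

so the geometric mean of the per-group variances never exceeds the overall variance; because the allocated-rate distortion of Section 3 scales with this geometric mean (alongside the gradient-variance terms), the effective variance seen by the bit allocator can only decrease after grouping.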
6. Iterative (Re-) Quantization
• Quantization changes the sensitivity of weights: quantize, re-estimate, repeat (sketched below).
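A self-contained toy sketch of that loop on synthetic weight groups. The bit allocation follows the equal-slope rule from Section 3; sensitivity() is a deliberately simplistic stand-in for re-estimating gradient variances on the quantized model. Everything here is illustrative, not the poster's implementation.

import numpy as np

# Toy quantize / re-estimate / repeat loop (illustrative stand-ins throughout).
rng = np.random.default_rng(0)
groups = [rng.laplace(scale=s, size=4096) for s in (0.2, 1.0, 3.0)]
target_rate = 4.0   # target average bit depth R

def allocate_bits(g2, s2, rate):
    """Equal-slope allocation for d_n(B) = G_n^2 S_n^2 2^(-2B), clipped at 0 bits."""
    half_log = 0.5 * np.log2(g2 * s2)
    return np.clip(rate + half_log - half_log.mean(), 0.0, None)

def quantize(w, bits):
    """Plain uniform quantization of one weight group to about 2**bits levels."""
    if bits <= 0:
        return np.zeros_like(w)
    step = (w.max() - w.min()) / (2 ** bits)
    return np.round(w / step) * step

def sensitivity(w):
    """Stand-in for a gradient-variance estimate on the (re)quantized model."""
    return 1.0 + np.var(w)   # placeholder: any estimate that depends on w

g2 = np.array([sensitivity(w) for w in groups])
s2 = np.array([w.var() for w in groups])
for it in range(3):                       # quantize, re-estimate, repeat
    bits = allocate_bits(g2, s2, target_rate)
    quantized = [quantize(w, b) for w, b in zip(groups, bits)]
    g2 = np.array([sensitivity(wq) for wq in quantized])
    print(f"iter {it}: bits = {np.round(bits, 2)}")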
7. Model Compression Experiments