Radio: Rate–Distortion Optimization for LLM Compression
Sean I. Young
Harvard Medical School / MIT
1. Motivation for Radio for LLM Quantization
Three main ingredients in modern LLM compression:
Sensitivity — Some weights more important/sensitive than others.
Grouping — Grouping weights reduces the overall weight entropy.
Non-uniform — Adapt quantization to the long-tailed weight distribution.
These ingredients can all be interpreted through an R–D framework.
Formulate LLM quantization as R–D optimization.
2. Quantization as R–D Optimization
Formulate quantization of model 𝑓 as end-to-end optimization:
$B_n$, $n = 1, \dots, N$ are bit depths (knobs) for quantized weights $\boldsymbol{\Theta}^q_n$
𝑅 is a given target bit rate (average bit depth)
Non-linear constrained least-squares problem
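A plausible form of the program (a sketch: the unweighted average-rate constraint and the squared-error measure over calibration inputs $\mathbf{X}$ are assumptions here):
\[
\min_{B_1,\dots,B_N}\ \bigl\| f(\mathbf{X};\, \boldsymbol{\Theta}^q_1,\dots,\boldsymbol{\Theta}^q_N) - f(\mathbf{X};\, \boldsymbol{\Theta}) \bigr\|_2^2
\quad \text{s.t.} \quad \frac{1}{N}\sum_{n=1}^{N} B_n = R,
\]
with $\boldsymbol{\Theta}^q_n$ the $n$-th weight group quantized to $B_n$ bits.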
Optimality conditions:
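Under a Lagrangian treatment of the rate constraint (a sketch under the same assumptions as above), the conditions take the equal-slope form:
\[
\frac{\partial d_n}{\partial B_n}(B_n^\star) = V \quad \text{for all } n, \qquad \frac{1}{N}\sum_{n=1}^{N} B_n^\star = R,
\]
where $d_n$ is the distortion contributed by weight group $n$: every group operates at the same distortion slope $V$, chosen so the rate budget $R$ is met (the condition restated in Section 3).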
3. R–D Optimality and Sensitivity
Consider two weight groups with R–D curves $d_1(B_1)$ and $d_2(B_2)$.
\[
d_n(B_n) \approx G_n^2\, S_n^2\, 2^{-2B_n}.
\]
$G_n^2$ — gradient variance, $S_n^2$ — weight variance.
The optimal bit depth for each group is where its distortion slope equals 𝑉 (with the common slope 𝑉 chosen so that the rate target 𝑅 is met).
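A minimal sketch of this allocation, assuming the $2^{-2B_n}$ distortion model above and an unweighted average-bit constraint; the closed form follows from equating the slopes $\partial d_n / \partial B_n$ across groups. Rounding and clipping to non-negative integer bit depths are omitted.

import numpy as np

def allocate_bits(G2, S2, R):
    """Equal-slope bit allocation under d_n(B) ~ G_n^2 * S_n^2 * 2^(-2B).

    G2 : per-group gradient variances (assumed positive)
    S2 : per-group weight variances (assumed positive)
    R  : target average bit depth
    Returns real-valued bit depths B_n with mean(B_n) == R; rounding and
    clipping to valid integer bit depths are left to the caller.
    """
    log_gs = 0.5 * (np.log2(G2) + np.log2(S2))   # log2(G_n * S_n)
    return R + log_gs - log_gs.mean()            # equal-slope solution

# Toy usage: more sensitive / higher-variance groups receive more bits.
G2 = np.array([1.0, 4.0, 0.25])
S2 = np.array([1.0, 1.0, 1.0])
print(allocate_bits(G2, S2, R=4.0))              # -> [4. 5. 3.]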
4. Companded Quantization
Uniform quantization assigns large quantization bins even to the most probable weight values.
Compand before uniform quantization for asymptotic optimality:
Companding function for Laplace (light-tailed) distributed weights:
𝑆 is weight variance and 𝜇 is weight mean.
Alternative to Lloyd–Max if weight distribution is parametric.
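A sketch of companded quantization for Laplace-like weights. The compander used here is the standard asymptotically optimal one for a Laplace density (quantization point density proportional to the cube root of the density); treating 𝑆 as the weight variance with Laplace scale $b = \sqrt{S/2}$ is an assumption, not a quote of the companding function above.

import numpy as np

def compand_quantize(theta, B, mu, S):
    """Compand, uniformly quantize on (-1, 1), then expand.

    theta : weight array, B : bit depth, mu : weight mean,
    S     : weight variance (Laplace scale b = sqrt(S / 2), an assumption).
    """
    b = np.sqrt(S / 2.0)                                    # Laplace scale
    x = theta - mu
    # Compress: map the real line onto (-1, 1), spending more resolution
    # near the mode (point density ~ p(theta)^(1/3) for a Laplace density).
    y = np.sign(x) * (1.0 - np.exp(-np.abs(x) / (3.0 * b)))
    # Uniform (midrise) quantization of the companded values on (-1, 1).
    step = 2.0 / (2 ** B)
    yq = (np.floor(y / step) + 0.5) * step
    yq = np.clip(yq, -1.0 + step / 2, 1.0 - step / 2)
    # Expand: invert the compander.
    return mu - 3.0 * b * np.sign(yq) * np.log(1.0 - np.abs(yq))

# Toy usage on synthetic Laplace weights, quantized to 4 bits.
rng = np.random.default_rng(0)
w = rng.laplace(loc=0.0, scale=0.02, size=10_000)
wq = compand_quantize(w, B=4, mu=w.mean(), S=w.var())
print(np.mean((w - wq) ** 2))                               # quantization MSE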
5. Grouping Reduces Variance
Follows directly from Jensen's inequality (sketched below).
Assign bit depth according to reduced variances.
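One way to see this, assuming equal-sized, zero-mean groups and the high-rate model of Section 3 (where distortion at a fixed rate scales with variance): by concavity of $\log_2$,
\[
\frac{1}{K}\sum_{k=1}^{K}\log_2 S_k^2 \;\le\; \log_2\!\Bigl(\frac{1}{K}\sum_{k=1}^{K} S_k^2\Bigr),
\]
so coding $K$ groups with their own variances $S_k^2$ (and optimally allocated bits) incurs no more distortion than coding all weights against the pooled variance, and the gap grows with the spread of the $S_k^2$.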
6. Iterative (Re-) Quantization
Quantization changes sensitivity of weights. Quantize, Re-estimate, Repeat.
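A minimal sketch of the loop. Here uniform_quantize is a simple stand-in for the companded quantizer of Section 4, allocate_bits is the sketch from Section 3, and grad_var_fn is a hypothetical caller-supplied estimator of per-group gradient variances (e.g. via backprop through the current quantized model on calibration data); none of these names come from the released Radio code.

import numpy as np

def uniform_quantize(w, bits):
    """Symmetric uniform quantizer (stand-in for the companded quantizer)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 0.5)
    q = np.round(w / scale)
    return np.clip(q, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def iterative_quantize(weights, grad_var_fn, R, n_iters=3):
    """weights : list of per-group weight arrays, R : target average bit depth."""
    S2 = np.array([w.var() for w in weights])
    quantized = list(weights)                       # start from full precision
    for _ in range(n_iters):
        # 1. Re-estimate sensitivities on the *current* quantized model.
        G2 = grad_var_fn(quantized)
        # 2. Re-solve the bit allocation (rounding may drift slightly from R).
        B = np.clip(np.round(allocate_bits(G2, S2, R)), 2, 8).astype(int)
        # 3. Re-quantize the original weights at the new bit depths.
        quantized = [uniform_quantize(w, b) for w, b in zip(weights, B)]
    return quantized, B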
7. Model Compression Experiments