
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. Think you have Solved Question Answering?
Try ARC, the AI2 Reasoning Challenge.
http://arxiv.org/abs/1803.05457, 2018.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian et al.
Training verifiers to solve math word problems.
http://arxiv.org/abs/2110.14168, 2021.
Thomas M. Cover, and Joy A. Thomas. Elements of
Information Theory (Wiley Series in
Telecommunications and Signal Processing). Wiley-
Interscience, USA 2006.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke
Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication
for transformers at scale. In Proc. NeurIPS, 2022.
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian et al.
SpQR: A Sparse-Quantized Representation for near-
lossless LLM weight compression.
http://arxiv.org/abs/2306.03078, 2023.
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W.
Mahoney, and Kurt Keutzer. HAWQ: Hessian AWare
Quantization of neural networks with mixed precision.
In Proc. ICCV, 2019.
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev,
Elias Frantar, Artem Babenko, and Dan Alistarh.
Extreme compression of large language models via
additive quantization. In Proc. ICML, 2024.
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani,
Rathinakumar Appuswamy, and Dharmendra S. Modha.
Learned step size quantization. In Proc. ICLR, 2020.
Elias Frantar, and Dan Alistarh. Optimal Brain
Compression: A framework for accurate post-training
quantization and pruning. In Proc. NeurIPS, 2022.
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan
Alistarh. OPTQ: Accurate quantization for generative
pre-trained transformers. In Proc. ICLR, 2023.
Allen Gersho, and Robert M. Gray. Vector Quantization
and Signal Compression. Kluwer, Norwell, MA, USA
1991.
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev.
Compressing deep convolutional networks using vector
quantization. In Proc. ICLR, 2015.
R.M. Gray, and D.L. Neuhoff. Quantization. IEEE Trans.
Inf. Theory, 44(6):2325–2383, 1998.
Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai
Wong, and Hao Yu. APTQ: Attention-aware post-
training mixed-precision quantization for large
language models. In Proc. DAC, 2024.
Babak Hassibi, and David Stork. Second order derivatives
for network pruning: Optimal Brain Surgeon. In Proc.
NIPS, 1992.
Lu Hou, and James T. Kwok. Loss-aware weight
quantization of deep networks. In Proc. ICLR, 2018.
Wei Huang, Haotong Qin, Yangdong Liu et al. SliM-LLM:
Salience-driven mixed-precision quantization for large
language models. https://arxiv.org/abs/2405.14917v1,
2024.
Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and
Daniel Soudry. Accurate post training quantization with
small calibration sets. In Proc. ICML, 2021.
Benoit Jacob, Skirmantas Kligys, Bo Chen et al.
Quantization and training of neural networks for
efficient integer-arithmetic-only inference. In Proc.
CVPR, 2018.
Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang,
Zhangyang Wang, and Yinfei Yang. Compressing LLMs: the
truth is rarely pure and never simple. In Proc. ICLR, 2024.
Yongkweon Jeon, Chungman Lee, Kyungphil Park, and
Ho-young Kim. A Frustratingly Easy Post-Training
Quantization Scheme for LLMs. In Proc. EMNLP, 2023.
Sehoon Kim et al. SqueezeLLM: Dense-and-sparse
quantization. In Proc. ICML, 2024.
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim,
and Eunhyeok Park. OWQ: Outlier-aware weight
quantization for efficient fine-tuning and inference of
large language models. In Proc. AAAI, 2024.
Ji Lin, Jiaming Tang, Haotian Tang et al. AWQ: Activation-
aware Weight Quantization for on-device LLM
compression and acceleration. In Proc. MLSys, 2024.
S. Lloyd. Least squares quantization in PCM. IEEE Trans.
Inf. Theory, 28(2):129–137, 1982.
J. Max. Quantizing for minimum distortion. IRE Trans. Inf.
Theory, 6(1):7–12, 1960.
Stephen Merity, Caiming Xiong, James Bradbury, and
Richard Socher. Pointer sentinel mixture models. In
Proc. ICLR, 2017.
Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and
Max Welling. Data-free quantization through weight
equalization and bias correction. In Proc. ICCV, 2019.
Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii
Zheltonozhskii, Ron Banner, Alex M. Bronstein, and
Avi Mendelson. Loss aware post-training quantization.
Machine Learning, 110:3245–3262, 2021.
Jorge Nocedal, and Stephen J. Wright. Numerical
Optimization. Springer, New York, NY, USA 2009.
Biao Qian, Yang Wang, Richang Hong, and Meng Wang.
Adaptive data-free quantization. In Proc. CVPR, 2023.
Zhongnan Qu, Zimu Zhou, Yun Cheng, and Lothar Thiele.
Adaptive loss-aware quantization for multi-bit networks.
In Proc. CVPR, 2020.