What is Quantization
Reducing computation precision for speed
Quantization is a neural network optimization technique in which model weights and activations are converted from a high-precision format (typically FP32) to a lower-precision one (such as INT8 or INT4). This reduces model size and speeds up inference, usually at the cost of a small loss in accuracy.
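As a rough illustration of the conversion described above, the following sketch maps a short list of floating-point weights to signed 8-bit integers and back. It assumes a simple affine scheme with one shared scale and zero point for the whole list. The function names and the sample weights are hypothetical, and real toolkits typically add per-channel scales and other refinements.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale and zero point."""
    qmin, qmax = -128, 127
    # The representable range is widened to include 0.0 so that zero maps exactly.
    lo = min(values + [0.0])
    hi = max(values + [0.0])
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(qmin - lo / scale)
    # Round to the nearest integer step, shift by the zero point, clamp to the INT8 range.
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.2, -0.3, 0.0, 0.45, 2.0]  # hypothetical sample weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Largest rounding error introduced by the round trip.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each stored value now needs 8 bits instead of 32, and the round-trip error stays within roughly one quantization step (`scale`).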
Types of Quantization
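- Post-Training Quantization (PTQ): applied to an already trained model, with no retraining
- Quantization-Aware Training (QAT): quantization is simulated during training so the model learns to tolerate reduced precision
- Dynamic Quantization: weights are quantized ahead of time, while activation ranges are computed on the fly during inference
- Static Quantization: weight and activation ranges are fixed ahead of time using a calibration dataset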
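The difference between static and dynamic quantization mostly comes down to when the activation scale is chosen. The sketch below contrasts the two under simplifying assumptions: symmetric INT8 quantization, a single scale per tensor, and made-up calibration and inference data. The helper names are hypothetical and do not come from any particular library.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest step and clamp to the symmetric INT8 range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: the scale is fixed before deployment from calibration batches
# (hypothetical data standing in for a representative dataset).
calibration_batches = [[0.1, -0.8, 0.5], [1.9, -0.2, 0.7]]
static_scale = symmetric_scale([v for batch in calibration_batches for v in batch])

# Dynamic: the scale is recomputed from each input at inference time.
activations = [0.05, -0.4, 0.25]
dynamic_scale = symmetric_scale(activations)

q_static = quantize(activations, static_scale)
q_dynamic = quantize(activations, dynamic_scale)
print(q_static, q_dynamic)
```

In this toy case the dynamic scale uses the full integer range for the current input, while the static scale is cheaper at run time because nothing has to be measured per input.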
Precision Formats
- FP32: 32-bit floating point (the original, full-precision format)
- FP16: 16-bit floating point (half precision, 2x compression)
- INT8: 8-bit integer (4x compression relative to FP32)
- INT4: 4-bit integer (8x compression relative to FP32)
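The compression factors above follow directly from the number of bits per stored value. A minimal sketch of the arithmetic, assuming a hypothetical model with 100 million parameters and ignoring the small overhead of storing scales and zero points:

```python
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def model_size_mb(num_params, fmt):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e6


def compression_vs_fp32(fmt):
    """How many times smaller a format is than FP32."""
    return BITS_PER_VALUE["FP32"] / BITS_PER_VALUE[fmt]


num_params = 100_000_000  # hypothetical 100M-parameter model
sizes = {fmt: model_size_mb(num_params, fmt) for fmt in BITS_PER_VALUE}

for fmt, size in sizes.items():
    print(f"{fmt}: {size:.0f} MB ({compression_vs_fp32(fmt):.0f}x vs FP32)")
```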
Benefits
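- Model size reduction by 2-8x relative to FP32, depending on the target format
- Inference speedup of up to 2-4x on hardware with native low-precision support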
- Reduced power consumption
- Ability to run on edge devices
Tools
- TensorRT (NVIDIA)
- ONNX Runtime
- PyTorch quantization
- TensorFlow Lite