What Is Model Compression?
Reducing ML model size
Model compression is a set of techniques that reduce the size and computational requirements of ML models without a significant loss in quality.
Compression Methods
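- Quantization: storing weights (and often activations) at lower numeric precision, e.g. FP32 → INT8
- Pruning: removing weights or connections that contribute little to the output
- Knowledge distillation: training a small "student" model to reproduce the outputs of a large "teacher" model
- Low-rank factorization: approximating weight matrices as products of smaller matrices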
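The four methods above can be illustrated with a minimal, standard-library-only Python sketch. It is a toy illustration and not how any particular framework implements these techniques. All function names (`quantize_int8`, `prune_by_magnitude`, `distillation_loss`, `low_rank_param_count`) are hypothetical. The sketch assumes symmetric per-tensor quantization, unstructured magnitude pruning, and a KL-divergence distillation loss on temperature-softened outputs. Low-rank factorization is shown only as parameter-count arithmetic, not as an actual decomposition.

```python
import math


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices of the k weights closest to zero.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def low_rank_param_count(rows, cols, rank):
    """Parameters needed when a rows x cols matrix is replaced by
    a (rows x rank) factor times a (rank x cols) factor."""
    return rank * (rows + cols)


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.01, 0.9]  # made-up example weights
    q, scale = quantize_int8(weights)
    print("quantized:", q, "scale:", scale)
    print("restored: ", dequantize(q, scale))
    print("pruned:   ", prune_by_magnitude(weights, sparsity=0.5))
    print("KD loss:  ", distillation_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0]))
    print("low-rank: ", low_rank_param_count(1024, 1024, 64), "vs", 1024 * 1024)
```

In practice these steps are usually followed by fine-tuning or calibration to recover accuracy, which this sketch leaves out.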
Benefits
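- Size reduction, typically 4-10x (FP32 → INT8 quantization alone gives about 4x)
- Inference speedup of roughly 2-5x, depending on hardware and runtime support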
- Reduced power consumption
- Edge device deployment
- Infrastructure cost savings
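The size figure for quantization follows from bytes per weight: an FP32 weight takes 4 bytes and an INT8 weight takes 1. The short sketch below works through that arithmetic. The helper names and the 100-million-parameter model are hypothetical, and the calculation ignores metadata such as quantization scales, so real files will differ somewhat.

```python
# Storage cost per weight for a few common numeric formats.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_mb(num_params, dtype):
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e6


def compression_ratio(num_params, src_dtype, dst_dtype):
    """How many times smaller the weights become after changing precision."""
    return model_size_mb(num_params, src_dtype) / model_size_mb(num_params, dst_dtype)


if __name__ == "__main__":
    params = 100_000_000  # hypothetical model size
    for dtype in BYTES_PER_WEIGHT:
        print(dtype, model_size_mb(params, dtype), "MB")
    print("fp32 -> int8 ratio:", compression_ratio(params, "fp32", "int8"))
```

Ratios beyond 4x generally come from combining quantization with other methods, such as pruning or distillation into a smaller architecture.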
Applications
- Mobile applications
- IoT and embedded systems
- Browser-based ML apps
- Real-time systems
- Autonomous devices