38 lines (33 loc) · 1.85 KB

Implemented Compression Methods

Post-training Compression

Post Training Quantization (PTQ) (OpenVINO, PyTorch, TorchFX, ONNX)
- Symmetric and asymmetric quantization modes
- Signed and unsigned
- Per tensor/per channel
- Each backend support export to the OpenVINO format
Weights compression (OpenVINO, PyTorch, TorchFX, ONNX)
- Symmetric 8 bit compression mode
- Symmetric and asymmetric 4 bit compression mode
- NF4 compression mode
- Arbitrary look-up table (CODEBOOK) or predefined lookup table based on NF4 (CB4)
- MX-compliant types - MXFP4 and MXFP8_E4M3
- FP types - FP8_E4M3 and FP4
- NVFP4 type
- Mixed precision weights compression
- Grouped weights compression

Training Time Compression

Quantization Aware Training (QAT) (PyTorch)
- Training of a quantized model after the Post Training Quantization
- Symmetric and asymmetric quantization modes
- Signed and unsigned
- Per tensor/per channel
- Exports to OpenVINO format
Weight-Only Quantization-Aware Training (QAT) with absorbable Low-Rank Adapters (LoRA) (PyTorch)
- Post Training Weight Compression as initialization
- 2 formats (FQ_LORA and FQ_LORA_NLS) for 2 use cases: general accuracy improvement via distillation and tuning for downstream tasks
- Symmetric and asymmetric quantization modes
- Signed and unsigned
- Per channel quantization for 8bit and group-wise quantization for 4bit
- Exports to OpenVINO format with packed weight constant and decompressor
Pruning (PyTorch)
- Unstructured pruning