SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
Updated Mar 19, 2026 · Python
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
Compresses context data to reduce memory use and improve performance in C++ large language model applications built with the llm-cpp toolkit.
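The repositories above implement SmoothQuant, which makes activations easier to quantize by migrating their per-channel outliers into the weights with an equivalent rescaling. A minimal NumPy sketch of that scaling idea follows; the function name, the `alpha` default, and the example magnitudes are illustrative and not any listed toolkit's API:

```python
import numpy as np

def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel smoothing factor s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return act_absmax**alpha / weight_absmax**(1 - alpha)

# Activations concentrate outliers in a few channels; weights are fairly uniform.
act_absmax = np.array([60.0, 0.5, 1.2, 40.0])   # per-channel activation max magnitudes
w_absmax = np.array([0.4, 0.5, 0.3, 0.6])        # per-channel weight max magnitudes

s = smooth_scales(act_absmax, w_absmax)

# Dividing activations by s and multiplying weights by s leaves X @ W
# unchanged while flattening the activation range, so a single per-tensor
# INT8 scale covers it with far less clipping error.
smoothed_act = act_absmax / s
smoothed_w = w_absmax * s
```

At `alpha=0.5` the smoothed activation and weight magnitudes meet in the middle (both become `sqrt(act * w)` per channel), which is why 0.5 is a common default; `alpha` can be shifted toward 1.0 when activations are much harder to quantize than weights.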