vllm.config.quantization ¶
OnlineQuantScheme ¶
Bases: Enum
Supported online quantization schemes.
Source code in vllm/config/quantization.py
OnlineQuantizationConfigArgs ¶
Configuration for online quantization.
Controls how OnlineQuantizationConfig is applied to a model. At least one of global_scheme, linear_scheme_override, or moe_scheme_override must be set.
Source code in vllm/config/quantization.py
global_scheme class-attribute instance-attribute ¶
global_scheme: OnlineQuantScheme | None = None
Quantization scheme applied to every supported layer.
ignore class-attribute instance-attribute ¶
Layers to skip quantization for. Supports exact names and regex patterns with re: prefix (e.g. re:.*attn.*), consistent with compressed_tensors layer skipping.
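The `re:` convention can be sketched with a small matching helper. This is a hypothetical illustration of compressed_tensors-style skipping, not the actual vLLM matching code:

```python
import re


def should_ignore(layer_name: str, ignore: list[str]) -> bool:
    """Return True if layer_name matches an exact entry or a re:-prefixed pattern."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match against the name.
            if re.match(entry[len("re:"):], layer_name):
                return True
        elif entry == layer_name:
            # Plain entry: exact-name comparison only.
            return True
    return False
```

For example, `re:.*attn.*` skips any layer whose name contains `attn`, while a plain entry skips only a layer with exactly that name.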
linear_scheme_override class-attribute instance-attribute ¶
linear_scheme_override: OnlineQuantScheme | None = None
Quantization scheme override for LinearBase layers.
moe_scheme_override class-attribute instance-attribute ¶
moe_scheme_override: OnlineQuantScheme | None = None
Quantization scheme override for FusedMoE layers.
resolve_online_quant_config ¶
resolve_online_quant_config(
quantization: str | None,
quantization_config: dict[str, Any]
| OnlineQuantizationConfigArgs
| None,
) -> OnlineQuantizationConfigArgs | None
Resolve online quant scheme shorthand into a quantization config.
If quantization is an online quant scheme (e.g. 'fp8_per_tensor'), ensures quantization_config has a matching global_scheme and casts it to OnlineQuantizationConfigArgs if needed.