# GPTQ

## GPTQ Quantizer

### GPTQ `dataclass`

```python
GPTQ(
    name: str = None,
    num_layers: int = None,
    calc_quant_error: bool = False,
    include_layer_names: list[str] = None,
    exclude_layer_names: list[str] = (lambda: ['lm_head'])(),
    include_layer_keywords: list[str] = None,
    exclude_layer_keywords: list[str] = None,
    target_layer_types: tuple = (lambda: (Linear,))(),
    hessian_dtype: dtype = torch.float32,
    module_to_name: dict = dict(),
    results: dict = dict(),
    flag_calibration: bool = True,
    flag_hessian: bool = True,
    flag_xtx: bool = False,
    blocksize: int = 128,
    percdamp: float = 0.01,
    wbits: int = 4,
    groupsize: int = -1,
    actorder: bool = False,
    mse: bool = False,
    sym: bool = True,
    q_grid: int = 600,
    q_norm: float = 2.4,
    mlp_wbits: Optional[int] = None,
    mlp_groupsize: Optional[int] = None,
    module_wbits: Optional[dict[str, int]] = None,
)
```
Bases: Quantizer
GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) quantizer.
Performs layer-wise weight quantization using second-order (Hessian) information to minimize the output reconstruction error. Supports grouped quantization, activation-order column reordering, and MSE-based grid search for optimal scale/zero-point parameters.
GPTQ requires calibration data and Hessian matrix computation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `flag_calibration` | `bool` | Whether calibration data is needed (True for GPTQ). |
| `flag_hessian` | `bool` | Whether the Hessian matrix is needed (True for GPTQ). |
| `blocksize` | `int` | Number of columns quantized together in each block. Larger values may improve quality but increase memory usage. Default is 128. |
| `percdamp` | `float` | Percentage of the average Hessian diagonal added for numerical stability (dampening). Default is 0.01. |
| `wbits` | `int` | Quantization bit width (1-8). Default is 4. |
| `groupsize` | `int` | Number of columns sharing the same scale/zero-point. -1 means per-channel (no grouping). Must be -1 or in 1..blocksize. Default is -1. |
| `actorder` | `bool` | If True, reorder columns by decreasing activation magnitude before quantization (desc_act). Default is False. |
| `mse` | `bool` | If True, use an MSE-based grid search to find optimal scale and zero-point parameters. Default is False. |
| `sym` | `bool` | If True, use symmetric quantization (zero-point fixed at the midpoint). If False, use asymmetric quantization (zero-point computed from the data). Default is True. |
| `q_grid` | `int` | Number of grid points for the MSE-based scale search (used when mse=True). Default is 600. |
| `q_norm` | `float` | Norm exponent for the MSE grid-search error metric (used when mse=True). Default is 2.4. |
Example:

```python
from onecomp.quantizer.gptq import GPTQ

quantizer = GPTQ(wbits=4, groupsize=128)
quantizer = GPTQ(wbits=4, groupsize=128, sym=False, actorder=True)
```
### resolve_bits `staticmethod`

```python
resolve_bits(layer_name: Optional[str], default_bits: int, mlp_bits: Optional[int] = None, module_bits: Optional[dict[str, int]] = None) -> int
```
Resolve the bit width from overrides (GPTQ semantics: module > mlp > default).

Used by the quantizer and by the config loader. If layer_name is None, returns default_bits. Does not validate the range; the caller may validate.
### resolve_groupsize `staticmethod`

```python
resolve_groupsize(layer_name: Optional[str], default_groupsize: int, mlp_groupsize: Optional[int] = None) -> int
```
Resolve group_size (mlp override > default).
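The precedence rules of both resolvers can be sketched as below. These are hypothetical re-implementations for illustration only; in particular, matching `module_bits` by exact layer name and detecting MLP layers via an `"mlp"` substring are assumptions not confirmed by the source.

```python
from typing import Optional


def resolve_bits_sketch(
    layer_name: Optional[str],
    default_bits: int,
    mlp_bits: Optional[int] = None,
    module_bits: Optional[dict] = None,
) -> int:
    # Documented precedence: module override > mlp override > default.
    if layer_name is None:
        return default_bits
    if module_bits and layer_name in module_bits:  # exact-name match (assumed)
        return module_bits[layer_name]
    if mlp_bits is not None and "mlp" in layer_name:  # substring match (assumed)
        return mlp_bits
    return default_bits


def resolve_groupsize_sketch(
    layer_name: Optional[str],
    default_groupsize: int,
    mlp_groupsize: Optional[int] = None,
) -> int:
    # Documented precedence: mlp override > default.
    if layer_name is not None and mlp_groupsize is not None and "mlp" in layer_name:
        return mlp_groupsize
    return default_groupsize
```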
### validate_params

Validate GPTQ parameters once in setup().

Validated ranges:

- `blocksize`: int >= 1
- `percdamp`: float >= 3.95e-4
- `wbits`: int, 1 <= wbits <= 63
- `groupsize`: int, -1 or >= 1
- `q_grid`: int >= 1 (when mse=True)
- `q_norm`: float > 0 (when mse=True)
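The range checks can be sketched as plain assertions. This is a hypothetical stand-in for the real `validate_params`, which may raise a different exception type:

```python
def validate_params_sketch(
    blocksize: int = 128,
    percdamp: float = 0.01,
    wbits: int = 4,
    groupsize: int = -1,
    mse: bool = False,
    q_grid: int = 600,
    q_norm: float = 2.4,
) -> None:
    # Documented ranges; the real method may raise ValueError instead.
    assert blocksize >= 1, "blocksize must be >= 1"
    assert percdamp >= 3.95e-4, "percdamp must be >= 3.95e-4"
    assert 1 <= wbits <= 63, "wbits must be in 1..63"
    assert groupsize == -1 or groupsize >= 1, "groupsize must be -1 or >= 1"
    if mse:
        assert q_grid >= 1, "q_grid must be >= 1 when mse=True"
        assert q_norm > 0, "q_norm must be > 0 when mse=True"
```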
### quantize_layer

Quantize the layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `module` | `Module` | The layer module. | required |
| `input` | `tuple or Tensor` | The input to the layer (input activations). | required |
| `hessian` | `Tensor` | The Hessian matrix. | `None` |
Returns:

| Type | Description |
|---|---|
| `GPTQResult` | GPTQ quantization result object. |
### get_quant_config

Return the quantization_config dict for save_quantized_model (HF/vLLM-compatible keys).

Structure: all keys at the top level (quant_method, bits, group_size, actorder, sym, checkpoint_format, optional mlp_wbits / module_wbits).

When module_wbits is non-empty (a mixed-bit model), quant_method is set to "mixed_gptq" so vLLM can dispatch per-module kernels via the mixed_gptq plugin. The quantization_bits list (indexed by transformer layer) is injected by finalize_quant_config_for_save after the quantized names are known.

checkpoint_format is always "gptq" (v1). OneComp GPTQLinear stores zero-points with the -1 offset convention (v1) unconditionally, so "gptq_v2" would cause an off-by-one mismatch when loaded by vLLM.
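As an illustration, the resulting dicts might look like the sketch below. Key names follow the structure described above; the concrete values are assumptions for a hypothetical 4-bit model:

```python
# Hypothetical quantization_config for a uniform 4-bit model
uniform_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "actorder": False,
    "sym": True,
    "checkpoint_format": "gptq",  # always v1, per the zero-point convention
}

# Hypothetical mixed-bit variant: module_wbits is non-empty,
# so quant_method switches to "mixed_gptq"
mixed_config = {
    **uniform_config,
    "quant_method": "mixed_gptq",
    "module_wbits": {"model.layers.0.mlp.up_proj": 8},
}
```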
### create_inference_layer

Build a GPTQLinear layer from a GPTQResult.
## GPTQResult

### GPTQResult `dataclass`

```python
GPTQResult(
    dequantized_weight: Tensor = None,
    quantization_time: float = None,
    output_squared_error: float = None,
    mean_output_squared_error: float = None,
    weight_squared_error: float = None,
    mean_weight_squared_error: float = None,
    relative_output_squared_error: float = None,
    relative_weight_squared_error: float = None,
    wbits: int = None,
    groupsize: int = None,
    actorder: bool = None,
    sym: bool = None,
    qweight: Optional[Tensor] = None,
    scales: Optional[Tensor] = None,
    qzeros: Optional[Tensor] = None,
    perm: Optional[Tensor] = None,
)
```
Bases: QuantizationResult
GPTQ quantization result class.
Inherits from QuantizationResult and adds GPTQ-specific parameters.
Attributes:

| Name | Type | Description |
|---|---|---|
| `dequantized_weight` | `Tensor` | Dequantized weights (FP16, CPU); inherited from the parent class. |
| `wbits` | `int` | Quantization bit width. |
| `groupsize` | `int` | Group size (-1 means no grouping). |
| `actorder` | `bool` | Whether columns were reordered by activation order. |
| `sym` | `bool` | Whether symmetric quantization was used. |
| `qweight` | `Optional[Tensor]` | Quantized weights (INT type, CPU). |
| `scales` | `Optional[Tensor]` | Scale coefficients (FP16, CPU). |
| `qzeros` | `Optional[Tensor]` | Zero points (FP16, CPU). |
| `perm` | `Optional[Tensor]` | Column permutation order (used when actorder=True). |
Note:

- g_idx (group index) is not stored since it can be computed from groupsize and perm:
    - `g_idx[perm[i]] = i // groupsize` (when actorder=True)
    - `g_idx[i] = i // groupsize` (when actorder=False)
- invperm (inverse permutation) is not stored since it can be computed from perm: `invperm = torch.argsort(perm)`
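The two reconstructions above can be sketched in plain Python (lists instead of tensors for brevity; the real code would operate on torch tensors):

```python
def compute_g_idx(n_cols: int, groupsize: int, perm=None) -> list:
    # actorder=False: column i belongs to group i // groupsize
    base = [i // groupsize for i in range(n_cols)]
    if perm is None:
        return base
    # actorder=True: g_idx[perm[i]] = i // groupsize
    g_idx = [0] * n_cols
    for i, p in enumerate(perm):
        g_idx[p] = base[i]
    return g_idx


def compute_invperm(perm) -> list:
    # List equivalent of invperm = torch.argsort(perm) for a permutation
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv
```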
### compute_dequantized_weight

Compute the dequantized weight from the quantized data and quantization parameters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `device` | `str or device` | Device to compute on. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Dequantized weight tensor (FP16, CPU). |
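A per-row dequantization sketch under assumed shapes (one scale/zero per group of consecutive input columns). The real `compute_dequantized_weight` also handles the actorder permutation and the packed v1 zero-point layout, both omitted here:

```python
def dequantize_row(qrow, scales, zeros, groupsize):
    # qrow: quantized integer values for one output row
    # scales/zeros: one entry per group of `groupsize` consecutive columns
    # Basic affine dequantization: w = scale * (q - zero)
    return [
        scales[j // groupsize] * (q - zeros[j // groupsize])
        for j, q in enumerate(qrow)
    ]
```

For example, with two groups of two columns, `dequantize_row([0, 3, 1, 2], [0.5, 1.0], [2, 1], 2)` applies scale 0.5 / zero 2 to the first pair and scale 1.0 / zero 1 to the second.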