GPTQ

GPTQ Quantizer

GPTQ dataclass

GPTQ(
    name: str = None,
    num_layers: int = None,
    calc_quant_error: bool = False,
    include_layer_names: list[str] = None,
    exclude_layer_names: list[str] = ['lm_head'],
    include_layer_keywords: list[str] = None,
    exclude_layer_keywords: list[str] = None,
    target_layer_types: tuple = (Linear,),
    hessian_dtype: dtype = torch.float32,
    module_to_name: dict = dict(),
    results: dict = dict(),
    flag_calibration: bool = True,
    flag_hessian: bool = True,
    flag_xtx: bool = False,
    blocksize: int = 128,
    percdamp: float = 0.01,
    wbits: int = 4,
    groupsize: int = -1,
    actorder: bool = False,
    mse: bool = False,
    sym: bool = True,
    q_grid: int = 600,
    q_norm: float = 2.4,
    mlp_wbits: Optional[int] = None,
    mlp_groupsize: Optional[int] = None,
    module_wbits: Optional[dict[str, int]] = None,
)

Bases: Quantizer

GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) quantizer.

Performs layer-wise weight quantization using second-order (Hessian) information to minimize the output reconstruction error. Supports grouped quantization, activation-order column reordering, and MSE-based grid search for optimal scale/zero-point parameters.

GPTQ requires calibration data and Hessian matrix computation.
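The core algorithm can be sketched as a column-by-column quantize-and-compensate loop. The NumPy snippet below is a deliberately simplified illustration, not the library's implementation: it uses one global asymmetric scale/zero-point, whereas the real quantizer supports per-channel/per-group parameters, MSE grid search, blocking, and activation ordering.

```python
import numpy as np

# Simplified sketch of the GPTQ column loop (single global scale/zero;
# the real quantizer uses per-channel or per-group parameters).
def gptq_quantize_rows(W, H, wbits=4, percdamp=0.01):
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    # Dampen the Hessian diagonal for numerical stability (percdamp).
    H = H + percdamp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper Cholesky factor of H^-1, as in the GPTQ paper.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T
    maxq = 2 ** wbits - 1
    scale = max((W.max() - W.min()) / maxq, 1e-8)
    zero = int(round(float(-W.min()) / scale))
    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale) + zero, 0, maxq)
        Q[:, j] = scale * (q - zero)
        # Propagate this column's quantization error into the
        # not-yet-quantized columns, weighted by the Hessian factor.
        err = (w - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

With a diagonal Hessian the compensation term vanishes and the loop degenerates to plain round-to-nearest; correlated inputs are what make the error propagation matter.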

Attributes:

flag_calibration (bool): Whether calibration data is needed (True for GPTQ).
flag_hessian (bool): Whether a Hessian matrix is needed (True for GPTQ).
blocksize (int): Number of columns quantized together in each block. Larger values may improve quality but increase memory usage. Default is 128.
percdamp (float): Fraction of the mean Hessian diagonal added to the diagonal for numerical stability (dampening). Default is 0.01.
wbits (int): Quantization bit width (1-8). Default is 4.
groupsize (int): Number of columns sharing the same scale/zero-point. -1 means per-channel (no grouping); otherwise must be in 1..blocksize. Default is -1.
actorder (bool): If True, reorder columns by decreasing activation magnitude before quantization (desc_act). Default is False.
mse (bool): If True, use an MSE-based grid search to find optimal scale and zero-point parameters. Default is False.
sym (bool): If True, use symmetric quantization (zero-point fixed at the midpoint); if False, use asymmetric quantization (zero-point computed from the data). Default is True.
q_grid (int): Number of grid points for the MSE-based scale search (used when mse=True). Default is 600.
q_norm (float): Norm exponent for the MSE grid-search error metric (used when mse=True). Default is 2.4.

Example

from onecomp.quantizer.gptq import GPTQ
quantizer = GPTQ(wbits=4, groupsize=128)
quantizer = GPTQ(wbits=4, groupsize=128, sym=False, actorder=True)

resolve_bits staticmethod

resolve_bits(layer_name: Optional[str], default_bits: int, mlp_bits: Optional[int] = None, module_bits: Optional[dict[str, int]] = None) -> int

Resolve bit-width from overrides (GPTQ semantics: module > mlp > default).

Used by the quantizer and by the config loader. If layer_name is None, returns default_bits. Does not validate the range; the caller may validate.
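The precedence (module > mlp > default) can be sketched as follows. This is an illustrative re-implementation, not the library's code, and the substring-based MLP detection is an assumption about how layers are classified:

```python
from typing import Optional

# Assumed keyword list for recognizing MLP layers (illustrative only).
MLP_KEYWORDS = ("mlp", "gate_proj", "up_proj", "down_proj")

def resolve_bits_sketch(layer_name: Optional[str], default_bits: int,
                        mlp_bits: Optional[int] = None,
                        module_bits: Optional[dict] = None) -> int:
    if layer_name is None:
        return default_bits
    if module_bits and layer_name in module_bits:
        return module_bits[layer_name]      # per-module override wins
    if mlp_bits is not None and any(k in layer_name for k in MLP_KEYWORDS):
        return mlp_bits                     # MLP-wide override
    return default_bits                     # fall back to wbits
```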

resolve_groupsize staticmethod

resolve_groupsize(layer_name: Optional[str], default_groupsize: int, mlp_groupsize: Optional[int] = None) -> int

Resolve group_size (mlp override > default).
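A minimal sketch of the same precedence for group size; as above, the keyword-based MLP check is an assumption, not the library's logic:

```python
from typing import Optional

# Illustrative only: "mlp" substring check is an assumed classification.
def resolve_groupsize_sketch(layer_name: Optional[str],
                             default_groupsize: int,
                             mlp_groupsize: Optional[int] = None) -> int:
    if layer_name is not None and mlp_groupsize is not None \
            and "mlp" in layer_name:
        return mlp_groupsize    # MLP override takes precedence
    return default_groupsize
```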

validate_params

validate_params()

Validate GPTQ parameters once in setup().

Validated ranges

blocksize: int >= 1
percdamp: float >= 3.95e-4
wbits: int, 1 <= wbits <= 63
groupsize: int, -1 or >= 1
q_grid: int >= 1 (when mse=True)
q_norm: float > 0 (when mse=True)

quantize_layer

quantize_layer(module, input, hessian=None)

Quantize the layer.

Parameters:

module (Module, required): The layer module.
input (tuple or Tensor, required): The input to the layer (input activations).
hessian (Tensor, default None): The Hessian matrix.

Returns:

GPTQResult: GPTQ quantization result object.

get_quant_config

get_quant_config() -> dict

Return the quantization_config dict for save_quantized_model (HF/vLLM-compatible keys).

Structure: all keys at top-level (quant_method, bits, group_size, actorder, sym, checkpoint_format, optional mlp_wbits / module_wbits).

When module_wbits is non-empty (mixed-bit model), quant_method is set to "mixed_gptq" so vLLM can dispatch per-module kernels via the mixed_gptq plugin. The quantization_bits list (indexed by transformer layer) is injected by finalize_quant_config_for_save after the quantized names are known.

checkpoint_format is always "gptq" (v1). OneComp GPTQLinear stores zero-points with the -1 offset convention (v1) unconditionally, so "gptq_v2" would cause an off-by-one mismatch when loaded by vLLM.
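Putting the keys above together, a uniform 4-bit config might look like the following. The values are an assumed example for illustration, not actual get_quant_config() output:

```python
# Illustrative quantization_config using only the keys named above;
# values are an assumed example, not real get_quant_config() output.
quant_config = {
    "quant_method": "gptq",        # "mixed_gptq" when module_wbits is non-empty
    "bits": 4,
    "group_size": 128,
    "actorder": False,
    "sym": True,
    "checkpoint_format": "gptq",   # always v1 (-1 zero-point offset convention)
}
```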

create_inference_layer

create_inference_layer(result, linear_module, **kwargs)

Build GPTQLinear from GPTQResult.

GPTQResult

GPTQResult dataclass

GPTQResult(
    dequantized_weight: Tensor = None,
    quantization_time: float = None,
    output_squared_error: float = None,
    mean_output_squared_error: float = None,
    weight_squared_error: float = None,
    mean_weight_squared_error: float = None,
    relative_output_squared_error: float = None,
    relative_weight_squared_error: float = None,
    wbits: int = None,
    groupsize: int = None,
    actorder: bool = None,
    sym: bool = None,
    qweight: Optional[Tensor] = None,
    scales: Optional[Tensor] = None,
    qzeros: Optional[Tensor] = None,
    perm: Optional[Tensor] = None,
)

Bases: QuantizationResult

GPTQ quantization result class.

Inherits from QuantizationResult and adds GPTQ-specific parameters.

Attributes:

dequantized_weight (Tensor): Dequantized weights (FP16, CPU); inherited from the parent class.
wbits (int): Quantization bit width.
groupsize (int): Group size (-1 means no grouping).
actorder (bool): Whether columns were reordered by activation order.
sym (bool): Whether symmetric quantization was used.
qweight (Optional[Tensor]): Quantized weights (INT type, CPU).
scales (Optional[Tensor]): Scale coefficients (FP16, CPU).
qzeros (Optional[Tensor]): Zero points (FP16, CPU).
perm (Optional[Tensor]): Column permutation order (used when actorder=True).

Note
  • g_idx (group index) is not stored since it can be computed from groupsize and perm:
      g_idx[perm[i]] = i // groupsize  (when actorder=True)
      g_idx[i] = i // groupsize        (when actorder=False)
  • invperm (inverse permutation) is not stored since it can be computed from perm:
      invperm = torch.argsort(perm)
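The two recovery formulas can be sketched in plain Python; torch.argsort(perm) is the tensor equivalent of the invperm loop below:

```python
# Plain-Python mirror of the recovery formulas above (the library
# operates on torch tensors; the logic is identical).
def recover_g_idx(perm, groupsize, actorder=True):
    n = len(perm)
    if not actorder:
        return [i // groupsize for i in range(n)]
    g_idx = [0] * n
    for i, p in enumerate(perm):
        g_idx[p] = i // groupsize   # column perm[i] was quantized i-th
    return g_idx

def recover_invperm(perm):
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i                  # same result as torch.argsort(perm)
    return inv
```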

compute_dequantized_weight

compute_dequantized_weight(device=None) -> torch.Tensor

Compute dequantized weight from quantized data and quantization parameters.

Parameters:

device (str or device, default None): Device to compute on.

Returns:

Tensor: Dequantized weight tensor (FP16, CPU).
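The group-wise dequantization this method performs can be sketched as below, assuming already-unpacked integer codes for a single output channel; bit-packing, the perm reordering, and the v1 -1 zero-point offset handling are omitted for clarity:

```python
# Minimal sketch of group-wise dequantization for one output channel.
# Assumes unpacked integer codes; packing, perm, and the v1 zero-point
# offset of the real implementation are omitted.
def dequantize_row(qrow, scales, zeros, groupsize):
    # qrow: integer codes, length = in_features
    # scales/zeros: one value per group for this channel
    if groupsize == -1:               # per-channel: one group spans the row
        groupsize = len(qrow)
    return [scales[j // groupsize] * (q - zeros[j // groupsize])
            for j, q in enumerate(qrow)]
```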