# GPTQ

## GPTQ Quantizer

### GPTQ `dataclass`

```python
GPTQ(
    name: str = None,
    num_layers: int = None,
    calc_quant_error: bool = False,
    include_layer_names: list[str] = None,
    exclude_layer_names: list[str] = (lambda: ['lm_head'])(),
    include_layer_keywords: list[str] = None,
    exclude_layer_keywords: list[str] = None,
    target_layer_types: tuple = (lambda: (Linear,))(),
    hessian_dtype: dtype = torch.float32,
    module_to_name: dict = dict(),
    results: dict = dict(),
    flag_calibration: bool = True,
    flag_hessian: bool = True,
    flag_xtx: bool = False,
    blocksize: int = 128,
    percdamp: float = 0.01,
    wbits: int = 4,
    groupsize: int = -1,
    actorder: bool = False,
    mse: bool = False,
    sym: bool = True,
    q_grid: int = 600,
    q_norm: float = 2.4,
    mlp_wbits: Optional[int] = None,
    mlp_groupsize: Optional[int] = None,
    module_wbits: Optional[dict[str, int]] = None,
)
```
Bases: Quantizer
GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) quantizer.
Performs layer-wise weight quantization using second-order (Hessian) information to minimize the output reconstruction error. Supports grouped quantization, activation-order column reordering, and MSE-based grid search for optimal scale/zero-point parameters.
GPTQ requires calibration data and Hessian matrix computation.
Attributes:

| Name | Type | Description |
|---|---|---|
| `flag_calibration` | `bool` | Whether calibration data is needed (True for GPTQ). |
| `flag_hessian` | `bool` | Whether the Hessian matrix is needed (True for GPTQ). |
| `blocksize` | `int` | Number of columns quantized together in each block. Larger values may improve quality but increase memory usage. Default is 128. |
| `percdamp` | `float` | Percentage of the average Hessian diagonal added for numerical stability (dampening). Default is 0.01. |
| `wbits` | `int` | Quantization bit width (1-8). Default is 4. |
| `groupsize` | `int` | Number of columns sharing the same scale/zero-point. -1 means per-channel (no grouping). Must be -1 or in 1..blocksize. Default is -1. |
| `actorder` | `bool` | If True, reorder columns by decreasing activation magnitude before quantization (desc_act). Default is False. |
| `mse` | `bool` | If True, use an MSE-based grid search to find optimal scale and zero-point parameters. Default is False. |
| `sym` | `bool` | If True, use symmetric quantization (zero-point fixed at the midpoint). If False, use asymmetric quantization (zero-point computed from the data). Default is True. |
| `q_grid` | `int` | Number of grid points for the MSE-based scale search (used when mse=True). Default is 600. |
| `q_norm` | `float` | Norm exponent for the MSE grid-search error metric (used when mse=True). Default is 2.4. |
Example:

```python
from onecomp.quantizer.gptq import GPTQ

quantizer = GPTQ(wbits=4, groupsize=128)
quantizer = GPTQ(wbits=4, groupsize=128, sym=False, actorder=True)
```
### resolve_bits `staticmethod`

```python
resolve_bits(layer_name: Optional[str], default_bits: int, mlp_bits: Optional[int] = None, module_bits: Optional[dict[str, int]] = None) -> int
```
Resolve the bit width from overrides (GPTQ semantics: module > mlp > default).

Used by the quantizer and by the config loader. If layer_name is None, returns default_bits. Does not validate the range; the caller may validate.
### resolve_groupsize `staticmethod`

```python
resolve_groupsize(layer_name: Optional[str], default_groupsize: int, mlp_groupsize: Optional[int] = None) -> int
```
Resolve group_size (mlp override > default).
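The precedence rules of both resolvers can be sketched as below. These are hypothetical re-implementations for illustration only; in particular, matching `module_bits` by exact layer name and detecting MLP layers via an `"mlp"` substring are assumptions not confirmed by the source.

```python
from typing import Optional


def resolve_bits_sketch(
    layer_name: Optional[str],
    default_bits: int,
    mlp_bits: Optional[int] = None,
    module_bits: Optional[dict] = None,
) -> int:
    # Documented precedence: module override > mlp override > default.
    if layer_name is None:
        return default_bits
    if module_bits and layer_name in module_bits:  # exact-name match (assumed)
        return module_bits[layer_name]
    if mlp_bits is not None and "mlp" in layer_name:  # substring match (assumed)
        return mlp_bits
    return default_bits


def resolve_groupsize_sketch(
    layer_name: Optional[str],
    default_groupsize: int,
    mlp_groupsize: Optional[int] = None,
) -> int:
    # Documented precedence: mlp override > default.
    if layer_name is not None and mlp_groupsize is not None and "mlp" in layer_name:
        return mlp_groupsize
    return default_groupsize
```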
### validate_params

Validate GPTQ parameters once in setup().

Validated ranges:

- `blocksize`: int >= 1
- `percdamp`: float >= 3.95e-4
- `wbits`: int, 1 <= wbits <= 63
- `groupsize`: int, -1 or >= 1
- `q_grid`: int >= 1 (when mse=True)
- `q_norm`: float > 0 (when mse=True)
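The range checks can be sketched as plain assertions. This is a hypothetical stand-in for the real `validate_params`, which may raise a different exception type:

```python
def validate_params_sketch(
    blocksize: int = 128,
    percdamp: float = 0.01,
    wbits: int = 4,
    groupsize: int = -1,
    mse: bool = False,
    q_grid: int = 600,
    q_norm: float = 2.4,
) -> None:
    # Documented ranges; the real method may raise ValueError instead.
    assert blocksize >= 1, "blocksize must be >= 1"
    assert percdamp >= 3.95e-4, "percdamp must be >= 3.95e-4"
    assert 1 <= wbits <= 63, "wbits must be in 1..63"
    assert groupsize == -1 or groupsize >= 1, "groupsize must be -1 or >= 1"
    if mse:
        assert q_grid >= 1, "q_grid must be >= 1 when mse=True"
        assert q_norm > 0, "q_norm must be > 0 when mse=True"
```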
### quantize_layer

Quantize the layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `module` | `Module` | The layer module. | required |
| `input` | `tuple or Tensor` | The input to the layer (input activations). | required |
| `hessian` | `Tensor` | The Hessian matrix. | `None` |
Returns:

| Type | Description |
|---|---|
| `GPTQResult` | GPTQ quantization result object. |
### get_quant_config

Return the quantization_config dict for save_quantized_model (HF/vLLM-compatible keys).

Structure: all keys at the top level (quant_method, bits, group_size, actorder, sym, checkpoint_format, optional mlp_wbits / module_wbits).

When module_wbits is non-empty (a mixed-bit model), quant_method is set to "mixed_gptq" so vLLM can dispatch per-module kernels via the mixed_gptq plugin. The quantization_bits list (indexed by transformer layer) is injected by finalize_quant_config_for_save after the quantized names are known.

checkpoint_format is always "gptq" (v1). OneComp GPTQLinear stores zero-points with the -1 offset convention (v1) unconditionally, so "gptq_v2" would cause an off-by-one mismatch when loaded by vLLM.
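As an illustration, the resulting dicts might look like the sketch below. Key names follow the structure described above; the concrete values are assumptions for a hypothetical 4-bit model:

```python
# Hypothetical quantization_config for a uniform 4-bit model
uniform_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "actorder": False,
    "sym": True,
    "checkpoint_format": "gptq",  # always v1, per the zero-point convention
}

# Hypothetical mixed-bit variant: module_wbits is non-empty,
# so quant_method switches to "mixed_gptq"
mixed_config = {
    **uniform_config,
    "quant_method": "mixed_gptq",
    "module_wbits": {"model.layers.0.mlp.up_proj": 8},
}
```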
### create_inference_layer

Build a GPTQLinear layer from a GPTQResult.
## GPTQResult

### GPTQResult `dataclass`

```python
GPTQResult(
    dequantized_weight: Tensor = None,
    quantization_time: float = None,
    output_squared_error: float = None,
    mean_output_squared_error: float = None,
    weight_squared_error: float = None,
    mean_weight_squared_error: float = None,
    relative_output_squared_error: float = None,
    relative_weight_squared_error: float = None,
    wbits: int = None,
    groupsize: int = None,
    actorder: bool = None,
    sym: bool = None,
    qweight: Optional[Tensor] = None,
    scales: Optional[Tensor] = None,
    qzeros: Optional[Tensor] = None,
    perm: Optional[Tensor] = None,
)
```
Bases: QuantizationResult
GPTQ quantization result class.
Inherits from QuantizationResult and adds GPTQ-specific parameters.
Attributes:

| Name | Type | Description |
|---|---|---|
| `dequantized_weight` | `Tensor` | Dequantized weights (FP16, CPU); inherited from the parent class. |
| `wbits` | `int` | Quantization bit width. |
| `groupsize` | `int` | Group size (-1 means no grouping). |
| `actorder` | `bool` | Whether columns were reordered by activation order. |
| `sym` | `bool` | Whether symmetric quantization was used. |
| `qweight` | `Optional[Tensor]` | Quantized weights (INT type, CPU). |
| `scales` | `Optional[Tensor]` | Scale coefficients (FP16, CPU). |
| `qzeros` | `Optional[Tensor]` | Zero points (FP16, CPU). |
| `perm` | `Optional[Tensor]` | Column permutation order (used when actorder=True). |
Note:

- g_idx (group index) is not stored since it can be computed from groupsize and perm:
    - `g_idx[perm[i]] = i // groupsize` (when actorder=True)
    - `g_idx[i] = i // groupsize` (when actorder=False)
- invperm (inverse permutation) is not stored since it can be computed from perm: `invperm = torch.argsort(perm)`
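The two reconstructions above can be sketched in plain Python (lists instead of tensors for brevity; the real code would operate on torch tensors):

```python
def compute_g_idx(n_cols: int, groupsize: int, perm=None) -> list:
    # actorder=False: column i belongs to group i // groupsize
    base = [i // groupsize for i in range(n_cols)]
    if perm is None:
        return base
    # actorder=True: g_idx[perm[i]] = i // groupsize
    g_idx = [0] * n_cols
    for i, p in enumerate(perm):
        g_idx[p] = base[i]
    return g_idx


def compute_invperm(perm) -> list:
    # List equivalent of invperm = torch.argsort(perm) for a permutation
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv
```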
### compute_dequantized_weight

Compute the dequantized weight from the quantized data and quantization parameters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `device` | `str or device` | Device to compute on. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tensor` | Dequantized weight tensor (FP16, CPU). |
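A per-row dequantization sketch under assumed shapes (one scale/zero per group of consecutive input columns). The real `compute_dequantized_weight` also handles the actorder permutation and the packed v1 zero-point layout, both omitted here:

```python
def dequantize_row(qrow, scales, zeros, groupsize):
    # qrow: quantized integer values for one output row
    # scales/zeros: one entry per group of `groupsize` consecutive columns
    # Basic affine dequantization: w = scale * (q - zero)
    return [
        scales[j // groupsize] * (q - zeros[j // groupsize])
        for j, q in enumerate(qrow)
    ]
```

For example, with two groups of two columns, `dequantize_row([0, 3, 1, 2], [0.5, 1.0], [2, 1], 2)` applies scale 0.5 / zero 2 to the first pair and scale 1.0 / zero 1 to the second.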