

JointQ Quantizer

JointQ dataclass

JointQ(
    name: str = None,
    num_layers: int = None,
    calc_quant_error: bool = False,
    include_layer_names: list[str] = None,
    exclude_layer_names: list[str] = ['lm_head'],
    include_layer_keywords: list[str] = None,
    exclude_layer_keywords: list[str] = ['per_layer_model_projection'],
    target_layer_types: tuple = (Linear,),
    hessian_dtype: dtype = torch.float64,
    module_to_name: dict = dict(),
    results: dict = dict(),
    flag_calibration: bool = True,
    flag_hessian: bool = False,
    flag_xtx: bool = True,
    bits: int = 4,
    symmetric: bool = False,
    group_size: Optional[int] = 128,
    log_level: int = 0,
    device: Optional[device] = None,
    regularization_lambda: Optional[float] = 0.1,
    regularization_mode: str = 'diagonal',
    regularization_gamma: float = 0.5,
    lambda_mode: str = 'fixed_lambda',
    lambda_list: Optional[List[float]] = None,
    incremental_eps_y: float = 0.03,
    incremental_eps_w: float = 0.1,
    incremental_initial_skip_ew_threshold: Optional[float] = 0.3,
    actorder: bool = False,
    ils_enabled: bool = False,
    ils_num_iterations: int = 10,
    ils_num_clones: int = 8,
    ils_num_channels: Optional[int] = None,
    enable_clip_optimize: bool = True,
    enable_clip_optimize_ep: bool = False,
    enable_gptq: bool = True,
    gptq: Optional[GPTQ] = None,
)

Bases: Quantizer

JointQ quantizer class.

JointQ is a post-training quantization method that combines multiple initialization strategies (Clip-Optimize, Clip-Optimize-EP, GPTQ) with local search optimization to find high-quality quantized weights.
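
At a high level, each enabled initializer produces a candidate quantized weight matrix, the candidate with the lowest layer output error is kept, and local search then refines it. A minimal sketch of that idea with hypothetical helper names (the real pipeline is quantize_layer below; the tr((W - Q) X^T X (W - Q)^T) proxy objective is an assumption based on the GPTQ-style setup, not stated explicitly here)::

import torch

def jointq_sketch(W, matrix_XX, initializers, local_search):
    """Pick the best initial quantized weight, then refine it by local search.

    `initializers` and `local_search` are hypothetical callables standing in
    for Clip-Optimize / Clip-Optimize-EP / GPTQ and the ILS refinement.
    """
    def output_error(Q):
        D = (W - Q).to(matrix_XX.dtype)
        # tr((W - Q) X^T X (W - Q)^T), the layer-wise proxy loss
        return torch.einsum("ij,jk,ik->", D, matrix_XX, D).item()

    candidates = [init(W, matrix_XX) for init in initializers]
    best = min(candidates, key=output_error)   # lowest proxy loss wins
    return local_search(best, W, matrix_XX)    # ILS-style refinement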

Attributes:

Name Type Description
bits int

Number of bits for quantization. Default is 4.

symmetric bool

Whether to use symmetric quantization. Default is False.

group_size int or None

Group size for quantization. Default is 128. If None, per-channel quantization is used (group_size = in_features).

log_level int

Log level (0: none, 1: minimal, 2: detailed). Default is 0.

device device or None

Device for quantization. If None, uses the device of the module being quantized.

regularization_lambda float or None

Tikhonov regularization strength. Default is 0.1. Replaces X^T X with X^T X + n * lambda * R, where R depends on regularization_mode. lambda is relative to the normalized Hessian (1/n) X^T X, so its meaning is consistent across different calibration sample sizes. Recommended range: 0.1 to 1.0. Set to None or 0.0 to disable. Used only in fixed_lambda mode.

regularization_mode str

Shape of the regularization matrix R. Default is "diagonal". "identity": R = I (standard Tikhonov). "diagonal": R = diag(a) where a_i = (diag(X^T X)_i / mean(diag(X^T X)))^gamma, which makes regularization importance-aware: columns with larger activations receive stronger regularization. Only supported with lambda_mode="fixed_lambda". A minimal sketch of this construction follows the attribute list below.

regularization_gamma float

Exponent for the diagonal weights in "diagonal" mode. Default is 0.5. Smaller values reduce the spread between weak and strong columns.

lambda_mode str

Regularization mode. Default is "fixed_lambda". "fixed_lambda": Use a single fixed regularization_lambda for all layers (existing behavior). "incremental_lambda": For each layer, try increasing lambda values from lambda_list and accept the solution as long as it improves weight error without substantially degrading output error.

lambda_list list of float or None

Ascending list of lambda values to try in incremental_lambda mode. Ignored in fixed_lambda mode. Default is [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5].

incremental_eps_y float

Maximum tolerated relative output-error increase when accepting a candidate in incremental_lambda mode. Default is 0.03 (3%).

incremental_eps_w float

Minimum required relative weight-error decrease to accept a candidate whose output error worsened in incremental_lambda mode. Default is 0.10 (10%).

incremental_initial_skip_ew_threshold float or None

If the first incremental candidate uses lambda=0.0 and its relative weight error exceeds this threshold, skip that candidate and try the next lambda instead of accepting it as the initial solution. This guard is only relevant when lambda_list starts with 0.0. Default is 0.3 (30%). Set to None to disable this guard.

actorder bool

Whether to reorder columns by activation magnitude (Hessian diagonal) before quantization. Default is False. When enabled, columns with larger activations are grouped together, improving group quantization efficiency and GPTQ initial solution quality.

ils_enabled bool

Whether to enable Iterated Local Search. Default is False.

ils_num_iterations int

Number of ILS iterations. Default is 10.

ils_num_clones int

Number of clones per row in ILS. Default is 8.

ils_num_channels int or None

Number of rows targeted per ILS iteration. When None, automatically set to min(dim_p, 1024). Default is None.

enable_clip_optimize bool

Whether to use Clip-Optimize initialization. Default is True.

enable_clip_optimize_ep bool

Whether to use Clip-Optimize with Error Propagation initialization. Default is False.

enable_gptq bool

Whether to use GPTQ initialization. Default is True.

gptq GPTQ or None

GPTQ instance for initial solution generation. If None, a default GPTQ is created from bits/group_size/symmetric. Pass a custom GPTQ instance to control parameters like blocksize, percdamp, mse, q_grid, q_norm. The GPTQ instance must have wbits/groupsize/sym matching JointQ's bits/group_size/symmetric, and actorder must be False.
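
A minimal sketch of how the regularized Hessian described by regularization_lambda, regularization_mode, and regularization_gamma could be formed (illustrative only; assumes matrix_XX already holds the accumulated X^T X and n is the number of calibration samples)::

import torch

def regularize_xtx(matrix_XX, n, lam, mode="diagonal", gamma=0.5):
    """Illustrative Tikhonov regularization: X^T X + n * lam * R."""
    if lam is None or lam == 0.0:
        return matrix_XX
    if mode == "identity":
        # R = I: standard Tikhonov
        R = torch.eye(matrix_XX.shape[0], dtype=matrix_XX.dtype,
                      device=matrix_XX.device)
    else:
        # "diagonal": a_i = (diag(X^T X)_i / mean(diag(X^T X))) ** gamma
        d = torch.diagonal(matrix_XX)
        R = torch.diag((d / d.mean()).pow(gamma))
    # lam is defined relative to the normalized Hessian (1/n) X^T X,
    # hence the explicit factor n here.
    return matrix_XX + n * lam * R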

Example

Basic usage::

from onecomp.quantizer.jointq import JointQ

quantizer = JointQ(
    bits=4,
    symmetric=False,
    group_size=128,
)

With all initialization strategies enabled::

quantizer = JointQ(
    bits=4,
    symmetric=False,
    group_size=128,
    enable_clip_optimize=True,
    enable_clip_optimize_ep=True,
    enable_gptq=True,
)

With custom GPTQ parameters::

from onecomp.quantizer.gptq import GPTQ

quantizer = JointQ(
    bits=4,
    symmetric=False,
    group_size=128,
    gptq=GPTQ(
        wbits=4, groupsize=128, sym=False, mse=True
    ),
)

With incremental lambda mode::

quantizer = JointQ(
    bits=4,
    symmetric=False,
    group_size=128,
    lambda_mode="incremental_lambda",
)

validate_params

validate_params()

Validate JointQ and GPTQ parameters.

Called once during setup(). Validates:

JointQ parameters

bits: int >= 1
group_size: int >= 1 or None
log_level: int in {0, 1, 2}
ils_num_iterations: int >= 1 (when ils_enabled)
ils_num_clones: int >= 1 (when ils_enabled)
ils_num_channels: int >= 1 or None (when ils_enabled)

GPTQ consistency

gptq.wbits == bits
gptq.groupsize == group_size (or -1 when group_size is None)
gptq.sym == symmetric
gptq.actorder == False

Also delegates to self.gptq.validate_params() for GPTQ's own parameter validation (blocksize, percdamp, etc.).
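
A sketch of the GPTQ-consistency portion of these checks (illustrative; the actual error handling and messages may differ)::

def check_gptq_consistency(jointq, gptq):
    """Illustrative version of the JointQ/GPTQ consistency checks."""
    expected_groupsize = -1 if jointq.group_size is None else jointq.group_size
    assert gptq.wbits == jointq.bits
    assert gptq.groupsize == expected_groupsize
    assert gptq.sym == jointq.symmetric
    assert gptq.actorder is False
    gptq.validate_params()  # GPTQ checks its own blocksize, percdamp, etc.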

quantize_layer

quantize_layer(module, input=None, hessian=None, matrix_XX=None, dim_n=None)

Quantize a single layer.

Processing flow
  1. Extract weight matrix from module
  2. Prepare matrix_XX (= X^T X) from input or use precomputed
  3. Apply activation ordering (actorder) if enabled
  4. Generate GPTQ initial solution (if enable_gptq=True), using the pre-regularization hessian
  5. Convert GPTQ result to JointQ Solution format
  6. Prepare ILS parameters
  7. Apply Tikhonov regularization to matrix_XX
  8. Run JointQ quantization with initial solutions
  9. Return quantization result
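
Step 3 (activation ordering) can be sketched as follows: when actorder is enabled, columns are sorted by the Hessian diagonal and the permutation is kept so it can be reported in the result (illustrative, not the exact implementation)::

import torch

def activation_order(W, matrix_XX):
    """Sort weight columns by descending Hessian diagonal (activation magnitude)."""
    perm = torch.argsort(torch.diagonal(matrix_XX), descending=True)
    W_perm = W[:, perm]                 # reorder weight columns
    XX_perm = matrix_XX[perm][:, perm]  # reorder rows and columns of X^T X
    return W_perm, XX_perm, perm        # perm is returned as part of JointQResult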

When lambda_mode="incremental_lambda", steps 7-8 are replaced by an iterative loop that tries each value in lambda_list and keeps the solution as long as it improves weight error without substantially degrading output error.
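
One reading of the acceptance rule in that loop, expressed as a sketch (eps_y and eps_w correspond to incremental_eps_y and incremental_eps_w; the exact implementation may differ)::

def accept_candidate(err_y_new, err_y_cur, err_w_new, err_w_cur,
                     eps_y=0.03, eps_w=0.10):
    """Illustrative acceptance rule for incremental_lambda mode."""
    rel_y_increase = (err_y_new - err_y_cur) / max(err_y_cur, 1e-12)
    rel_w_decrease = (err_w_cur - err_w_new) / max(err_w_cur, 1e-12)
    if rel_y_increase > eps_y:
        return False                    # output error degraded too much
    if rel_y_increase > 0.0:
        return rel_w_decrease >= eps_w  # worse output must buy a large weight-error gain
    return rel_w_decrease > 0.0         # otherwise accept any weight-error improvement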

Parameters:

Name Type Description Default
module Module

The layer module to quantize.

required
input tuple or Tensor

Input activations. Used to compute matrix_XX when matrix_XX is not provided.

None
hessian Tensor

Not used in JointQ (ignored).

None
matrix_XX Tensor

Precomputed X^T X (FP64). If provided, this is used instead of input.

None
dim_n int

Number of samples. Required when matrix_XX is provided.

None

Returns:

Name Type Description
JointQResult

Quantization result containing scale, zero_point, assignment, and perm (column permutation when actorder is used).
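
A usage sketch with a precomputed X^T X (layer sizes and calibration data are illustrative; depending on the surrounding pipeline, setup() and calibration may be required before calling quantize_layer directly)::

import torch
from torch.nn import Linear

from onecomp.quantizer.jointq import JointQ

layer = Linear(1024, 1024)
X = torch.randn(512, 1024)               # calibration activations (n x in_features)
matrix_XX = (X.T @ X).to(torch.float64)  # precomputed X^T X in FP64

quantizer = JointQ(bits=4, symmetric=False, group_size=128, actorder=True)
result = quantizer.quantize_layer(layer, matrix_XX=matrix_XX, dim_n=X.shape[0])
# result.scale, result.zero_point, result.assignment hold the quantization parameters;
# result.perm holds the column permutation because actorder=True.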

execute_post_processing

execute_post_processing()

Log accepted_lambda statistics after all layers are quantized.