# RTN

## RTN Quantizer

### `RTN` (dataclass)

`RTN(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = ['per_layer_model_projection'], target_layer_types: tuple = (Linear,), hessian_dtype: dtype = torch.float32, module_to_name: dict = {}, results: dict = {}, flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False, wbits: int = 4, groupsize: int = -1, sym: bool = False, mse: bool = False, norm: float = 2.4, grid: int = 100)`
Bases: Quantizer
RTN (Round-To-Nearest) quantizer.
RTN is the simplest quantization method that rounds weights to the nearest quantization level. It does not require calibration data or Hessian matrices, performing quantization using only weight statistics.
Quantization steps:

- Compute the minimum and maximum values of the weights.
- Compute the scale and zero point.
- Round each weight to the nearest quantization level (round-to-nearest).

It is the fastest method, but may yield lower accuracy than calibration-based methods.
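The steps above can be sketched in plain PyTorch. This is an illustrative asymmetric (`sym=False`) per-tensor example, not the library's implementation; `rtn_round` is a hypothetical helper name:

```python
import torch

def rtn_round(w: torch.Tensor, wbits: int = 4):
    """Minimal asymmetric round-to-nearest sketch (not the library code)."""
    qmax = 2 ** wbits - 1                          # number of levels minus one
    wmin, wmax = w.min(), w.max()
    scale = (wmax - wmin).clamp(min=1e-8) / qmax   # step size between levels
    zero = torch.round(-wmin / scale)              # integer zero point
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    w_hat = (q - zero) * scale                     # dequantized approximation
    return q, scale, zero, w_hat

w = torch.tensor([0.1, -0.5, 0.3, 0.9])
q, scale, zero, w_hat = rtn_round(w, wbits=4)
```

Every reconstructed value `w_hat` lies within half a quantization step of the original weight, which is the defining property of round-to-nearest.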
Attributes:

| Name | Type | Description |
|---|---|---|
| `flag_calibration` | `bool` | Whether to use calibration data (`False` for RTN). |
| `flag_hessian` | `bool` | Whether to use a Hessian matrix (`False` for RTN). |
| `wbits` | `int` | Number of quantization bits. Default is 4. |
| `groupsize` | `int` | Group size; an independent scale and zero point are computed for each group. `-1` means no grouping (a single scale and zero point for the entire row). Default is -1. |
| `sym` | `bool` | Whether to use symmetric quantization. If `True`, the zero point is placed at the center. Default is False. |
| `mse` | `bool` | Enable MSE grid search for the optimal clipping range. Default is False. |
| `norm` | `float` | Lp-norm exponent for the MSE search. Default is 2.4. |
| `grid` | `int` | Number of candidate shrink levels for the MSE search. Default is 100. |
Methods:

| Name | Description |
|---|---|
| `quantize_layer` | Quantize a layer using RTN. |
### `validate_params`

Validate RTN parameters once in `setup()`.

Validated ranges:

- `wbits`: int, 1 <= wbits <= 64
- `groupsize`: int, -1 or >= 1
- `sym`: bool (no constraint)
- `grid`: int >= 1 (when `mse=True`)
- `norm`: float > 0 (when `mse=True`)
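A standalone sketch of these checks might look as follows. Only the documented ranges come from the source; the function body and error messages are assumptions, not the library's actual code:

```python
def validate_params(wbits, groupsize, sym, mse, grid, norm):
    """Illustrative validation of RTN parameters (hypothetical sketch)."""
    if not (isinstance(wbits, int) and 1 <= wbits <= 64):
        raise ValueError(f"wbits must be an int in [1, 64], got {wbits!r}")
    if not (isinstance(groupsize, int) and (groupsize == -1 or groupsize >= 1)):
        raise ValueError(f"groupsize must be -1 or >= 1, got {groupsize!r}")
    if not isinstance(sym, bool):
        raise ValueError(f"sym must be a bool, got {sym!r}")
    if mse:
        # grid and norm are only constrained when the MSE search is enabled.
        if not (isinstance(grid, int) and grid >= 1):
            raise ValueError(f"grid must be an int >= 1 when mse=True, got {grid!r}")
        if not (isinstance(norm, (int, float)) and norm > 0):
            raise ValueError(f"norm must be > 0 when mse=True, got {norm!r}")

# Defaults from the signature above pass; an out-of-range wbits raises.
validate_params(4, -1, False, False, 100, 2.4)
```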
### `quantize_layer`

Quantize a layer using RTN.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `module` | `Module` | The layer module to quantize. | *required* |
| `input` | `tuple` or `Tensor` | Input tensor (not used in RTN). | `None` |
| `hessian` | `Tensor` | Hessian matrix (not used in RTN). | `None` |
Returns:

| Type | Description |
|---|---|
| `RTNResult` | RTN quantization result object containing the quantized weights and parameters. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `groupsize` does not evenly divide `in_features`. |
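The group-wise behavior and the `ValueError` condition can be illustrated with a self-contained sketch. This is a hypothetical helper, not the library's `quantize_layer`; it quantizes a weight matrix with one scale/zero pair per group of `groupsize` input features:

```python
import torch

def rtn_quantize_grouped(W: torch.Tensor, wbits: int = 4, groupsize: int = -1):
    """Asymmetric RTN with per-group scale and zero point (illustrative)."""
    out_features, in_features = W.shape
    g = in_features if groupsize == -1 else groupsize
    if in_features % g != 0:
        # Mirrors the documented failure mode of quantize_layer.
        raise ValueError(f"groupsize {g} does not divide in_features {in_features}")
    qmax = 2 ** wbits - 1
    # Split each row into groups along the input dimension.
    Wg = W.reshape(out_features, in_features // g, g)
    wmin = Wg.min(dim=-1, keepdim=True).values
    wmax = Wg.max(dim=-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax   # one scale per group
    zero = torch.round(-wmin / scale)              # one zero point per group
    Q = torch.clamp(torch.round(Wg / scale) + zero, 0, qmax)
    W_hat = ((Q - zero) * scale).reshape(out_features, in_features)
    return Q, scale, zero, W_hat

torch.manual_seed(0)
W = torch.randn(8, 16)
Q, scale, zero, W_hat = rtn_quantize_grouped(W, wbits=4, groupsize=8)
```

With `groupsize=-1` the whole row forms a single group, matching the documented default.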
## RTNResult

### `RTNResult` (dataclass)

`RTNResult(dequantized_weight: Tensor = None, quantization_time: float = None, output_squared_error: float = None, mean_output_squared_error: float = None, weight_squared_error: float = None, mean_weight_squared_error: float = None, relative_output_squared_error: float = None, relative_weight_squared_error: float = None, wbits: int = None, groupsize: int = None, sym: bool = None, quantized_weight: Optional[Tensor] = None, scale: Optional[Tensor] = None, zero: Optional[Tensor] = None)`
Bases: QuantizationResult
Result class for RTN quantization.
Inherits from QuantizationResult and adds RTN-specific parameters.
Attributes:

| Name | Type | Description |
|---|---|---|
| `dequantized_weight` | `Tensor` | Dequantized weights (FP16, CPU); inherited from the parent class. |
| `wbits` | `int` | Number of quantization bits used. |
| `groupsize` | `int` | Group size used (`-1` means no grouping). |
| `sym` | `bool` | Whether symmetric quantization was used. |
| `quantized_weight` | `Tensor` | Quantized weights (INT type, CPU). |
| `scale` | `Tensor` | Scale factors (FP16, CPU). |
| `zero` | `Tensor` | Zero points (FP16, CPU). |
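The relationship between `quantized_weight`, `scale`, `zero`, and `dequantized_weight` can be shown with a minimal stand-in dataclass. This is illustrative only; the real `RTNResult` lives in the library and carries the additional error-metric fields listed in its signature:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class RTNResultSketch:
    """Stand-in mirroring a subset of the documented RTNResult fields."""
    quantized_weight: Optional[torch.Tensor] = None
    scale: Optional[torch.Tensor] = None
    zero: Optional[torch.Tensor] = None
    dequantized_weight: Optional[torch.Tensor] = None

res = RTNResultSketch(
    quantized_weight=torch.tensor([[6.0, 0.0, 8.0, 15.0]]),
    scale=torch.tensor(0.0933),
    zero=torch.tensor(5.0),
)
# The stored integer codes, scale, and zero point reconstruct the weights:
# w_hat = (q - zero) * scale
res.dequantized_weight = (res.quantized_weight - res.zero) * res.scale
```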