Base Classes¶
Quantizer¶
Abstract base class for all quantizers. Defines the common interface and shared functionality.
Quantizer Feature Support¶
Runner.save_quantized_model(), Runner.create_quantized_model(), and quantized-model
PPL/ACC evaluation internally call get_quant_config() and create_inference_layer() on
the quantizer. These methods raise NotImplementedError by default and must be overridden
by each quantizer to enable these features.
| Quantizer | get_quant_config | create_inference_layer | Save | Quantized PPL/ACC |
|---|---|---|---|---|
| GPTQ | Yes | Yes | Yes | Yes |
| DBF | Yes | Yes | Yes | Yes |
| AutoBitQuantizer | Yes | Yes | Yes | Yes |
| RTN | — | — | No | No (fallback) |
| JointQ | — | — | No | No (fallback) |
| QUIP | — | — | No | No (fallback) |
| CQ | — | — | No | No (fallback) |
| ARB | — | — | No | No (fallback) |
| QBB | — | — | No | No (fallback) |
| Onebit | — | — | No | No (fallback) |
For quantizers without support:

- PPL/ACC evaluation: calculate_perplexity()/calculate_accuracy() with quantized_model=True automatically fall back to the dequantized (FP16) model. No error is raised.
- Saving: use save_dequantized_model() (FP16) or save_quantization_results() to persist results.
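The fallback behavior described above follows a common pattern: the base class raises NotImplementedError, and the caller catches it and switches to the FP16 path. The sketch below illustrates that pattern in plain Python; the class bodies, the config content, and the evaluate() helper are made up for illustration and are not the library's actual implementation.

```python
# Illustrative sketch of the NotImplementedError-based fallback pattern
# (not the library's actual code).

class Quantizer:
    def get_quant_config(self):
        # Default: the feature is not supported by this quantizer.
        raise NotImplementedError

class GPTQ(Quantizer):
    def get_quant_config(self):
        return {"quant_method": "gptq"}  # hypothetical content

class RTN(Quantizer):
    pass  # inherits the default, i.e. no support

def evaluate(quantizer, quantized_model=True):
    """Return which model variant would be evaluated (illustration only)."""
    if quantized_model:
        try:
            quantizer.get_quant_config()
            return "quantized"
        except NotImplementedError:
            return "fp16-fallback"  # silent fallback; no error surfaces
    return "fp16"
```

With this shape, evaluate(GPTQ()) takes the quantized path while evaluate(RTN()) silently falls back, matching the "No (fallback)" entries in the table.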
Quantizer
dataclass
¶
Quantizer(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = (Linear,), hessian_dtype: dtype = torch.float32, module_to_name: dict = {}, results: dict = {}, flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False)
Abstract base class for quantizers
Attributes:

| Name | Type | Description |
|---|---|---|
| num_layers | int | The number of layers to quantize. If None, all layers are quantized. |
| calc_quant_error | bool | If True, calculate the quantization error. |
| exclude_layer_names | list[str] | Layer names to exclude from quantization (exact match). |
| include_layer_names | list[str] | Layer names to include for quantization (exact match). If None, all layers are candidates. |
| include_layer_keywords | list[str] | Keywords matched against layer names; a layer is included if any keyword is contained in its name. If None, all layers are candidates. |
| exclude_layer_keywords | list[str] | Keywords matched against layer names; a layer is excluded if any keyword is contained in its name. |
| target_layer_types | tuple | Layer types to quantize. Default is (Linear,). |
Layer Selection Priority

1. target_layer_types: filter by layer type
2. include_layer_names: if specified, keep only exact matches
3. include_layer_keywords: if specified, keep only layers containing any keyword
4. exclude_layer_names: drop exact matches
5. exclude_layer_keywords: drop layers containing any keyword
6. num_layers: limit the maximum number of layers
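The selection order above can be sketched as a pure-Python filter over (name, layer_type) pairs. The real Quantizer walks actual nn.Modules; the function, the layer names, and the type strings below are stand-ins for illustration only.

```python
# Minimal sketch of the layer-selection priority (illustration, not library code).

def select_layers(layers, target_layer_types=("Linear",),
                  include_layer_names=None, include_layer_keywords=None,
                  exclude_layer_names=None, exclude_layer_keywords=None,
                  num_layers=None):
    names = [n for n, t in layers if t in target_layer_types]        # 1. type filter
    if include_layer_names is not None:                              # 2. exact include
        names = [n for n in names if n in include_layer_names]
    if include_layer_keywords is not None:                           # 3. keyword include
        names = [n for n in names if any(k in n for k in include_layer_keywords)]
    if exclude_layer_names:                                          # 4. exact exclude
        names = [n for n in names if n not in exclude_layer_names]
    if exclude_layer_keywords:                                       # 5. keyword exclude
        names = [n for n in names if not any(k in n for k in exclude_layer_keywords)]
    if num_layers is not None:                                       # 6. cap layer count
        names = names[:num_layers]
    return names

layers = [
    ("model.layers.0.self_attn.q_proj", "Linear"),
    ("model.layers.0.mlp.down_proj", "Linear"),
    ("model.layers.0.input_layernorm", "LayerNorm"),
    ("lm_head", "Linear"),
]
```

Note that the include filters narrow the candidate set before the exclude filters remove from it, and num_layers caps the final count last.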
To create a new Quantizer:

- Inherit from this class.
- Implement quantize_layer (required).
- Set flag_calibration to True if calibration data is needed.
- Set flag_hessian to True if the Hessian matrix is needed.
Examples:

Example 1: Exclude lm_head (default behavior)¶

```python
quantizer = GPTQ(exclude_layer_names=["lm_head"])
```

Example 2: Quantize only specific layers¶

```python
quantizer = GPTQ(include_layer_names=["model.layers.0.self_attn.q_proj"])
```

Example 3: Quantize layers containing specific keywords¶

```python
quantizer = GPTQ(include_layer_keywords=["q_proj", "k_proj", "v_proj"])
```

Example 4: Exclude layers containing specific keywords¶

```python
quantizer = GPTQ(exclude_layer_keywords=["down_proj", "gate_proj"])
```
setup ¶

Set up the quantizer with the model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model to be quantized | required |
quantize ¶

Quantize the layer.

This method is registered on the layer via register_forward_hook and is invoked during the layer's forward pass.
quantize_layer
abstractmethod
¶

Quantize the layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| module | Module | The layer module | required |
| input | tuple or Tensor | The input to the layer | None |
| hessian | Tensor | The Hessian matrix | None |

Returns:

| Type | Description |
|---|---|
| Union[Tensor, QuantizationResult] | The dequantized weight (torch.Tensor), or a QuantizationResult object. |
quantize_with_qep ¶

quantize_with_qep(module, quant_input_activation, original_input_activation=None, percdamp=0.01, perccorr=0.5, hessian=None, delta_hatX=None)

Quantize the layer with QEP.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| module | Module | The layer module | required |
| quant_input_activation | Tensor | The input activations of the quantized layer | required |
| original_input_activation | Tensor | The input activations of the original layer | None |
| hessian | Tensor | The Hessian matrix | None |
| delta_hatX | Tensor | The cross-term matrix | None |
save_results ¶

Save the quantization results to a file.

Saves self.results (a dict mapping layer name -> QuantizationResult) to a file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | The path to save the results. The .pt extension is recommended. | required |

Example

```python
quantizer.save_results("quantization_results.pt")
```
load_results ¶

Load saved quantization results from a file and store them in self.results.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | The path to load the results from. | required |
| weights_only | bool | If True, only load tensor weights (safer but limited). Default is False to support loading QuantizationResult objects. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | | A dict mapping layer name -> QuantizationResult (the same reference as self.results). |

Example

```python
quantizer = JointQ()
quantizer.load_results("quantization_results.pt")
for layer_name, result in quantizer.results.items():
    print(f"{layer_name}: {result.dequantized_weight.shape}")
```
Note

Backward compatibility:

- QuantizationResult subclasses (JointQResult, GPTQResult, etc.) require the same class definitions to be available when loading.
- Loading files saved with older versions may fail if class definitions have changed.
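The reason the class definitions must be available at load time is generic to Python serialization: a pickled object stores only a module-qualified class name plus its state, and unpickling re-imports the class by that name. The sketch below demonstrates this with the standard pickle module and a made-up DemoResult class; it is not onecomp's loader.

```python
# Generic pickle sketch: why the result class must be importable when loading.
import pickle
from dataclasses import dataclass

@dataclass
class DemoResult:          # stands in for a QuantizationResult subclass
    scale: float

buf = pickle.dumps(DemoResult(scale=0.5))  # stores "DemoResult" by name + state
restored = pickle.loads(buf)               # works only because DemoResult is defined here
```

If DemoResult were renamed, moved, or its fields changed between saving and loading, pickle.loads would fail or produce an inconsistent object, which is exactly the versioning caveat noted above.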
apply_results_to_model ¶

Replace Linear layers in model with quantized inference layers from self.results.

Call load_results(filepath) before this, or ensure self.results is already populated. **kwargs are passed to create_inference_layer (e.g. pack_weights=True for GPTQ).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | The model to modify (in place). Typically the base model loaded with from_pretrained() so that state_dict keys match. | required |

Returns:

| Type | Description |
|---|---|
| | None (modifies model in place) |

Example

```python
quantizer.load_results("quantization_results.pt")
quantizer.apply_results_to_model(model, pack_weights=True)
```
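The in-place replacement performed here boils down to walking a dotted layer name (e.g. "block.q_proj") down to its parent object and swapping the leaf attribute. The sketch below shows that mechanism with plain stand-in classes; the class names and the replace_module helper are illustrative assumptions, not the library's types or API.

```python
# Sketch of dotted-name module replacement (illustration, not library code).

class Linear:                      # stand-in for nn.Linear
    pass

class QuantLinear:                 # stand-in for a quantized inference layer
    pass

class Block:
    def __init__(self):
        self.q_proj = Linear()

class Model:
    def __init__(self):
        self.block = Block()

def replace_module(model, dotted_name, new_module):
    *parents, leaf = dotted_name.split(".")
    obj = model
    for p in parents:
        obj = getattr(obj, p)      # descend to the parent container
    setattr(obj, leaf, new_module) # swap the leaf module in place

model = Model()
replace_module(model, "block.q_proj", QuantLinear())
```

Because the swap happens by attribute path, the layer names in self.results must match the model's module names, which is why the docs recommend the base model from from_pretrained().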
get_quant_config ¶

Return the quantization_config dict for saving (used by save_quantized_model).

Returns the content stored in model.config.quantization_config. Override this in quantizers that support save_quantized_model.

Returns:

| Name | Type | Description |
|---|---|---|
| dict | dict | Config dict including |

Raises:

| Type | Description |
|---|---|
| NotImplementedError | If this quantizer does not support save_quantized_model |
create_inference_layer ¶

Build an inference layer from one entry in quantizer.results (used by save_quantized_model).

Override in quantizers that support save_quantized_model; call from_quantization_result on the method's inference layer class and return it.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | QuantizationResult | One entry from quantizer.results[name] (a QuantizationResult subclass) | required |
| linear_module | Linear | The original Linear layer (used to get bias / device) | required |
| **kwargs | | Method-specific options (e.g. pack_weights, use_gemlite) | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| Linear | Linear | Quantized inference layer (nn.Module) |

Raises:

| Type | Description |
|---|---|
| NotImplementedError | If this quantizer does not support save_quantized_model |
QuantizationResult¶
Base dataclass for quantization results returned by each layer.
QuantizationResult
dataclass
¶
QuantizationResult(dequantized_weight: Tensor = None, quantization_time: float = None, output_squared_error: float = None, mean_output_squared_error: float = None, weight_squared_error: float = None, mean_weight_squared_error: float = None, relative_output_squared_error: float = None, relative_weight_squared_error: float = None)
Base class for quantization results.
Each quantization method inherits from this class and adds method-specific parameters as fields.
Attributes:

| Name | Type | Description |
|---|---|---|
| dequantized_weight | Tensor | Dequantized weights (FP16). None when compute_dequantized_weight() is overridden by a subclass. |
| quantization_time | float | Time taken for quantization (seconds). |
| output_squared_error | float | Output squared error (when calc_quant_error=True). |
| mean_output_squared_error | float | Output mean squared error (when calc_quant_error=True). |
| weight_squared_error | float | Weight squared error (when calc_quant_error=True). |
| mean_weight_squared_error | float | Weight mean squared error (when calc_quant_error=True). |
| relative_output_squared_error | float | Output relative squared error ||WX^T - ŴX^T||²_F / ||WX^T||²_F (when calc_quant_error=True). |
| relative_weight_squared_error | float | Weight relative squared error ||W - Ŵ||²_F / ||W||²_F (when calc_quant_error=True). |
compute_dequantized_weight ¶

Compute and return the dequantized weight.

Subclasses should override this to recompute dequantized weights from quantization parameters (scale, zero_point, assignment, etc.).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | device | Device to perform the computation on. If None, computation is performed on the device where the quantization parameters reside. | None |

Returns:

| Type | Description |
|---|---|
| Tensor | torch.Tensor: Dequantized weight tensor (FP16, CPU). |
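As a concrete illustration of recomputing weights from quantization parameters, the sketch below applies the standard affine dequantization formula W_hat = scale * (q - zero_point) to one row of integer codes. This is a generic assumption about the kind of computation a subclass might perform, not onecomp's actual per-method formula, and it uses plain lists rather than tensors.

```python
# Generic affine dequantization sketch: W_hat = scale * (q - zero_point).

def dequantize_row(q_row, scale, zero_point):
    return [scale * (q - zero_point) for q in q_row]

row = dequantize_row([0, 1, 2, 3], scale=0.5, zero_point=2)  # [-1.0, -0.5, 0.0, 0.5]
```

Storing only the integer codes plus (scale, zero_point) and recomputing on demand is what lets subclasses leave dequantized_weight as None.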
ResultLoader¶
Loader for reading saved quantization results without performing quantization.
ResultLoader
dataclass
¶
ResultLoader(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = (), hessian_dtype: dtype = torch.float32, module_to_name: dict = {}, results: dict = {}, flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False, results_file: str = None, weights_only: bool = False)
Bases: Quantizer
Loader for reading saved quantization results.

Does not perform any quantization (setup() selects 0 target layers). The primary use case is loading pre-saved results.
Example

```python
from onecomp.quantizer import ResultLoader

loader = ResultLoader(results_file="quantization_results.pt")
loader.results.keys()
```