
Base Classes

Quantizer

Abstract base class for all quantizers. Defines the common interface and shared functionality.

Quantizer Feature Support

Runner.save_quantized_model(), Runner.create_quantized_model(), and quantized-model PPL/ACC evaluation internally call get_quant_config() and create_inference_layer() on the quantizer. These methods raise NotImplementedError by default and must be overridden by each quantizer to enable these features.

Quantizer          get_quant_config   create_inference_layer   Save Quantized   PPL/ACC
GPTQ               Yes                Yes                      Yes              Yes
DBF                Yes                Yes                      Yes              Yes
AutoBitQuantizer   Yes                Yes                      Yes              Yes
RTN                No                 No                       No               No (fallback)
JointQ             No                 No                       No               No (fallback)
QUIP               No                 No                       No               No (fallback)
CQ                 No                 No                       No               No (fallback)
ARB                No                 No                       No               No (fallback)
QBB                No                 No                       No               No (fallback)
Onebit             No                 No                       No               No (fallback)

For quantizers without support:

  • PPL/ACC evaluation: calculate_perplexity() / calculate_accuracy() with quantized_model=True automatically falls back to the dequantized (FP16) model. No error is raised.
  • Saving: use save_dequantized_model() (FP16) or save_quantization_results() to persist results.
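The support check boils down to the default-raise / override pattern: the base class raises NotImplementedError and supporting quantizers override. A minimal, dependency-free sketch (BaseQuantizer, GPTQLike, and supports_save are illustrative stand-ins, not the library's classes):

```python
# Sketch of the default-raise / override pattern described above.
# BaseQuantizer and GPTQLike are illustrative, not the library's classes.

class BaseQuantizer:
    def get_quant_config(self) -> dict:
        # Default: the feature is not supported.
        raise NotImplementedError(
            f"{type(self).__name__} does not support save_quantized_model"
        )

class GPTQLike(BaseQuantizer):
    def get_quant_config(self) -> dict:
        # Supporting quantizers return a config dict including quant_method.
        return {"quant_method": "gptq", "bits": 4}

def supports_save(quantizer: BaseQuantizer) -> bool:
    # Probe for support by catching the default NotImplementedError.
    try:
        quantizer.get_quant_config()
        return True
    except NotImplementedError:
        return False
```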

Quantizer dataclass

Quantizer(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = (Linear,), hessian_dtype: dtype = torch.float32, module_to_name: dict = dict(), results: dict = dict(), flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False)

Abstract base class for quantizers.

Attributes:

Name Type Description
num_layers int

The number of layers to quantize. If None, all layers are quantized.

calc_quant_error bool

If True, calculate quantization error.

exclude_layer_names list[str]

List of layer names to exclude from quantization (exact match).

include_layer_names list[str]

List of layer names to include for quantization (exact match). If None, all layers are candidates.

include_layer_keywords list[str]

List of keywords to match layer names for quantization. If any keyword is contained in the layer name, the layer is included. If None, all layers are candidates.

exclude_layer_keywords list[str]

List of keywords to exclude layer names from quantization. If any keyword is contained in the layer name, the layer is excluded.

target_layer_types tuple

Tuple of layer types to quantize. Default is (Linear,).

Layer Selection Priority
  1. target_layer_types: Filter by layer type
  2. include_layer_names: If specified, only include exact matches
  3. include_layer_keywords: If specified, only include layers containing any keyword
  4. exclude_layer_names: Exclude exact matches
  5. exclude_layer_keywords: Exclude layers containing any keyword
  6. num_layers: Limit the maximum number of layers
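The six filtering steps above can be sketched in plain Python (the helper name select_layers and the (name, module) input format are assumptions; the real implementation operates on torch modules):

```python
# Illustrative sketch of the six-step layer-selection order.
# select_layers is a hypothetical helper, not part of the library API.

def select_layers(named_modules, target_layer_types,
                  include_layer_names=None, include_layer_keywords=None,
                  exclude_layer_names=(), exclude_layer_keywords=(),
                  num_layers=None):
    selected = []
    for name, module in named_modules:
        if not isinstance(module, tuple(target_layer_types)):
            continue                                    # 1. filter by type
        if include_layer_names is not None and name not in include_layer_names:
            continue                                    # 2. exact-match include
        if include_layer_keywords is not None and not any(
                kw in name for kw in include_layer_keywords):
            continue                                    # 3. keyword include
        if name in exclude_layer_names:
            continue                                    # 4. exact-match exclude
        if any(kw in name for kw in exclude_layer_keywords):
            continue                                    # 5. keyword exclude
        selected.append(name)
    return selected[:num_layers]                        # 6. cap layer count
```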

To create a new Quantizer:
  • Inherit from this class.
  • Implement the quantize_layer method (required).
  • Set flag_calibration to True if calibration data is needed.
  • Set flag_hessian to True if the Hessian matrix is needed.

Examples:

Example 1: Exclude lm_head (default behavior)

quantizer = GPTQ(exclude_layer_names=["lm_head"])

Example 2: Quantize only specific layers

quantizer = GPTQ(include_layer_names=["model.layers.0.self_attn.q_proj"])

Example 3: Quantize layers containing specific keywords

quantizer = GPTQ(include_layer_keywords=["q_proj", "k_proj", "v_proj"])

Example 4: Exclude layers containing specific keywords

quantizer = GPTQ(exclude_layer_keywords=["down_proj", "gate_proj"])

__post_init__

__post_init__()

Dataclass post-initialization hook.

setup

setup(model)

Set up the quantizer with the model.

Parameters:

Name Type Description Default
model

The model to be quantized

required

quantize

quantize(module, input, output)

Quantize the layer

This method is invoked as a forward hook registered on the layer via register_forward_hook.

quantize_layer abstractmethod

quantize_layer(module, input=None, hessian=None) -> Union[torch.Tensor, QuantizationResult]

Quantize the layer

Parameters:

Name Type Description Default
module Module

The layer module

required
input tuple or Tensor

The input to the layer

None
hessian Tensor

The Hessian matrix

None

Returns:

Type Description
Union[Tensor, QuantizationResult]

Union[torch.Tensor, QuantizationResult]: Dequantized weight (torch.Tensor), or a QuantizationResult object.
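A minimal subclass can be sketched without the library: the symmetric round-to-nearest math below stands in for a real quantization method, plain Python lists stand in for torch tensors, and the base-class machinery (hooks, calibration, Hessians) is omitted. The input and hessian arguments are accepted but unused, matching the signature above.

```python
# Illustrative quantize_layer override: symmetric round-to-nearest on a
# row-major weight matrix, returning the dequantized weight.
# RoundToNearestQuantizer is a sketch, not one of the library's quantizers.

class RoundToNearestQuantizer:
    def __init__(self, n_bits: int = 8):
        self.n_bits = n_bits

    def quantize_layer(self, weight, input=None, hessian=None):
        qmax = 2 ** (self.n_bits - 1) - 1   # e.g. 127 for 8 bits
        out = []
        for row in weight:
            # Per-row scale from the largest magnitude; guard all-zero rows.
            scale = max(abs(w) for w in row) / qmax or 1.0
            # Quantize to integer codes, then dequantize back to floats.
            out.append([round(w / scale) * scale for w in row])
        return out
```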

quantize_with_qep

quantize_with_qep(module, quant_input_activation, original_input_activation=None, percdamp=0.01, perccorr=0.5, hessian=None, delta_hatX=None)

Quantize the layer with QEP

Parameters:

Name Type Description Default
module Module

The layer module

required
quant_input_activation Tensor

The input activations of the quantized layer

required
original_input_activation Tensor

The input activations of the original layer

None
hessian Tensor

The Hessian matrix

None
delta_hatX Tensor

The cross-term matrix

None

save_results

save_results(filepath)

Save the quantization results to a file.

Saves self.results (a dict mapping layer name -> QuantizationResult) to a file.

Parameters:

Name Type Description Default
filepath str

The path to save the results. The .pt extension is recommended.

required
Example

quantizer.save_results("quantization_results.pt")

load_results

load_results(filepath, weights_only=False)

Load the quantization results from a file into self.results.

Loads saved quantization results and stores them in self.results.

Parameters:

Name Type Description Default
filepath str

The path to load the results from.

required
weights_only bool

If True, only load tensor weights (safer but limited). Default is False to support loading QuantizationResult objects.

False

Returns:

Name Type Description
dict

A dict mapping layer name -> QuantizationResult (same reference as self.results).

Example

quantizer = JointQ()
quantizer.load_results("quantization_results.pt")
for layer_name, result in quantizer.results.items():
    print(f"{layer_name}: {result.dequantized_weight.shape}")

Note

Backward compatibility:
  • QuantizationResult subclasses (JointQResult, GPTQResult, etc.) require the same class definitions to be available when loading.
  • Loading files saved with older versions may fail if class definitions have changed.
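The save/load round trip can be sketched with the standard library. The real methods almost certainly serialize with torch.save/torch.load (hence the recommended .pt extension); pickle and the FakeResult stand-in are used here only to keep the sketch dependency-free, and they exhibit the same backward-compatibility caveat: the result class must be importable at load time.

```python
import pickle
from dataclasses import dataclass

# Illustrative stand-in for QuantizationResult; the real class holds tensors.
@dataclass
class FakeResult:
    quantization_time: float

def save_results(results: dict, filepath: str) -> None:
    # Persist the layer-name -> result mapping in one file.
    with open(filepath, "wb") as f:
        pickle.dump(results, f)

def load_results(filepath: str) -> dict:
    # Like weights_only=False, this trusts the class definitions in the file.
    with open(filepath, "rb") as f:
        return pickle.load(f)
```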

apply_results_to_model

apply_results_to_model(model, **kwargs)

Replace Linear layers in model with quantized inference layers from self.results.

Call load_results(filepath) before this, or ensure self.results is already populated. **kwargs are passed to create_inference_layer (e.g. pack_weights=True for GPTQ).

Parameters:

Name Type Description Default
model Module

The model to modify (in place). Typically the base model loaded with from_pretrained() so that state_dict keys match.

required

Returns:

Type Description

None (modifies model in place)

Example

quantizer.load_results("quantization_results.pt")
quantizer.apply_results_to_model(model, pack_weights=True)
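Replacing a layer by its dotted state_dict name reduces to attribute traversal over the module tree. A dependency-free sketch (set_module_by_name is an illustrative helper, not part of the API; integer path components index into list-like containers such as nn.ModuleList):

```python
# Walk a dotted name like "model.layers.0.q_proj" and swap the leaf module
# in place. set_module_by_name is a hypothetical helper for illustration.

def set_module_by_name(root, dotted_name: str, new_module) -> None:
    parts = dotted_name.split(".")
    parent = root
    for part in parts[:-1]:
        # Numeric components index containers; others are attributes.
        parent = parent[int(part)] if part.isdigit() else getattr(parent, part)
    leaf = parts[-1]
    if leaf.isdigit():
        parent[int(leaf)] = new_module
    else:
        setattr(parent, leaf, new_module)
```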

get_quant_config

get_quant_config() -> dict

Return quantization_config dict for saving (used by save_quantized_model).

Returns the content stored in model.config.quantization_config. Override this in quantizers that support save_quantized_model.

Returns:

Name Type Description
dict dict

Config dict including quant_method

Raises:

Type Description
NotImplementedError

If this quantizer does not support save_quantized_model

create_inference_layer

create_inference_layer(result: QuantizationResult, linear_module: Linear, **kwargs) -> Linear

Build an inference layer from one entry in quantizer.results (used by save_quantized_model).

Override in quantizers that support save_quantized_model; call from_quantization_result on the method's inference layer class and return it.

Parameters:

Name Type Description Default
result QuantizationResult

One entry from quantizer.results[name] (a QuantizationResult subclass)

required
linear_module Linear

Original Linear layer (used to get bias / device)

required
**kwargs

Method-specific options (e.g. pack_weights, use_gemlite)

{}

Returns:

Name Type Description
Linear Linear

Quantized inference layer (nn.Module)

Raises:

Type Description
NotImplementedError

If this quantizer does not support save_quantized_model

QuantizationResult

Base dataclass for quantization results returned by each layer.

QuantizationResult dataclass

QuantizationResult(dequantized_weight: Tensor = None, quantization_time: float = None, output_squared_error: float = None, mean_output_squared_error: float = None, weight_squared_error: float = None, mean_weight_squared_error: float = None, relative_output_squared_error: float = None, relative_weight_squared_error: float = None)

Base class for quantization results.

Each quantization method inherits from this class and adds method-specific parameters as fields.

Attributes:

Name Type Description
dequantized_weight Tensor

Dequantized weights (FP16). May be None when compute_dequantized_weight() is overridden by a subclass.

quantization_time float

Time taken for quantization (seconds).

output_squared_error float

Output squared error (when calc_quant_error=True).

mean_output_squared_error float

Output mean squared error (when calc_quant_error=True).

weight_squared_error float

Weight squared error (when calc_quant_error=True).

mean_weight_squared_error float

Weight mean squared error (when calc_quant_error=True).

relative_output_squared_error float

Output relative squared error ||WX^T - ŴX^T||²_F / ||WX^T||²_F (when calc_quant_error=True).

relative_weight_squared_error float

Weight relative squared error ||W - Ŵ||²_F / ||W||²_F (when calc_quant_error=True).
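As a concrete check, the weight-error formula ||W - Ŵ||²_F / ||W||²_F can be evaluated by hand. A pure-Python sketch over row-major matrices (the real fields are computed from torch tensors):

```python
# Relative weight squared error: ||W - What||_F^2 / ||W||_F^2.
# W and What are row-major nested lists standing in for weight tensors.

def relative_weight_squared_error(W, What):
    num = sum((w - wh) ** 2
              for row, rhat in zip(W, What)
              for w, wh in zip(row, rhat))
    den = sum(w ** 2 for row in W for w in row)
    return num / den
```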

compute_dequantized_weight

compute_dequantized_weight(device: device = None) -> torch.Tensor

Compute and return the dequantized weight.

Subclasses should override this to recompute dequantized weights from quantization parameters (scale, zero_point, assignment, etc.).

Parameters:

Name Type Description Default
device device

Device to perform computation on. If None, computation is performed on the device where quantization parameters reside.

None

Returns:

Type Description
Tensor

torch.Tensor: Dequantized weight tensor (FP16, CPU).
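A subclass override typically reconstructs Ŵ from the stored quantization parameters rather than storing the dense weight. A minimal affine-dequantization sketch (AffineResult is illustrative; its fields mirror the scale/zero_point parameters mentioned above, with plain Python lists in place of torch tensors):

```python
from dataclasses import dataclass

@dataclass
class AffineResult:
    # Illustrative QuantizationResult-like subclass: stores integer codes
    # plus scale/zero_point instead of the dequantized weight itself.
    q: list          # integer weight codes, row-major
    scale: float
    zero_point: int

    def compute_dequantized_weight(self):
        # W_hat = scale * (q - zero_point)
        return [[self.scale * (v - self.zero_point) for v in row]
                for row in self.q]
```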

ResultLoader

Loader for reading saved quantization results without performing quantization.

ResultLoader dataclass

ResultLoader(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = tuple(), hessian_dtype: dtype = torch.float32, module_to_name: dict = dict(), results: dict = dict(), flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False, results_file: str = None, weights_only: bool = False)

Bases: Quantizer

Loader for reading saved quantization results.

Does not perform any quantization (setup() selects 0 target layers). Primary use case is loading pre-saved results.

Example

from onecomp.quantizer import ResultLoader
loader = ResultLoader(results_file="quantization_results.pt")
loader.results.keys()

setup

setup(model)

Select no layers (no-op).

Ensures module_to_name is empty, since Runner is expected to call setup().

quantize_layer

quantize_layer(module, input=None, hessian=None) -> Union[torch.Tensor, QuantizationResult]

Raise error if called.

ResultLoader is not intended to perform quantization, so calling this method raises an error.