Base Classes¶
Quantizer¶
Abstract base class for all quantizers. Defines the common interface and shared functionality.
Quantizer Feature Support¶
Runner.save_quantized_model(), Runner.create_quantized_model(), and quantized-model
PPL/ACC evaluation internally call get_quant_config() and create_inference_layer() on
the quantizer. These methods raise NotImplementedError by default and must be overridden
by each quantizer to enable these features.
| Quantizer | get_quant_config | create_inference_layer | Save | Quantized PPL/ACC |
|---|---|---|---|---|
| GPTQ | Yes | Yes | Yes | Yes |
| DBF | Yes | Yes | Yes | Yes |
| AutoBitQuantizer | Yes | Yes | Yes | Yes |
| RTN | — | — | No | No (fallback) |
| JointQ | — | — | No | No (fallback) |
| QUIP | — | — | No | No (fallback) |
| CQ | — | — | No | No (fallback) |
| ARB | — | — | No | No (fallback) |
| QBB | — | — | No | No (fallback) |
| Onebit | — | — | No | No (fallback) |
For quantizers without support:

- PPL/ACC evaluation: calculate_perplexity()/calculate_accuracy() with quantized_model=True automatically fall back to the dequantized (FP16) model. No error is raised.
- Saving: use save_dequantized_model() (FP16) or save_quantization_results() to persist results.
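The fallback behavior described above follows a common pattern: the base class raises NotImplementedError, and the caller catches it and switches to the FP16 path. The sketch below illustrates that pattern in plain Python; the class bodies, the config content, and the evaluate() helper are made up for illustration and are not the library's actual implementation.

```python
# Illustrative sketch of the NotImplementedError-based fallback pattern
# (not the library's actual code).

class Quantizer:
    def get_quant_config(self):
        # Default: the feature is not supported by this quantizer.
        raise NotImplementedError

class GPTQ(Quantizer):
    def get_quant_config(self):
        return {"quant_method": "gptq"}  # hypothetical content

class RTN(Quantizer):
    pass  # inherits the default, i.e. no support

def evaluate(quantizer, quantized_model=True):
    """Return which model variant would be evaluated (illustration only)."""
    if quantized_model:
        try:
            quantizer.get_quant_config()
            return "quantized"
        except NotImplementedError:
            return "fp16-fallback"  # silent fallback; no error surfaces
    return "fp16"
```

With this shape, evaluate(GPTQ()) takes the quantized path while evaluate(RTN()) silently falls back, matching the "No (fallback)" entries in the table.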
Quantizer
dataclass
¶
Quantizer(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = (Linear,), hessian_dtype: dtype = torch.float32, module_to_name: dict = {}, results: dict = {}, flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False)
Abstract base class for quantizers
Attributes:

| Name | Type | Description |
|---|---|---|
| num_layers | int | The number of layers to quantize. If None, all layers are quantized. |
| calc_quant_error | bool | If True, calculate the quantization error. |
| exclude_layer_names | list[str] | Layer names to exclude from quantization (exact match). |
| include_layer_names | list[str] | Layer names to include for quantization (exact match). If None, all layers are candidates. |
| include_layer_keywords | list[str] | Keywords matched against layer names; a layer is included if any keyword is contained in its name. If None, all layers are candidates. |
| exclude_layer_keywords | list[str] | Keywords matched against layer names; a layer is excluded if any keyword is contained in its name. |
| target_layer_types | tuple | Layer types to quantize. Default is (Linear,). |
Layer Selection Priority

1. target_layer_types: filter by layer type
2. include_layer_names: if specified, keep only exact matches
3. include_layer_keywords: if specified, keep only layers containing any keyword
4. exclude_layer_names: drop exact matches
5. exclude_layer_keywords: drop layers containing any keyword
6. num_layers: limit the maximum number of layers
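The selection order above can be sketched as a pure-Python filter over (name, layer_type) pairs. The real Quantizer walks actual nn.Modules; the function, the layer names, and the type strings below are stand-ins for illustration only.

```python
# Minimal sketch of the layer-selection priority (illustration, not library code).

def select_layers(layers, target_layer_types=("Linear",),
                  include_layer_names=None, include_layer_keywords=None,
                  exclude_layer_names=None, exclude_layer_keywords=None,
                  num_layers=None):
    names = [n for n, t in layers if t in target_layer_types]        # 1. type filter
    if include_layer_names is not None:                              # 2. exact include
        names = [n for n in names if n in include_layer_names]
    if include_layer_keywords is not None:                           # 3. keyword include
        names = [n for n in names if any(k in n for k in include_layer_keywords)]
    if exclude_layer_names:                                          # 4. exact exclude
        names = [n for n in names if n not in exclude_layer_names]
    if exclude_layer_keywords:                                       # 5. keyword exclude
        names = [n for n in names if not any(k in n for k in exclude_layer_keywords)]
    if num_layers is not None:                                       # 6. cap layer count
        names = names[:num_layers]
    return names

layers = [
    ("model.layers.0.self_attn.q_proj", "Linear"),
    ("model.layers.0.mlp.down_proj", "Linear"),
    ("model.layers.0.input_layernorm", "LayerNorm"),
    ("lm_head", "Linear"),
]
```

Note that the include filters narrow the candidate set before the exclude filters remove from it, and num_layers caps the final count last.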
To create a new Quantizer:

- Inherit from this class.
- Implement quantize_layer (required).
- Set flag_calibration to True if calibration data is needed.
- Set flag_hessian to True if the Hessian matrix is needed.
Examples:

Example 1: Exclude lm_head (default behavior)¶

```python
quantizer = GPTQ(exclude_layer_names=["lm_head"])
```

Example 2: Quantize only specific layers¶

```python
quantizer = GPTQ(include_layer_names=["model.layers.0.self_attn.q_proj"])
```

Example 3: Quantize layers containing specific keywords¶

```python
quantizer = GPTQ(include_layer_keywords=["q_proj", "k_proj", "v_proj"])
```

Example 4: Exclude layers containing specific keywords¶

```python
quantizer = GPTQ(exclude_layer_keywords=["down_proj", "gate_proj"])
```
setup ¶

Set up the quantizer with the model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model to be quantized | required |
quantize ¶

Quantize the layer.

This method is registered on the layer via register_forward_hook and is invoked during the layer's forward pass.
quantize_layer
abstractmethod
¶

Quantize the layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| module | Module | The layer module | required |
| input | tuple or Tensor | The input to the layer | None |
| hessian | Tensor | The Hessian matrix | None |

Returns:

| Type | Description |
|---|---|
| Union[Tensor, QuantizationResult] | The dequantized weight (torch.Tensor), or a QuantizationResult object. |
quantize_with_qep ¶

quantize_with_qep(module, quant_input_activation, original_input_activation=None, percdamp=0.01, perccorr=0.5, hessian=None, delta_hatX=None)

Quantize the layer with QEP.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| module | Module | The layer module | required |
| quant_input_activation | Tensor | The input activations of the quantized layer | required |
| original_input_activation | Tensor | The input activations of the original layer | None |
| hessian | Tensor | The Hessian matrix | None |
| delta_hatX | Tensor | The cross-term matrix | None |
save_results ¶

Save the quantization results to a file.

Saves self.results (a dict mapping layer name -> QuantizationResult) to a file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | The path to save the results. The .pt extension is recommended. | required |

Example

```python
quantizer.save_results("quantization_results.pt")
```
load_results ¶

Load saved quantization results from a file and store them in self.results.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | The path to load the results from. | required |
| weights_only | bool | If True, only load tensor weights (safer but limited). Default is False to support loading QuantizationResult objects. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | | A dict mapping layer name -> QuantizationResult (the same reference as self.results). |

Example

```python
quantizer = JointQ()
quantizer.load_results("quantization_results.pt")
for layer_name, result in quantizer.results.items():
    print(f"{layer_name}: {result.dequantized_weight.shape}")
```
Note

Backward compatibility:

- QuantizationResult subclasses (JointQResult, GPTQResult, etc.) require the same class definitions to be available when loading.
- Loading files saved with older versions may fail if class definitions have changed.
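The reason the class definitions must be available at load time is generic to Python serialization: a pickled object stores only a module-qualified class name plus its state, and unpickling re-imports the class by that name. The sketch below demonstrates this with the standard pickle module and a made-up DemoResult class; it is not onecomp's loader.

```python
# Generic pickle sketch: why the result class must be importable when loading.
import pickle
from dataclasses import dataclass

@dataclass
class DemoResult:          # stands in for a QuantizationResult subclass
    scale: float

buf = pickle.dumps(DemoResult(scale=0.5))  # stores "DemoResult" by name + state
restored = pickle.loads(buf)               # works only because DemoResult is defined here
```

If DemoResult were renamed, moved, or its fields changed between saving and loading, pickle.loads would fail or produce an inconsistent object, which is exactly the versioning caveat noted above.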
apply_results_to_model ¶

Replace Linear layers in model with quantized inference layers from self.results.

Call load_results(filepath) before this, or ensure self.results is already populated. **kwargs are passed to create_inference_layer (e.g. pack_weights=True for GPTQ).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Module | The model to modify (in place). Typically the base model loaded with from_pretrained() so that state_dict keys match. | required |

Returns:

| Type | Description |
|---|---|
| | None (modifies model in place) |

Example

```python
quantizer.load_results("quantization_results.pt")
quantizer.apply_results_to_model(model, pack_weights=True)
```
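The in-place replacement performed here boils down to walking a dotted layer name (e.g. "block.q_proj") down to its parent object and swapping the leaf attribute. The sketch below shows that mechanism with plain stand-in classes; the class names and the replace_module helper are illustrative assumptions, not the library's types or API.

```python
# Sketch of dotted-name module replacement (illustration, not library code).

class Linear:                      # stand-in for nn.Linear
    pass

class QuantLinear:                 # stand-in for a quantized inference layer
    pass

class Block:
    def __init__(self):
        self.q_proj = Linear()

class Model:
    def __init__(self):
        self.block = Block()

def replace_module(model, dotted_name, new_module):
    *parents, leaf = dotted_name.split(".")
    obj = model
    for p in parents:
        obj = getattr(obj, p)      # descend to the parent container
    setattr(obj, leaf, new_module) # swap the leaf module in place

model = Model()
replace_module(model, "block.q_proj", QuantLinear())
```

Because the swap happens by attribute path, the layer names in self.results must match the model's module names, which is why the docs recommend the base model from from_pretrained().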
get_quant_config ¶

Return the quantization_config dict for saving (used by save_quantized_model).

Returns the content stored in model.config.quantization_config. Override this in quantizers that support save_quantized_model.

Returns:

| Name | Type | Description |
|---|---|---|
| dict | dict | Config dict including |

Raises:

| Type | Description |
|---|---|
| NotImplementedError | If this quantizer does not support save_quantized_model |
create_inference_layer ¶

Build an inference layer from one entry in quantizer.results (used by save_quantized_model).

Override in quantizers that support save_quantized_model; call from_quantization_result on the method's inference layer class and return it.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | QuantizationResult | One entry from quantizer.results[name] (a QuantizationResult subclass) | required |
| linear_module | Linear | The original Linear layer (used to get bias / device) | required |
| **kwargs | | Method-specific options (e.g. pack_weights, use_gemlite) | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| Linear | Linear | Quantized inference layer (nn.Module) |

Raises:

| Type | Description |
|---|---|
| NotImplementedError | If this quantizer does not support save_quantized_model |
QuantizationResult¶
Base dataclass for quantization results returned by each layer.
QuantizationResult
dataclass
¶
QuantizationResult(dequantized_weight: Tensor = None, quantization_time: float = None, output_squared_error: float = None, mean_output_squared_error: float = None, weight_squared_error: float = None, mean_weight_squared_error: float = None, relative_output_squared_error: float = None, relative_weight_squared_error: float = None)
Base class for quantization results.
Each quantization method inherits from this class and adds method-specific parameters as fields.
Attributes:

| Name | Type | Description |
|---|---|---|
| dequantized_weight | Tensor | Dequantized weights (FP16). None when compute_dequantized_weight() is overridden by a subclass. |
| quantization_time | float | Time taken for quantization (seconds). |
| output_squared_error | float | Output squared error (when calc_quant_error=True). |
| mean_output_squared_error | float | Output mean squared error (when calc_quant_error=True). |
| weight_squared_error | float | Weight squared error (when calc_quant_error=True). |
| mean_weight_squared_error | float | Weight mean squared error (when calc_quant_error=True). |
| relative_output_squared_error | float | Output relative squared error ||WX^T - ŴX^T||²_F / ||WX^T||²_F (when calc_quant_error=True). |
| relative_weight_squared_error | float | Weight relative squared error ||W - Ŵ||²_F / ||W||²_F (when calc_quant_error=True). |
compute_dequantized_weight ¶

Compute and return the dequantized weight.

Subclasses should override this to recompute dequantized weights from quantization parameters (scale, zero_point, assignment, etc.).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | device | Device to perform the computation on. If None, computation is performed on the device where the quantization parameters reside. | None |

Returns:

| Type | Description |
|---|---|
| Tensor | torch.Tensor: Dequantized weight tensor (FP16, CPU). |
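As a concrete illustration of recomputing weights from quantization parameters, the sketch below applies the standard affine dequantization formula W_hat = scale * (q - zero_point) to one row of integer codes. This is a generic assumption about the kind of computation a subclass might perform, not onecomp's actual per-method formula, and it uses plain lists rather than tensors.

```python
# Generic affine dequantization sketch: W_hat = scale * (q - zero_point).

def dequantize_row(q_row, scale, zero_point):
    return [scale * (q - zero_point) for q in q_row]

row = dequantize_row([0, 1, 2, 3], scale=0.5, zero_point=2)  # [-1.0, -0.5, 0.0, 0.5]
```

Storing only the integer codes plus (scale, zero_point) and recomputing on demand is what lets subclasses leave dequantized_weight as None.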
ResultLoader¶
Loader for reading saved quantization results without performing quantization.
ResultLoader
dataclass
¶
ResultLoader(name: str = None, num_layers: int = None, calc_quant_error: bool = False, include_layer_names: list[str] = None, exclude_layer_names: list[str] = ['lm_head'], include_layer_keywords: list[str] = None, exclude_layer_keywords: list[str] = None, target_layer_types: tuple = (), hessian_dtype: dtype = torch.float32, module_to_name: dict = {}, results: dict = {}, flag_calibration: bool = False, flag_hessian: bool = False, flag_xtx: bool = False, results_file: str = None, weights_only: bool = False)
Bases: Quantizer
Loader for reading saved quantization results.

Does not perform any quantization (setup() selects 0 target layers). The primary use case is loading pre-saved results.
Example

```python
from onecomp.quantizer import ResultLoader

loader = ResultLoader(results_file="quantization_results.pt")
loader.results.keys()
```