Runner¶
The Runner class is the main entry point for executing quantization pipelines in OneComp.
Runner ¶
Runner(model_config=None, calibration_dataset=None, max_length=2048, num_calibration_samples=512, quantizer=None, quantizers=None, qep=False, qep_config=None, calibration_strategy='drop_rand', calibration_seed=0, multi_gpu=False, gpu_ids=None, calibration_batch_size=None, num_layers_per_group=7, post_processes=None)
Runner class for executing quantization. Supports quantization using calibration data and layer-wise parallel quantization across multiple GPUs.
Examples:
Single GPU quantization (default):
>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(model_id_or_path="meta-llama/Llama-2-7b-hf")
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... )
>>> runner.run()
Multi-GPU quantization (layer-wise parallel):
>>> from onecomp.quantizer.jointq import JointQ
>>> quantizer = JointQ(bits=4, group_size=128)
>>> # Use all available GPUs
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... multi_gpu=True,
... )
>>> runner.run()
>>> # Use specific GPUs (e.g., GPU 0, 2, 3)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... multi_gpu=True,
... gpu_ids=[0, 2, 3],
... )
>>> runner.run()
`__init__` method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `ModelConfig` | Model configuration. Required. | `None` |
| `calibration_dataset` | `Dataset` | Calibration dataset. If `None`, a default calibration dataset is loaded (see `num_calibration_samples`). | `None` |
| `max_length` | `int` | The maximum length of the input sequence. | `2048` |
| `num_calibration_samples` | `int` | The number of calibration samples to use when loading the default dataset. | `512` |
| `quantizer` | `Quantizer` | The quantizer to use. Specify either `quantizer` or `quantizers`, not both. | `None` |
| `quantizers` | `list[Quantizer]` | Specify multiple quantizers. When used with `calibration_batch_size`, the X^T X accumulation is shared, reducing the forward pass to a single execution. Specify either `quantizer` or `quantizers`, not both. | `None` |
| `qep` | `bool` | Whether to use QEP. | `False` |
| `qep_config` | `QEPConfig` or `None` | Configuration for QEP. If `None` and `qep=True`, a default configuration is used. | `None` |
| `calibration_strategy` | `str` | Strategy for preparing calibration inputs. Available strategies: `"concat_chunk"`: concatenate all texts, tokenize once, and split into fixed-length chunks of `max_length`, creating as many chunks as possible from the data. `"concat_chunk_align"`: same as `"concat_chunk"`, but adjusts the number of loaded samples so that `num_chunks == num_calibration_samples`, ensuring consistent token counts across experiments. `"drop_head"`: no cross-document mixing; tokenize each document independently, drop samples with token length < `max_length`, and take the head window (first `max_length` tokens). `"drop_rand"`: same as `"drop_head"`, but take a random window of length `max_length` from each long document (reproducible with `calibration_seed`). | `'drop_rand'` |
| `calibration_seed` | `int` | Random seed used by some calibration strategies (e.g., `"drop_rand"`). | `0` |
| `multi_gpu` | `bool` | Whether to use multi-GPU for layer-wise parallel quantization. | `False` |
| `gpu_ids` | `list[int]` | List of GPU IDs to use for multi-GPU quantization. If `None` and `multi_gpu` is `True`, all available GPUs are used. | `None` |
| `calibration_batch_size` | `int` or `None` | Batch size (number of sentences) for chunked calibration forward passes. If `None`, all calibration data is forwarded in a single pass. When set to a positive integer (e.g., 128), calibration data is split into chunks of this size and forwarded in multiple passes to reduce GPU memory usage. The necessary statistics (e.g., X^T X for Hessian-based methods) are accumulated across chunks, so the result is mathematically exact, not an approximation. | `None` |
| `num_layers_per_group` | `int` | Number of layers to process simultaneously in chunked calibration mode. The default of 7 covers one Transformer block for Llama-like architectures (q, k, v, o, gate, up, down). Controls the trade-off between CPU memory usage for X^T X storage and the number of forward passes required. Only used when `calibration_batch_size` is set. | `7` |
| `post_processes` | `list[PostQuantizationProcess]` or `None` | Optional list of post-quantization processes to execute after the main quantization step. Each process receives a quantized model on CPU (built via `create_quantized_model`). | `None` |
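To make the difference between the calibration strategies concrete, here is a small pure-Python sketch of the two basic behaviors. The function names and token-ID lists are illustrative only and are not part of the OneComp API:

```python
def concat_chunk(docs: list[list[int]], max_length: int) -> list[list[int]]:
    """'concat_chunk': concatenate all token streams, then split into
    fixed-length chunks, creating as many full chunks as possible."""
    stream = [tok for doc in docs for tok in doc]
    n_chunks = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]

def drop_head(docs: list[list[int]], max_length: int) -> list[list[int]]:
    """'drop_head': no cross-document mixing; drop documents shorter than
    max_length tokens and keep the first max_length tokens of the rest."""
    return [doc[:max_length] for doc in docs if len(doc) >= max_length]

# Three "documents" of 10, 3, and 7 tokens
docs = [list(range(10)), list(range(3)), list(range(7))]
print(len(concat_chunk(docs, 4)))  # 20 tokens total -> 5 chunks of 4
print(len(drop_head(docs, 4)))     # the 10- and 7-token docs survive -> 2
```

`"drop_rand"` differs from `drop_head` only in taking a random (seeded) window instead of the head window.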
Note
For zero-config quantization (VRAM auto-estimation + AutoBitQuantizer + QEP), use the class method `auto_run` instead.
Examples:
Chunked calibration with GPTQ (large-scale calibration data):
>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(
... model_id_or_path="meta-llama/Llama-2-7b-hf"
... )
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128, # Forward 128 sentences at a time
... )
>>> runner.run()
With custom num_layers_per_group:
>>> # When memory is sufficient: process 2 blocks (14 layers) simultaneously
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128,
... num_layers_per_group=14,
... )
>>> runner.run()
Multiple quantizers (benchmark comparison):
>>> import torch
>>> from onecomp.quantizer.gptq import GPTQ
>>> from onecomp.quantizer.jointq import JointQ
>>> gptq = GPTQ(wbits=4, groupsize=128, calc_quant_error=True)
>>> jointq = JointQ(bits=4, group_size=128, calc_quant_error=True,
...                 device=torch.device(0))
>>> runner = Runner(
... model_config=model_config,
... quantizers=[gptq, jointq],
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128,
... )
>>> runner.run()
>>> # Results are stored in gptq.results and jointq.results respectively
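The claim that chunked calibration is mathematically exact follows from the fact that X^T X decomposes into a sum over row blocks: stacking chunks X_1, ..., X_k gives X^T X = X_1^T X_1 + ... + X_k^T X_k. A tiny pure-Python check of this identity (illustrative only, not OneComp code):

```python
def matmul_t(a):
    """Compute a^T a for a list-of-lists matrix (rows = samples)."""
    cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(cols)]
            for i in range(cols)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# "Calibration data": 4 samples (rows) with 3 features each
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
full = matmul_t(X)                                    # X^T X in one pass
chunked = mat_add(matmul_t(X[:2]), matmul_t(X[2:]))   # accumulate per chunk
assert full == chunked  # exact, not an approximation
print(full)
```

This is why splitting the forward pass into `calibration_batch_size`-sized chunks changes memory usage but not the accumulated statistics.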
auto_run classmethod ¶
auto_run(model_id: str, wbits: Optional[float] = None, total_vram_gb: Optional[float] = None, groupsize: int = 128, device: str = 'cuda:0', qep: bool = True, evaluate: bool = True, eval_original_model: bool = False, save_dir: str = 'auto', **kwargs)
One-liner quantization with sensible defaults.
Sets up ModelConfig, AutoBitQuantizer (ILP-based mixed-precision),
and QEP, then runs quantization. When wbits is None,
the target bitwidth is estimated automatically from available VRAM.
Optionally evaluates perplexity and accuracy, and saves the
quantized model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Hugging Face model ID or local path. | required |
| `wbits` | `float` or `None` | Target quantization bitwidth. When `None`, the target bitwidth is estimated automatically from available VRAM. | `None` |
| `total_vram_gb` | `float` or `None` | Total VRAM budget in GB for bitwidth estimation. Only used when `wbits` is `None`. | `None` |
| `groupsize` | `int` | GPTQ group size. Use -1 to disable grouping. | `128` |
| `device` | `str` | Device to place the model on. | `'cuda:0'` |
| `qep` | `bool` | Whether to use QEP. | `True` |
| `evaluate` | `bool` | Whether to calculate perplexity and accuracy after quantization. | `True` |
| `eval_original_model` | `bool` | Whether to also evaluate the original (unquantized) model. | `False` |
| `save_dir` | `str` or `None` | Directory to save the quantized model. `'auto'` generates a directory name automatically; `None` skips saving. | `'auto'` |
| `**kwargs` | | Additional keyword arguments forwarded to the `Runner` constructor. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Runner` | The configured `Runner` instance, with quantization results accessible via `quantizer.results`. |
Examples:
Minimal usage (QEP + GPTQ 4-bit, groupsize=128, auto-save):
>>> from onecomp import Runner
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
... )
Custom save directory:
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
... save_dir="./my_quantized_model",
... )
Skip saving:
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
... save_dir=None,
... )
Evaluate both original and quantized models:
>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
...     eval_original_model=True,
... )
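The document does not specify how `auto_run` maps a VRAM budget to a bitwidth, only that one is estimated from the other. As a rough mental model (the formula, overhead constant, and clamping range below are assumptions for illustration, not OneComp's actual estimator), a model with `n` parameters stored at `b` bits plus some fixed overhead must fit in the budget:

```python
def estimate_wbits(num_params: float, total_vram_gb: float,
                   overhead_gb: float = 2.0) -> float:
    """Illustrative only: largest average bitwidth whose packed weights
    fit in the VRAM budget after reserving a fixed overhead, clamped to
    a plausible [2, 16] bit range."""
    budget_bits = (total_vram_gb - overhead_gb) * 8e9  # GB -> bits
    return max(2.0, min(16.0, budget_bits / num_params))

# A 7B-parameter model: a 30 GB budget clamps to 16 bits (no quantization
# pressure), while an 8 GB budget yields roughly (8-2)*8e9/7e9 ≈ 6.9 bits.
print(round(estimate_wbits(7e9, 8.0), 1))
```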
check ¶
Check the settings.
Performs the following checks:
- `model_config` is a `ModelConfig` instance
- Mutual exclusion check for `quantizer` and `quantizers` (cannot specify both)
- Type check for `quantizer`/`quantizers` (must be `Quantizer` instances)
- At least one of them must be specified
- Parameter combination consistency check (see table below)
- When `multi_gpu=True`, `quantizer.flag_calibration=True` must hold
Valid parameter combinations:
| quantizers | qep | multi_gpu | calibration_batch_size |
|---|---|---|---|
| Specified | False | False | Specified |
| None | True | False | None |
| None | False | True | None |
| None | False | False | Specified |
| None | False | False | None |
Note
`multi_gpu=True` requires a quantizer with `flag_calibration=True`.
Raises:
| Type | Description |
|---|---|
| `TypeError` | Invalid type for `model_config`, `quantizer`, or `quantizers`. |
| `ValueError` | Invalid parameter combination. |
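The combinations table can be read as a small predicate over the four settings. A plain-Python sketch of that predicate (the names are illustrative, not OneComp's actual `check` implementation):

```python
# Each tuple: (quantizers specified, qep, multi_gpu, batch_size specified)
VALID_COMBINATIONS = [
    (True,  False, False, True),
    (False, True,  False, False),
    (False, False, True,  False),
    (False, False, False, True),
    (False, False, False, False),
]

def is_valid(quantizers, qep, multi_gpu, calibration_batch_size) -> bool:
    """Return True iff the settings match a row of the table above."""
    combo = (quantizers is not None, bool(qep), bool(multi_gpu),
             calibration_batch_size is not None)
    return combo in VALID_COMBINATIONS

print(is_valid(None, True, False, None))                # QEP alone: True
print(is_valid(None, True, True, None))                 # QEP + multi-GPU: False
print(is_valid(["gptq", "jointq"], False, False, 128))  # quantizers + chunked: True
```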
calculate_perplexity ¶
calculate_perplexity(original_model=False, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizer=None)
Calculate the perplexity of the model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the perplexity of the original model. | `False` |
| `dequantized_model` | `bool` | Whether to calculate the perplexity of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the perplexity of the quantized model. | `True` |
| `dataset_name` | `str` | The name of the dataset to use for calculating perplexity. | `'wikitext'` |
| `dataset_config` | `str` | The configuration of the dataset. | `'wikitext-2-raw-v1'` |
| `split` | `str` | The split of the dataset to use. | `'test'` |
| `max_samples` | `int` or `None` | The maximum number of samples to use. | `None` |
| `max_length` | `int` | Maximum length of the sliding window. Uses `model.config.max_position_embeddings` if `None`. 2048 is recommended to match standard paper values. | `2048` |
| `stride` | `int` | Stride of the sliding window. Same as `max_length` (no overlap) if `None`. | `2048` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(original_ppl, dequantized_ppl, quantized_ppl)` |
Note
Evaluating the original or dequantized model requires loading the full model on GPU.
Quantized-model evaluation (quantized_model=True) is
currently supported only for GPTQ and DBF quantizers.
Support for other quantization methods is planned.
Examples:
Single quantizer mode:
>>> original_ppl, dequantized_ppl, quantized_ppl = runner.calculate_perplexity()
Multiple quantizers mode:
>>> _, _, quantized_ppl = runner.calculate_perplexity(quantizer=gptq)
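The interaction of `max_length` and `stride` determines how many evaluation windows the sliding-window loop produces, and therefore the evaluation cost. A minimal sketch of the window arithmetic in pure Python (the helper name is illustrative, independent of OneComp):

```python
def count_windows(num_tokens: int, max_length: int, stride: int) -> int:
    """Number of sliding windows over a token sequence.

    With stride == max_length the windows do not overlap; a smaller
    stride makes windows overlap, which lowers measured perplexity at
    the cost of more forward passes."""
    if num_tokens <= max_length:
        return 1
    # One extra window per stride offset until the end is covered.
    return 1 + -(-(num_tokens - max_length) // stride)  # ceil division

# Non-overlapping (stride == max_length): 10,000 tokens, 2048-token windows
print(count_windows(10_000, 2048, 2048))  # 5 windows
# Overlapping (stride < max_length): many more windows
print(count_windows(10_000, 2048, 512))   # 17 windows
```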
benchmark_perplexity ¶
benchmark_perplexity(original_model=True, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizers=None)
Calculate perplexity for all quantizers at once
Internally calls calculate_perplexity for each quantizer. The original model PPL is calculated only once (on the first iteration).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the perplexity of the original model. | `True` |
| `dequantized_model` | `bool` | Whether to calculate the perplexity of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the perplexity of the quantized model. | `True` |
| `dataset_name` | `str` | The name of the dataset to use for calculating perplexity. | `'wikitext'` |
| `dataset_config` | `str` | The configuration of the dataset. | `'wikitext-2-raw-v1'` |
| `split` | `str` | The split of the dataset to use. | `'test'` |
| `max_samples` | `int` or `None` | The maximum number of samples to use. | `None` |
| `max_length` | `int` | Maximum length of the sliding window. Uses `model.config.max_position_embeddings` if `None`. | `2048` |
| `stride` | `int` | Stride of the sliding window. Same as `max_length` (no overlap) if `None`. | `2048` |
| `quantizers` | `list[Quantizer]` | List of quantizers. Uses `self.quantizers` or `[self.quantizer]` if `None`. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary of PPL values. Keys are `'original'` (when `original_model=True`) and each quantizer's name (e.g., `'GPTQ'`, `'JointQ'`). |
Examples:
>>> runner.run()
>>> ppl_dict = runner.benchmark_perplexity()
>>> print(ppl_dict)
{'original': 5.47, 'GPTQ': 5.72, 'JointQ': 5.68}
Specify quantizers explicitly:
>>> ppl_dict = runner.benchmark_perplexity(quantizers=[gptq, jointq])
Include dequantized model PPL:
>>> ppl_dict = runner.benchmark_perplexity(dequantized_model=True)
calculate_accuracy ¶
calculate_accuracy(original_model=False, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=True, quantizer=None)
Calculate the zero-shot accuracy of the model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the accuracy of the original model. | `False` |
| `dequantized_model` | `bool` | Whether to calculate the accuracy of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the accuracy of the quantized model. | `True` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `True` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(original_acc, dequantized_acc, quantized_acc)` |
Note
Evaluating the original or dequantized model requires loading the full model on GPU.
Quantized-model evaluation (quantized_model=True) is
currently supported only for GPTQ and DBF quantizers.
Support for other quantization methods is planned.
Examples:
Single quantizer mode:
>>> _, _, quantized_acc = runner.calculate_accuracy()
Multiple quantizers mode:
>>> _, _, quantized_acc = runner.calculate_accuracy(quantizer=gptq)
benchmark_accuracy ¶
benchmark_accuracy(original_model=True, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=False, quantizers=None)
Calculate accuracy for all quantizers at once
Internally calls calculate_accuracy for each quantizer. The original model accuracy is calculated only once (on the first iteration).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the accuracy of the original model. | `True` |
| `dequantized_model` | `bool` | Whether to calculate the accuracy of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the accuracy of the quantized model. | `True` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `False` |
| `quantizers` | `list[Quantizer]` | List of quantizers. Uses `self.quantizers` or `[self.quantizer]` if `None`. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary of accuracy values. Keys are `'original'` (when `original_model=True`) and each quantizer's name (e.g., `'GPTQ'`, `'JointQ'`). |
Examples:
>>> runner.run()
>>> acc_dict = runner.benchmark_accuracy()
>>> print(acc_dict)
{'original': {...}, 'GPTQ': {...}, 'JointQ': {...}}
Specify quantizers explicitly:
>>> acc_dict = runner.benchmark_accuracy(quantizers=[gptq, jointq])
Include dequantized model accuracy:
>>> acc_dict = runner.benchmark_accuracy(dequantized_model=True)
print_quantization_results ¶
Log quantization results.
Formats and logs the quantizer results. The following information is output for each layer:
- Quantization time (seconds)
- Output squared error (only if value exists)
- Mean output squared error (only if value exists)
- Weight squared error (only if value exists)
- Mean weight squared error (only if value exists)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.print_quantization_results()
Multiple quantizers mode:
>>> runner.print_quantization_results(quantizer=gptq)
save_quantization_statistics ¶
Save the quantization statistics
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | File path to save to. | required |
| `quantizer` | `Quantizer` | Quantizer whose statistics to save. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_quantization_statistics("quantization_statistics.json")
Multiple quantizers mode:
>>> runner.save_quantization_statistics("gptq_statistics.json", quantizer=gptq)
save_quantization_results ¶
Save the quantization results to a file
Save quantization results (QuantizationResult objects) to a file. The saved data includes dequantized weights, scales, zero points, integer assignments, and other quantization parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to save the quantization results. The `.pt` extension is recommended. | required |
| `quantizer` | `Quantizer` | Quantizer whose results to save. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_quantization_results("quantization_results.pt")
Multiple quantizers mode:
>>> runner.save_quantization_results("gptq_results.pt", quantizer=gptq)
save_dequantized_model ¶
Save the dequantized model to the specified path
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to save the dequantized model. | required |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_dequantized_model("./dequantized_model")
Multiple quantizers mode:
>>> runner.save_dequantized_model("./dequantized_model_gptq", quantizer=gptq)
save_quantized_model ¶
Save the quantized model to the specified directory
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str` | The path to save the quantized model. | required |
| `pack_weights` | `bool` | Whether to pack quantized weights into a more memory- and storage-efficient representation. | `True` |
Examples:
Single quantizer mode:
>>> runner.save_quantized_model("./quantized_model")
create_quantized_model ¶
Create a quantized model from quantization results.
Loads the base model on CPU, replaces Linear layers with quantized
inference layers (e.g. GPTQLinear), and attaches quantization
config to model.config.
Must be called after run() (i.e., quantizer.results must
be populated).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pack_weights` | `bool` | Whether to pack quantized weights for a memory-efficient representation. | `True` |
| `quantizer` | `Quantizer` | The quantizer to use. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
| `use_gemlite` | `bool` or `None` | Whether to use GemLite for inference layers. Set to `False` when saving to avoid extra params in safetensors. `None` uses the quantizer default. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[nn.Module, PreTrainedTokenizer]` | `(quantized_model, tokenizer)` |
Examples:
>>> model, tokenizer = runner.create_quantized_model()
save_quantized_model_pt ¶
Save the quantized model as a PyTorch .pt file.
Use this method to save models that include post-processing
modifications (e.g. LoRA adapters from PostProcessLoraSFT).
The entire model object is serialized with torch.save,
preserving custom module types such as LoRAGPTQLinear.
For models without post-processing, prefer
save_quantized_model which uses the HF-compatible
safetensors format.
The saved directory contains:
- model.pt: The model (torch.save)
- Tokenizer files (via save_pretrained)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str` | The path to save the model. | required |
See Also
`onecomp.load_quantized_model_pt` to load models saved by this method.
Examples:
>>> runner.save_quantized_model_pt("./quantized_model_pt")
analyze_cumulative_error ¶
analyze_cumulative_error(layer_keywords=None, plot_path=None, json_path=None, batch_keywords=False, quantizer=None)
Analyze cumulative quantization error for each linear layer.
Cumulative error: ||W_orig X_orig - W_quant X_quant||^2_F
Note
Must be used after calling the run() method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `layer_keywords` | `list[str]` or `None` | List of keywords to filter layers. Each keyword is analyzed and plotted separately. Default: `["mlp.down_proj"]`. Example: `["q_proj", "k_proj"]` | `None` |
| `plot_path` | `str` or `None` | Base path to save plots. The keyword is inserted before the extension, e.g. `"error.png"` -> `"error_mlp.down_proj.png"`. | `None` |
| `json_path` | `str` or `None` | Path to save results as a JSON file, e.g. `"cumulative_error.json"`. | `None` |
| `batch_keywords` | `bool` | If `True`, process all keywords in a single forward pass; faster, but uses more CPU memory because all target layers' outputs are stored simultaneously. If `False`, process each keyword separately with a model reload per keyword; uses less CPU memory but incurs overhead from repeated model loading and forward passes. | `False` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | `keyword -> {layer_name -> cumulative squared error}` |
Examples:
Single quantizer mode:
>>> results = runner.analyze_cumulative_error()
>>> results = runner.analyze_cumulative_error(plot_path="cumulative_error.png")
Multiple quantizers mode:
>>> results = runner.analyze_cumulative_error(quantizer=jointq)
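The cumulative error ||W_orig X_orig - W_quant X_quant||^2_F can be computed directly for a toy case. A pure-Python sketch (illustrative only, not the OneComp implementation; for simplicity the same inputs X are fed to both weight matrices, whereas the real analysis propagates original and quantized activations separately):

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frob_sq_diff(a, b):
    """Squared Frobenius norm of (a - b)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

W_orig  = [[1.0, 2.0], [3.0, 4.0]]
W_quant = [[1.0, 2.0], [3.0, 3.5]]  # one weight rounded by quantization
X = [[1.0, 0.0], [0.0, 1.0]]        # identity inputs for a minimal example
err = frob_sq_diff(matmul(W_orig, X), matmul(W_quant, X))
print(err)  # (4.0 - 3.5)^2 = 0.25
```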
prepare_calibration_dataset ¶
Prepare calibration data for quantization methods such as GPTQ.
See utils.calibration.prepare_calibration_dataset for details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `device` | `device` | Device to place tensors on (CPU or GPU). | required |
Returns:
| Type | Description |
|---|---|
| `dict` | Input dictionary for the model: `"input_ids"`, a tensor of shape `(num_chunks, max_length)`, and `"attention_mask"`, a tensor of shape `(num_chunks, max_length)`. |