
Runner

The Runner class is the main entry point for executing quantization pipelines in OneComp.

Runner

Runner(model_config=None, calibration_dataset=None, max_length=2048, num_calibration_samples=512, quantizer=None, quantizers=None, qep=False, qep_config=None, calibration_strategy='drop_rand', calibration_seed=0, multi_gpu=False, gpu_ids=None, calibration_batch_size=None, num_layers_per_group=7, post_processes=None)

Runner class for model quantization

Executes quantization using calibration data, with support for layer-wise parallel quantization on multiple GPUs.

Examples:

Single GPU quantization (default):

>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(model_id_or_path="meta-llama/Llama-2-7b-hf")
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
...     model_config=model_config,
...     quantizer=quantizer,
... )
>>> runner.run()

Multi-GPU quantization (layer-wise parallel):

>>> from onecomp.quantizer.jointq import JointQ
>>> quantizer = JointQ(bits=4, group_size=128)
>>> # Use all available GPUs
>>> runner = Runner(
...     model_config=model_config,
...     quantizer=quantizer,
...     multi_gpu=True,
... )
>>> runner.run()
>>> # Use specific GPUs (e.g., GPU 0, 2, 3)
>>> runner = Runner(
...     model_config=model_config,
...     quantizer=quantizer,
...     multi_gpu=True,
...     gpu_ids=[0, 2, 3],
... )
>>> runner.run()

__init__ method

Parameters:

Name Type Description Default
model_config ModelConfig

Model configuration. Required.

None
calibration_dataset Dataset

Calibration dataset. If None, a default calibration dataset is loaded (see num_calibration_samples).

None
max_length int

The maximum length of the input sequence.

2048
num_calibration_samples int

The number of calibration samples to use when loading default dataset.

512
quantizer Quantizer

The quantizer to use. Specify either quantizer or quantizers, not both. At least one must be given.

None
quantizers list[Quantizer]

Specify multiple quantizers. When used with calibration_batch_size, the X^T X accumulation is shared across quantizers, so the calibration forward pass runs only once. Specify either quantizer or quantizers, not both. Currently this is only available when calibration_batch_size is set and qep=False.

None
qep bool

Whether to use QEP.

False
qep_config QEPConfig or None

Configuration for QEP. If None and qep=True, a default QEPConfig() is used.

None
calibration_strategy str

Strategy for preparing calibration inputs. Default is "drop_rand".

Available strategies:

  • "concat_chunk": Concatenate all texts, tokenize once, and split into fixed-length chunks (max_length). Creates as many chunks as possible from the data.
  • "concat_chunk_align": Same as "concat_chunk", but adjusts the number of loaded samples so that num_chunks == num_calibration_samples. This ensures consistent token counts across experiments.
  • "drop_head": No cross-document mixing. Tokenize each document independently, drop samples with token length < max_length, and take the head window (the first max_length tokens).
  • "drop_rand": Same as "drop_head", but take a random window of length max_length from each long document (reproducible with calibration_seed).

'drop_rand'
calibration_seed int

Random seed used by some calibration strategies (e.g., "drop_rand").

0
multi_gpu bool

Whether to use multi-GPU for layer-wise parallel quantization. Default is False.

False
gpu_ids list[int]

List of GPU IDs to use for multi-GPU quantization. If None and multi_gpu is True, all available GPUs will be used.

None
calibration_batch_size int or None

Batch size (number of sentences) for chunked calibration forward passes. Default is None (all calibration data in a single forward pass). When set to a positive integer (e.g., 128), calibration data is split into chunks of this size and forwarded in multiple passes to reduce GPU memory usage. The necessary statistics (e.g., X^T X for Hessian-based methods) are accumulated across chunks. This is mathematically exact, not an approximation.

None
num_layers_per_group int

Number of layers to process simultaneously in chunked calibration mode. Default is 7 (one Transformer block for Llama-like architectures: q,k,v,o,gate,up,down). Controls the trade-off between CPU memory usage for X^T X storage and the number of forward passes required. Only used when calibration_batch_size is set.

7
post_processes list[PostQuantizationProcess] or None

Optional list of post-quantization processes to execute after the main quantization step. Each process receives a quantized model on CPU (built via create_quantized_model) and may modify it in-place. Processes are executed in order. Default is None.

None
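The exactness claim for calibration_batch_size follows from the linearity of the X^T X accumulation: summing per-chunk Gram matrices gives exactly the full-batch result. A minimal NumPy sketch (independent of OneComp, with illustrative shapes) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))  # 512 calibration rows, 64 features

# Full-batch statistic: X^T X computed in a single pass
H_full = X.T @ X

# Chunked accumulation: four passes of 128 rows each, lower peak memory
H_chunked = np.zeros((64, 64))
for chunk in np.split(X, 4):
    H_chunked += chunk.T @ chunk

# Identical result: chunking is mathematically exact, not an approximation
assert np.allclose(H_full, H_chunked)
```

The same reasoning applies to any statistic that is a sum over calibration rows, which is why Hessian-based methods such as GPTQ can accumulate across chunks without loss.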
Note

For zero-config quantization (VRAM auto-estimation + AutoBitQuantizer + QEP), use the class method :meth:auto_run instead.

Examples:

Chunked calibration with GPTQ (large-scale calibration data):

>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(
...     model_id_or_path="meta-llama/Llama-2-7b-hf"
... )
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
...     model_config=model_config,
...     quantizer=quantizer,
...     max_length=2048,
...     num_calibration_samples=1024,
...     calibration_batch_size=128,  # Forward 128 sentences at a time
... )
>>> runner.run()

With custom num_layers_per_group:

>>> # When memory is sufficient: process 2 blocks (14 layers) simultaneously
>>> runner = Runner(
...     model_config=model_config,
...     quantizer=quantizer,
...     max_length=2048,
...     num_calibration_samples=1024,
...     calibration_batch_size=128,
...     num_layers_per_group=14,
... )
>>> runner.run()

Multiple quantizers (benchmark comparison):

>>> import torch
>>> from onecomp.quantizer.gptq import GPTQ
>>> from onecomp.quantizer.jointq import JointQ
>>> gptq = GPTQ(wbits=4, groupsize=128, calc_quant_error=True)
>>> jointq = JointQ(bits=4, group_size=128, calc_quant_error=True,
...                 device=torch.device(0))
>>> runner = Runner(
...     model_config=model_config,
...     quantizers=[gptq, jointq],
...     max_length=2048,
...     num_calibration_samples=1024,
...     calibration_batch_size=128,
... )
>>> runner.run()
>>> # Results are stored in gptq.results and jointq.results respectively

auto_run classmethod

auto_run(model_id: str, wbits: Optional[float] = None, total_vram_gb: Optional[float] = None, groupsize: int = 128, device: str = 'cuda:0', qep: bool = True, evaluate: bool = True, eval_original_model: bool = False, save_dir: str = 'auto', **kwargs)

One-liner quantization with sensible defaults.

Sets up ModelConfig, AutoBitQuantizer (ILP-based mixed-precision), and QEP, then runs quantization. When wbits is None, the target bitwidth is estimated automatically from available VRAM. Optionally evaluates perplexity and accuracy, and saves the quantized model.

Parameters:

Name Type Description Default
model_id str

Hugging Face model ID or local path.

required
wbits float or None

Target quantization bitwidth. When None (default), estimated from VRAM via estimate_wbits_from_vram.

None
total_vram_gb float or None

Total VRAM budget in GB for bitwidth estimation. Only used when wbits is None. When None, the installed GPU VRAM is detected automatically.

None
groupsize int

GPTQ group size (default: 128). Use -1 to disable grouping.

128
device str

Device to place the model on (default: "cuda:0").

'cuda:0'
qep bool

Whether to use QEP (default: True).

True
evaluate bool

Whether to calculate perplexity and accuracy after quantization (default: True).

True
eval_original_model bool

Whether to also evaluate the original (unquantized) model (default: False).

False
save_dir str or None

Directory to save the quantized model. "auto" (default) derives the path from model_id (e.g., "TinyLlama-1.1B-...-autobit-3.5bit"). Set to None to skip saving.

'auto'
**kwargs

Additional keyword arguments forwarded to the GPTQ constructor (e.g., actorder, sym).

{}

Returns:

Name Type Description
Runner

The configured Runner instance (with quantization results accessible via runner.quantizer.results).

Examples:

Minimal usage (QEP + GPTQ 4-bit, groupsize=128, auto-save):

>>> from onecomp import Runner
>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
... )

Custom save directory:

>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
...     save_dir="./my_quantized_model",
... )

Skip saving:

>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
...     save_dir=None,
... )

Evaluate both original and quantized models:

>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
...     eval_original_model=True,
... )

run

run()

Execute quantization and related processing

check

check()

Check the settings

Performs the following checks:

  1. model_config is a ModelConfig instance
  2. Mutual exclusion check for quantizer and quantizers (cannot specify both)
  3. Type check for quantizer / quantizers (must be Quantizer instances)
  4. At least one of them must be specified
  5. Parameter combination consistency check (see table below)
  6. When multi_gpu=True, quantizer.flag_calibration=True must hold

Valid parameter combinations:

=========== ===== ========= ======================
quantizers  qep   multi_gpu calibration_batch_size
=========== ===== ========= ======================
Specified   False False     Specified
None        True  False     None
None        False True      None
None        False False     Specified
None        False False     None
=========== ===== ========= ======================

Note

multi_gpu=True requires a quantizer with flag_calibration=True.

Raises:

Type Description
TypeError

Invalid type for model_config, quantizer, or quantizers

ValueError

Invalid parameter combination
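Checks 2 and 4 (mutual exclusion of quantizer and quantizers, with at least one required) can be sketched as a hypothetical standalone function; this is not OneComp's actual implementation, just an illustration of the rule check() enforces:

```python
def check_quantizer_args(quantizer=None, quantizers=None):
    """Exactly one of quantizer / quantizers must be given."""
    if quantizer is not None and quantizers is not None:
        raise ValueError("Specify either quantizer or quantizers, not both")
    if quantizer is None and quantizers is None:
        raise ValueError("At least one of quantizer or quantizers is required")

check_quantizer_args(quantizer="gptq")       # OK: exactly one specified
try:
    check_quantizer_args("gptq", ["jointq"])  # both specified
except ValueError as e:
    print(e)  # Specify either quantizer or quantizers, not both
```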

calculate_perplexity

calculate_perplexity(original_model=False, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizer=None)

Calculate the perplexity of the model

Parameters:

Name Type Description Default
original_model bool

Whether to calculate the perplexity of the original model.

False
dequantized_model bool

Whether to calculate the perplexity of the dequantized model.

False
quantized_model bool

Whether to calculate the perplexity of the quantized model.

True
dataset_name str

The name of the dataset to use for calculating perplexity.

'wikitext'
dataset_config str

The configuration of the dataset.

'wikitext-2-raw-v1'
split str

The split of the dataset to use.

'test'
max_samples int

The maximum number of samples to use.

None
max_length int

Maximum length of the sliding window. Uses model.config.max_position_embeddings if None. 2048 is recommended to match standard paper values.

2048
stride int

Stride of the sliding window. Same as max_length (no overlap) if None.

2048
quantizer Quantizer

The quantizer. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Returns:

Name Type Description
tuple

(original_ppl, dequantized_ppl, quantized_ppl)

Note

Evaluating the original or dequantized model requires loading the full model on GPU.

Quantized-model evaluation (quantized_model=True) is currently supported only for GPTQ and DBF quantizers. Support for other quantization methods is planned.

Examples:

Single quantizer mode:

>>> original_ppl, dequantized_ppl, quantized_ppl = runner.calculate_perplexity()

Multiple quantizers mode:

>>> original_ppl, dequantized_ppl, quantized_ppl = runner.calculate_perplexity(
...     quantizer=gptq
... )
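The max_length and stride parameters describe a sliding-window evaluation; assuming the standard convention (stride equal to max_length means the windows tile the token sequence without overlap), the windowing can be sketched as:

```python
def sliding_windows(n_tokens, max_length=2048, stride=2048):
    """Return (start, end) token windows over a sequence of n_tokens."""
    windows = []
    for start in range(0, n_tokens, stride):
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
    return windows

# stride == max_length: non-overlapping windows covering all tokens
print(sliding_windows(5000))  # [(0, 2048), (2048, 4096), (4096, 5000)]
```

A stride smaller than max_length would produce overlapping windows, trading compute for more context per predicted token.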

benchmark_perplexity

benchmark_perplexity(original_model=True, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizers=None)

Calculate perplexity for all quantizers at once

Internally calls calculate_perplexity for each quantizer. The original model PPL is calculated only once (on the first iteration).

Parameters:

Name Type Description Default
original_model bool

Whether to calculate the perplexity of the original model.

True
dequantized_model bool

Whether to calculate the perplexity of the dequantized model.

False
quantized_model bool

Whether to calculate the perplexity of the quantized model.

True
dataset_name str

The name of the dataset to use for calculating perplexity.

'wikitext'
dataset_config str

The configuration of the dataset.

'wikitext-2-raw-v1'
split str

The split of the dataset to use.

'test'
max_samples int

The maximum number of samples to use.

None
max_length int

Maximum length of the sliding window. Uses model.config.max_position_embeddings if None.

2048
stride int

Stride of the sliding window. Same as max_length (no overlap) if None.

2048
quantizers list[Quantizer]

List of quantizers. Uses self.quantizers or [self.quantizer] if None.

None

Returns:

Name Type Description
dict

Dictionary of PPL values. Keys are as follows:

  • "original": PPL of the original model (not included if skipped)
  • quantizer.name: PPL for each quantizer (quantized or dequantized, with quantized taking precedence)
  • quantizer.name + "_dequantized": PPL of the dequantized model (only included when dequantized_model=True)

Examples:

>>> runner.run()
>>> ppl_dict = runner.benchmark_perplexity()
>>> print(ppl_dict)
{'original': 5.47, 'GPTQ': 5.72, 'JointQ': 5.68}

Specify quantizers explicitly:

>>> ppl_dict = runner.benchmark_perplexity(quantizers=[gptq, jointq])

Include dequantized model PPL:

>>> ppl_dict = runner.benchmark_perplexity(dequantized_model=True)
>>> print(ppl_dict)
{'original': 5.47, 'GPTQ': 5.72, 'GPTQ_dequantized': 5.71}

calculate_accuracy

calculate_accuracy(original_model=False, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=True, quantizer=None)

Calculate the zero-shot accuracy of the model

Parameters:

Name Type Description Default
original_model bool

Whether to calculate the accuracy of the original model.

False
dequantized_model bool

Whether to calculate the accuracy of the dequantized model.

False
quantized_model bool

Whether to calculate the accuracy of the quantized model.

True
tasks list

The list of tasks to evaluate. Default: ["arc_easy", "arc_challenge", "piqa", "winogrande"]

None
batch_size int

The batch size for evaluation.

8
num_fewshot int

The number of few-shot examples.

0
display_results bool

Whether to display the results.

True
quantizer Quantizer

The quantizer. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Returns:

Name Type Description
tuple

(original_acc, dequantized_acc, quantized_acc)

Note

Evaluating the original or dequantized model requires loading the full model on GPU.

Quantized-model evaluation (quantized_model=True) is currently supported only for GPTQ and DBF quantizers. Support for other quantization methods is planned.

Examples:

Single quantizer mode:

>>> original_acc, dequantized_acc, quantized_acc = runner.calculate_accuracy()

Multiple quantizers mode:

>>> original_acc, dequantized_acc, quantized_acc = runner.calculate_accuracy(
...     quantizer=gptq
... )

benchmark_accuracy

benchmark_accuracy(original_model=True, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=False, quantizers=None)

Calculate accuracy for all quantizers at once

Internally calls calculate_accuracy for each quantizer. The original model accuracy is calculated only once (on the first iteration).

Parameters:

Name Type Description Default
original_model bool

Whether to calculate the accuracy of the original model.

True
dequantized_model bool

Whether to calculate the accuracy of the dequantized model.

False
quantized_model bool

Whether to calculate the accuracy of the quantized model.

True
tasks list

The list of tasks to evaluate. Default: ["arc_easy", "arc_challenge", "piqa", "winogrande"]

None
batch_size int

The batch size for evaluation.

8
num_fewshot int

The number of few-shot examples.

0
display_results bool

Whether to display the results.

False
quantizers list[Quantizer]

List of quantizers. Uses self.quantizers or [self.quantizer] if None.

None

Returns:

Name Type Description
dict

Dictionary of accuracy values. Keys are as follows:

  • "original": Accuracy of the original model (not included if skipped)
  • quantizer.name: Accuracy for each quantizer (quantized or dequantized, with quantized taking precedence)
  • quantizer.name + "_dequantized": Accuracy of the dequantized model (only included when dequantized_model=True)

Examples:

>>> runner.run()
>>> acc_dict = runner.benchmark_accuracy()
>>> print(acc_dict)
{'original': {...}, 'GPTQ': {...}, 'JointQ': {...}}

Specify quantizers explicitly:

>>> acc_dict = runner.benchmark_accuracy(quantizers=[gptq, jointq])

Include dequantized model accuracy:

>>> acc_dict = runner.benchmark_accuracy(dequantized_model=True)

print_quantization_results

print_quantization_results(quantizer=None)

Log quantization results.

Formats and logs the quantizer results. The following information is output for each layer:

  • Quantization time (seconds)
  • Output squared error (only if value exists)
  • Mean output squared error (only if value exists)
  • Weight squared error (only if value exists)
  • Mean weight squared error (only if value exists)

Parameters:

Name Type Description Default
quantizer Quantizer

The quantizer. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Examples:

Single quantizer mode:

>>> runner.print_quantization_results()

Multiple quantizers mode:

>>> runner.print_quantization_results(quantizer=gptq)

save_quantization_statistics

save_quantization_statistics(path: str, quantizer=None)

Save the quantization statistics

Parameters:

Name Type Description Default
path str

File path to save to

required
quantizer Quantizer

Quantizer whose statistics to save. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Examples:

Single quantizer mode:

>>> runner.save_quantization_statistics("stats.json")

Multiple quantizers mode:

>>> quantizers = [gptq, jointq]
>>> runner.save_quantization_statistics("gptq_stats.json", quantizer=gptq)
>>> runner.save_quantization_statistics("jointq_stats.json", quantizer=jointq)

save_quantization_results

save_quantization_results(path: str, quantizer=None)

Save the quantization results to a file

Save quantization results (QuantizationResult objects) to a file. The saved data includes dequantized weights, scales, zero points, integer assignments, and other quantization parameters.

Parameters:

Name Type Description Default
path str

The path to save the quantization results. The .pt extension is recommended.

required
quantizer Quantizer

Quantizer whose results to save. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Examples:

Single quantizer mode:

>>> runner.save_quantization_results("results.pt")

Multiple quantizers mode:

>>> quantizers = [gptq, jointq]
>>> runner.save_quantization_results("gptq_results.pt", quantizer=gptq)
>>> runner.save_quantization_results("jointq_results.pt", quantizer=jointq)

save_dequantized_model

save_dequantized_model(path: str, quantizer=None)

Save the dequantized model to the specified path

Parameters:

Name Type Description Default
path str

The path to save the dequantized model.

required
quantizer Quantizer

The quantizer. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Examples:

Single quantizer mode:

>>> runner.save_dequantized_model("./dequantized_model")

Multiple quantizers mode:

>>> runner.save_dequantized_model("./gptq_model", quantizer=gptq)
>>> runner.save_dequantized_model("./jointq_model", quantizer=jointq)

save_quantized_model

save_quantized_model(save_directory: str, pack_weights: bool = True)

Save the quantized model to the specified directory

Parameters:

Name Type Description Default
save_directory str

The path to save the quantized model.

required
pack_weights bool

Whether to pack quantized weights for more memory/storage-efficient representation.

True

Examples:

Single quantizer mode:

>>> runner.save_quantized_model("./quantized_model")

create_quantized_model

create_quantized_model(pack_weights: bool = True, quantizer=None, use_gemlite=None)

Create a quantized model from quantization results.

Loads the base model on CPU, replaces Linear layers with quantized inference layers (e.g. GPTQLinear), and attaches quantization config to model.config.

Must be called after run() (i.e., quantizer.results must be populated).

Parameters:

Name Type Description Default
pack_weights bool

Whether to pack quantized weights for memory-efficient representation. Default is True.

True
quantizer Quantizer

The quantizer to use. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None
use_gemlite bool or None

Whether to use GemLite for inference layers. Set to False when saving to avoid extra params in safetensors. Default is None (uses quantizer default).

None

Returns:

Type Description

tuple[nn.Module, PreTrainedTokenizer]: (quantized_model, tokenizer)

Examples:

>>> runner.run()
>>> model, tokenizer = runner.create_quantized_model()

With post-process:

>>> model, tokenizer = runner.create_quantized_model(pack_weights=False)
>>> post_process = PostProcessLoraSFT(data_files="train.jsonl")
>>> post_process.run(model, runner.model_config)

save_quantized_model_pt

save_quantized_model_pt(save_directory: str)

Save the quantized model as a PyTorch .pt file.

Use this method to save models that include post-processing modifications (e.g. LoRA adapters from PostProcessLoraSFT). The entire model object is serialized with torch.save, preserving custom module types such as LoRAGPTQLinear.

For models without post-processing, prefer save_quantized_model which uses the HF-compatible safetensors format.

The saved directory contains:

  • model.pt: The model (torch.save)
  • Tokenizer files (via save_pretrained)

Parameters:

Name Type Description Default
save_directory str

The path to save the model.

required
See Also

:func:onecomp.load_quantized_model_pt to load models saved by this method.

Examples:

>>> runner.run()  # with post_processes=[PostProcessLoraSFT(...)]
>>> runner.save_quantized_model_pt("./quantized_model_lora")

analyze_cumulative_error

analyze_cumulative_error(layer_keywords=None, plot_path=None, json_path=None, batch_keywords=False, quantizer=None)

Analyze cumulative quantization error for each linear layer.

Cumulative error: ||W_orig X_orig - W_quant X_quant||^2_F
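The metric uses quantized weights *and* quantized inputs, so it captures error propagated from earlier layers, not just this layer's weight perturbation. A NumPy sketch of the quantity (illustrative shapes and perturbations only):

```python
import numpy as np

rng = np.random.default_rng(0)
W_orig = rng.standard_normal((32, 64))   # original layer weights
X_orig = rng.standard_normal((64, 128))  # original layer inputs

# Quantization perturbs the weights; error accumulated in earlier
# layers perturbs the inputs the layer actually receives
W_quant = W_orig + 0.01 * rng.standard_normal(W_orig.shape)
X_quant = X_orig + 0.01 * rng.standard_normal(X_orig.shape)

# Cumulative error: ||W_orig X_orig - W_quant X_quant||_F^2
cumulative_error = np.linalg.norm(W_orig @ X_orig - W_quant @ X_quant, "fro") ** 2
```

Setting X_quant = X_orig instead would recover a per-layer output error that ignores propagation.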

Note

Must be used after calling the run() method.

Parameters:

Name Type Description Default
layer_keywords

List of keywords used to filter layers; each keyword is analyzed and plotted separately. Default: ["mlp.down_proj"]. Example: ["q_proj", "k_proj"].

None
plot_path

Base path for saving plots. The keyword is inserted before the file extension, e.g. "error.png" -> "error_mlp.down_proj.png".

None
json_path

Path to save results as JSON file. Example: "cumulative_error.json"

None
batch_keywords

If True, process all keywords in a single forward pass. This is faster but uses more CPU memory because all target layers' outputs are stored simultaneously. If False (default), process each keyword separately with model reload per keyword. This uses less CPU memory but incurs overhead from repeated model loading and forward passes.

False
quantizer Quantizer

The quantizer. Uses self.quantizer if None. Specify explicitly when using quantizers mode.

None

Returns:

Name Type Description
dict

keyword -> {layer_name -> cumulative squared error}

Examples:

Single quantizer mode:

>>> results = runner.analyze_cumulative_error()
>>> results = runner.analyze_cumulative_error(plot_path="cumulative_error.png")

Multiple quantizers mode:

>>> results = runner.analyze_cumulative_error(quantizer=gptq)

prepare_calibration_dataset

prepare_calibration_dataset(device)

Prepare calibration data for quantization methods such as GPTQ.

See utils.calibration.prepare_calibration_dataset for details.

Parameters:

Name Type Description Default
device device

Device to place tensors on (CPU or GPU)

required

Returns:

Name Type Description
dict

Input dictionary for the model:

  • "input_ids": tensor of shape (num_chunks, max_length)
  • "attention_mask": tensor of shape (num_chunks, max_length)
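The (num_chunks, max_length) shape comes from the chunking step of strategies such as "concat_chunk". A tokenizer-free sketch, with integers standing in for token IDs (not OneComp's actual implementation):

```python
def concat_chunk(token_lists, max_length):
    """Concatenate tokenized texts and split into fixed-length chunks,
    dropping the trailing remainder that is shorter than max_length."""
    flat = [t for tokens in token_lists for t in tokens]
    n_chunks = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]

docs = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
print(concat_chunk(docs, max_length=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Stacking the resulting chunks row-wise yields the (num_chunks, max_length) "input_ids" tensor described above.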