Runner¶
The Runner class is the main entry point for executing quantization pipelines in OneComp.
Runner ¶
Runner(model_config=None, calibration_dataset=None, max_length=2048, num_calibration_samples=512, quantizer=None, quantizers=None, qep=False, qep_config=None, calibration_strategy='drop_rand', calibration_seed=0, multi_gpu=False, gpu_ids=None, calibration_batch_size=None, num_layers_per_group=7, post_processes=None)
Runner class for executing quantization. Supports quantization using calibration data and layer-wise parallel quantization across multiple GPUs.
Examples:
Single GPU quantization (default):
>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(model_id_or_path="meta-llama/Llama-2-7b-hf")
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... )
>>> runner.run()
Multi-GPU quantization (layer-wise parallel):
>>> from onecomp.quantizer.jointq import JointQ
>>> quantizer = JointQ(bits=4, group_size=128)
>>> # Use all available GPUs
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... multi_gpu=True,
... )
>>> runner.run()
>>> # Use specific GPUs (e.g., GPU 0, 2, 3)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... multi_gpu=True,
... gpu_ids=[0, 2, 3],
... )
>>> runner.run()
`__init__` method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `ModelConfig` | Model configuration. Required. | `None` |
| `calibration_dataset` | `Dataset` | Calibration dataset. If `None`, a default calibration dataset is loaded (see `num_calibration_samples`). | `None` |
| `max_length` | `int` | The maximum length of the input sequence. | `2048` |
| `num_calibration_samples` | `int` | The number of calibration samples to use when loading the default dataset. | `512` |
| `quantizer` | `Quantizer` | The quantizer to use. Specify either `quantizer` or `quantizers`, not both. | `None` |
| `quantizers` | `list[Quantizer]` | Specify multiple quantizers. When used with `calibration_batch_size`, the X^T X accumulation is shared, reducing the forward pass to a single execution. Specify either `quantizer` or `quantizers`, not both. | `None` |
| `qep` | `bool` | Whether to use QEP. | `False` |
| `qep_config` | `QEPConfig` or `None` | Configuration for QEP. If `None` and `qep=True`, a default configuration is used. | `None` |
| `calibration_strategy` | `str` | Strategy for preparing calibration inputs. Available strategies: `"concat_chunk"`: concatenate all texts, tokenize once, and split into fixed-length chunks of `max_length`, creating as many chunks as possible from the data. `"concat_chunk_align"`: same as `"concat_chunk"`, but adjusts the number of loaded samples so that `num_chunks == num_calibration_samples`, ensuring consistent token counts across experiments. `"drop_head"`: no cross-document mixing; tokenize each document independently, drop samples with token length < `max_length`, and take the head window (first `max_length` tokens). `"drop_rand"`: same as `"drop_head"`, but take a random window of length `max_length` from each long document (reproducible with `calibration_seed`). | `'drop_rand'` |
| `calibration_seed` | `int` | Random seed used by some calibration strategies (e.g., `"drop_rand"`). | `0` |
| `multi_gpu` | `bool` | Whether to use multi-GPU for layer-wise parallel quantization. | `False` |
| `gpu_ids` | `list[int]` | List of GPU IDs to use for multi-GPU quantization. If `None` and `multi_gpu` is `True`, all available GPUs are used. | `None` |
| `calibration_batch_size` | `int` or `None` | Batch size (number of sentences) for chunked calibration forward passes. If `None`, all calibration data is forwarded in a single pass. When set to a positive integer (e.g., 128), calibration data is split into chunks of this size and forwarded in multiple passes to reduce GPU memory usage. The necessary statistics (e.g., X^T X for Hessian-based methods) are accumulated across chunks, so the result is mathematically exact, not an approximation. | `None` |
| `num_layers_per_group` | `int` | Number of layers to process simultaneously in chunked calibration mode. The default of 7 covers one Transformer block for Llama-like architectures (q, k, v, o, gate, up, down). Controls the trade-off between CPU memory usage for X^T X storage and the number of forward passes required. Only used when `calibration_batch_size` is set. | `7` |
| `post_processes` | `list[PostQuantizationProcess]` or `None` | Optional list of post-quantization processes to execute after the main quantization step. Each process receives a quantized model on CPU (built via `create_quantized_model`). | `None` |
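To make the difference between the calibration strategies concrete, here is a small pure-Python sketch of the two basic behaviors. The function names and token-ID lists are illustrative only and are not part of the OneComp API:

```python
def concat_chunk(docs: list[list[int]], max_length: int) -> list[list[int]]:
    """'concat_chunk': concatenate all token streams, then split into
    fixed-length chunks, creating as many full chunks as possible."""
    stream = [tok for doc in docs for tok in doc]
    n_chunks = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]

def drop_head(docs: list[list[int]], max_length: int) -> list[list[int]]:
    """'drop_head': no cross-document mixing; drop documents shorter than
    max_length tokens and keep the first max_length tokens of the rest."""
    return [doc[:max_length] for doc in docs if len(doc) >= max_length]

# Three "documents" of 10, 3, and 7 tokens
docs = [list(range(10)), list(range(3)), list(range(7))]
print(len(concat_chunk(docs, 4)))  # 20 tokens total -> 5 chunks of 4
print(len(drop_head(docs, 4)))     # the 10- and 7-token docs survive -> 2
```

`"drop_rand"` differs from `drop_head` only in taking a random (seeded) window instead of the head window.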
Note
For zero-config quantization (VRAM auto-estimation + AutoBitQuantizer + QEP), use the class method `auto_run` instead.
Examples:
Chunked calibration with GPTQ (large-scale calibration data):
>>> from onecomp import Runner, ModelConfig
>>> from onecomp.quantizer.gptq import GPTQ
>>> model_config = ModelConfig(
... model_id_or_path="meta-llama/Llama-2-7b-hf"
... )
>>> quantizer = GPTQ(wbits=4, groupsize=128)
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128, # Forward 128 sentences at a time
... )
>>> runner.run()
With custom num_layers_per_group:
>>> # When memory is sufficient: process 2 blocks (14 layers) simultaneously
>>> runner = Runner(
... model_config=model_config,
... quantizer=quantizer,
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128,
... num_layers_per_group=14,
... )
>>> runner.run()
Multiple quantizers (benchmark comparison):
>>> import torch
>>> from onecomp.quantizer.gptq import GPTQ
>>> from onecomp.quantizer.jointq import JointQ
>>> gptq = GPTQ(wbits=4, groupsize=128, calc_quant_error=True)
>>> jointq = JointQ(bits=4, group_size=128, calc_quant_error=True,
...                 device=torch.device(0))
>>> runner = Runner(
... model_config=model_config,
... quantizers=[gptq, jointq],
... max_length=2048,
... num_calibration_samples=1024,
... calibration_batch_size=128,
... )
>>> runner.run()
>>> # Results are stored in gptq.results and jointq.results respectively
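The claim that chunked calibration is mathematically exact follows from the fact that X^T X decomposes into a sum over row blocks: stacking chunks X_1, ..., X_k gives X^T X = X_1^T X_1 + ... + X_k^T X_k. A tiny pure-Python check of this identity (illustrative only, not OneComp code):

```python
def matmul_t(a):
    """Compute a^T a for a list-of-lists matrix (rows = samples)."""
    cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(cols)]
            for i in range(cols)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# "Calibration data": 4 samples (rows) with 3 features each
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
full = matmul_t(X)                                    # X^T X in one pass
chunked = mat_add(matmul_t(X[:2]), matmul_t(X[2:]))   # accumulate per chunk
assert full == chunked  # exact, not an approximation
print(full)
```

This is why splitting the forward pass into `calibration_batch_size`-sized chunks changes memory usage but not the accumulated statistics.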
auto_run classmethod ¶
auto_run(model_id: str, wbits: Optional[float] = None, total_vram_gb: Optional[float] = None, groupsize: int = 128, device: str = 'cuda:0', qep: bool = True, evaluate: bool = True, eval_original_model: bool = False, save_dir: str = 'auto', **kwargs)
One-liner quantization with sensible defaults.
Sets up ModelConfig, AutoBitQuantizer (ILP-based mixed-precision),
and QEP, then runs quantization. When wbits is None,
the target bitwidth is estimated automatically from available VRAM.
Optionally evaluates perplexity and accuracy, and saves the
quantized model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Hugging Face model ID or local path. | required |
| `wbits` | `float` or `None` | Target quantization bitwidth. When `None`, the target bitwidth is estimated automatically from available VRAM. | `None` |
| `total_vram_gb` | `float` or `None` | Total VRAM budget in GB for bitwidth estimation. Only used when `wbits` is `None`. | `None` |
| `groupsize` | `int` | GPTQ group size. Use -1 to disable grouping. | `128` |
| `device` | `str` | Device to place the model on. | `'cuda:0'` |
| `qep` | `bool` | Whether to use QEP. | `True` |
| `evaluate` | `bool` | Whether to calculate perplexity and accuracy after quantization. | `True` |
| `eval_original_model` | `bool` | Whether to also evaluate the original (unquantized) model. | `False` |
| `save_dir` | `str` or `None` | Directory to save the quantized model. `'auto'` generates a directory name automatically; `None` skips saving. | `'auto'` |
| `**kwargs` | | Additional keyword arguments forwarded to the `Runner` constructor. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Runner` | The configured `Runner` instance, with quantization results accessible via `quantizer.results`. |
Examples:
Minimal usage (QEP + GPTQ 4-bit, groupsize=128, auto-save):
>>> from onecomp import Runner
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
... )
Custom save directory:
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
... save_dir="./my_quantized_model",
... )
Skip saving:
>>> runner = Runner.auto_run(
... model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
... save_dir=None,
... )
Evaluate both original and quantized models:
>>> runner = Runner.auto_run(
...     model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
...     eval_original_model=True,
... )
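The document does not specify how `auto_run` maps a VRAM budget to a bitwidth, only that one is estimated from the other. As a rough mental model (the formula, overhead constant, and clamping range below are assumptions for illustration, not OneComp's actual estimator), a model with `n` parameters stored at `b` bits plus some fixed overhead must fit in the budget:

```python
def estimate_wbits(num_params: float, total_vram_gb: float,
                   overhead_gb: float = 2.0) -> float:
    """Illustrative only: largest average bitwidth whose packed weights
    fit in the VRAM budget after reserving a fixed overhead, clamped to
    a plausible [2, 16] bit range."""
    budget_bits = (total_vram_gb - overhead_gb) * 8e9  # GB -> bits
    return max(2.0, min(16.0, budget_bits / num_params))

# A 7B-parameter model: a 30 GB budget clamps to 16 bits (no quantization
# pressure), while an 8 GB budget yields roughly (8-2)*8e9/7e9 ≈ 6.9 bits.
print(round(estimate_wbits(7e9, 8.0), 1))
```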
check ¶
Check the settings.
Performs the following checks:
- `model_config` is a `ModelConfig` instance
- Mutual exclusion check for `quantizer` and `quantizers` (cannot specify both)
- Type check for `quantizer`/`quantizers` (must be `Quantizer` instances)
- At least one of them must be specified
- Parameter combination consistency check (see table below)
- When `multi_gpu=True`, `quantizer.flag_calibration=True` must hold
Valid parameter combinations:
| quantizers | qep | multi_gpu | calibration_batch_size |
|---|---|---|---|
| Specified | False | False | Specified |
| None | True | False | None |
| None | False | True | None |
| None | False | False | Specified |
| None | False | False | None |
Note
`multi_gpu=True` requires a quantizer with `flag_calibration=True`.
Raises:
| Type | Description |
|---|---|
| `TypeError` | Invalid type for `model_config`, `quantizer`, or `quantizers`. |
| `ValueError` | Invalid parameter combination. |
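The combinations table can be read as a small predicate over the four settings. A plain-Python sketch of that predicate (the names are illustrative, not OneComp's actual `check` implementation):

```python
# Each tuple: (quantizers specified, qep, multi_gpu, batch_size specified)
VALID_COMBINATIONS = [
    (True,  False, False, True),
    (False, True,  False, False),
    (False, False, True,  False),
    (False, False, False, True),
    (False, False, False, False),
]

def is_valid(quantizers, qep, multi_gpu, calibration_batch_size) -> bool:
    """Return True iff the settings match a row of the table above."""
    combo = (quantizers is not None, bool(qep), bool(multi_gpu),
             calibration_batch_size is not None)
    return combo in VALID_COMBINATIONS

print(is_valid(None, True, False, None))                # QEP alone: True
print(is_valid(None, True, True, None))                 # QEP + multi-GPU: False
print(is_valid(["gptq", "jointq"], False, False, 128))  # quantizers + chunked: True
```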
calculate_perplexity ¶
calculate_perplexity(original_model=False, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizer=None)
Calculate the perplexity of the model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the perplexity of the original model. | `False` |
| `dequantized_model` | `bool` | Whether to calculate the perplexity of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the perplexity of the quantized model. | `True` |
| `dataset_name` | `str` | The name of the dataset to use for calculating perplexity. | `'wikitext'` |
| `dataset_config` | `str` | The configuration of the dataset. | `'wikitext-2-raw-v1'` |
| `split` | `str` | The split of the dataset to use. | `'test'` |
| `max_samples` | `int` or `None` | The maximum number of samples to use. | `None` |
| `max_length` | `int` | Maximum length of the sliding window. Uses `model.config.max_position_embeddings` if `None`. 2048 is recommended to match standard paper values. | `2048` |
| `stride` | `int` | Stride of the sliding window. Same as `max_length` (no overlap) if `None`. | `2048` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(original_ppl, dequantized_ppl, quantized_ppl)` |
Note
Evaluating the original or dequantized model requires loading the full model on GPU.
Quantized-model evaluation (quantized_model=True) is
currently supported only for GPTQ and DBF quantizers.
Support for other quantization methods is planned.
Examples:
Single quantizer mode:
>>> original_ppl, dequantized_ppl, quantized_ppl = runner.calculate_perplexity()
Multiple quantizers mode:
>>> _, _, quantized_ppl = runner.calculate_perplexity(quantizer=gptq)
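The interaction of `max_length` and `stride` determines how many evaluation windows the sliding-window loop produces, and therefore the evaluation cost. A minimal sketch of the window arithmetic in pure Python (the helper name is illustrative, independent of OneComp):

```python
def count_windows(num_tokens: int, max_length: int, stride: int) -> int:
    """Number of sliding windows over a token sequence.

    With stride == max_length the windows do not overlap; a smaller
    stride makes windows overlap, which lowers measured perplexity at
    the cost of more forward passes."""
    if num_tokens <= max_length:
        return 1
    # One extra window per stride offset until the end is covered.
    return 1 + -(-(num_tokens - max_length) // stride)  # ceil division

# Non-overlapping (stride == max_length): 10,000 tokens, 2048-token windows
print(count_windows(10_000, 2048, 2048))  # 5 windows
# Overlapping (stride < max_length): many more windows
print(count_windows(10_000, 2048, 512))   # 17 windows
```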
benchmark_perplexity ¶
benchmark_perplexity(original_model=True, dequantized_model=False, quantized_model=True, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048, quantizers=None)
Calculate perplexity for all quantizers at once
Internally calls calculate_perplexity for each quantizer. The original model PPL is calculated only once (on the first iteration).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the perplexity of the original model. | `True` |
| `dequantized_model` | `bool` | Whether to calculate the perplexity of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the perplexity of the quantized model. | `True` |
| `dataset_name` | `str` | The name of the dataset to use for calculating perplexity. | `'wikitext'` |
| `dataset_config` | `str` | The configuration of the dataset. | `'wikitext-2-raw-v1'` |
| `split` | `str` | The split of the dataset to use. | `'test'` |
| `max_samples` | `int` or `None` | The maximum number of samples to use. | `None` |
| `max_length` | `int` | Maximum length of the sliding window. Uses `model.config.max_position_embeddings` if `None`. | `2048` |
| `stride` | `int` | Stride of the sliding window. Same as `max_length` (no overlap) if `None`. | `2048` |
| `quantizers` | `list[Quantizer]` | List of quantizers. Uses `self.quantizers` or `[self.quantizer]` if `None`. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary of PPL values. Keys are `'original'` (when `original_model=True`) and each quantizer's name (e.g., `'GPTQ'`, `'JointQ'`). |
Examples:
>>> runner.run()
>>> ppl_dict = runner.benchmark_perplexity()
>>> print(ppl_dict)
{'original': 5.47, 'GPTQ': 5.72, 'JointQ': 5.68}
Specify quantizers explicitly:
>>> ppl_dict = runner.benchmark_perplexity(quantizers=[gptq, jointq])
Include dequantized model PPL:
>>> ppl_dict = runner.benchmark_perplexity(dequantized_model=True)
calculate_accuracy ¶
calculate_accuracy(original_model=False, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=True, quantizer=None)
Calculate the zero-shot accuracy of the model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the accuracy of the original model. | `False` |
| `dequantized_model` | `bool` | Whether to calculate the accuracy of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the accuracy of the quantized model. | `True` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `True` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple` | `(original_acc, dequantized_acc, quantized_acc)` |
Note
Evaluating the original or dequantized model requires loading the full model on GPU.
Quantized-model evaluation (quantized_model=True) is
currently supported only for GPTQ and DBF quantizers.
Support for other quantization methods is planned.
Examples:
Single quantizer mode:
>>> _, _, quantized_acc = runner.calculate_accuracy()
Multiple quantizers mode:
>>> _, _, quantized_acc = runner.calculate_accuracy(quantizer=gptq)
benchmark_accuracy ¶
benchmark_accuracy(original_model=True, dequantized_model=False, quantized_model=True, tasks=None, batch_size=8, num_fewshot=0, display_results=False, quantizers=None)
Calculate accuracy for all quantizers at once
Internally calls calculate_accuracy for each quantizer. The original model accuracy is calculated only once (on the first iteration).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `original_model` | `bool` | Whether to calculate the accuracy of the original model. | `True` |
| `dequantized_model` | `bool` | Whether to calculate the accuracy of the dequantized model. | `False` |
| `quantized_model` | `bool` | Whether to calculate the accuracy of the quantized model. | `True` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `False` |
| `quantizers` | `list[Quantizer]` | List of quantizers. Uses `self.quantizers` or `[self.quantizer]` if `None`. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary of accuracy values. Keys are `'original'` (when `original_model=True`) and each quantizer's name (e.g., `'GPTQ'`, `'JointQ'`). |
Examples:
>>> runner.run()
>>> acc_dict = runner.benchmark_accuracy()
>>> print(acc_dict)
{'original': {...}, 'GPTQ': {...}, 'JointQ': {...}}
Specify quantizers explicitly:
>>> acc_dict = runner.benchmark_accuracy(quantizers=[gptq, jointq])
Include dequantized model accuracy:
>>> acc_dict = runner.benchmark_accuracy(dequantized_model=True)
print_quantization_results ¶
Log quantization results.
Formats and logs the quantizer results. The following information is output for each layer:
- Quantization time (seconds)
- Output squared error (only if value exists)
- Mean output squared error (only if value exists)
- Weight squared error (only if value exists)
- Mean weight squared error (only if value exists)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.print_quantization_results()
Multiple quantizers mode:
>>> runner.print_quantization_results(quantizer=gptq)
save_quantization_statistics ¶
Save the quantization statistics
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | File path to save to. | required |
| `quantizer` | `Quantizer` | Quantizer whose statistics to save. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_quantization_statistics("quantization_statistics.json")
Multiple quantizers mode:
>>> runner.save_quantization_statistics("gptq_statistics.json", quantizer=gptq)
save_quantization_results ¶
Save the quantization results to a file
Save quantization results (QuantizationResult objects) to a file. The saved data includes dequantized weights, scales, zero points, integer assignments, and other quantization parameters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to save the quantization results. The `.pt` extension is recommended. | required |
| `quantizer` | `Quantizer` | Quantizer whose results to save. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_quantization_results("quantization_results.pt")
Multiple quantizers mode:
>>> runner.save_quantization_results("gptq_results.pt", quantizer=gptq)
save_dequantized_model ¶
Save the dequantized model to the specified path
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to save the dequantized model. | required |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Examples:
Single quantizer mode:
>>> runner.save_dequantized_model("./dequantized_model")
Multiple quantizers mode:
>>> runner.save_dequantized_model("./dequantized_model_gptq", quantizer=gptq)
save_quantized_model ¶
Save the quantized model to the specified directory
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str` | The path to save the quantized model. | required |
| `pack_weights` | `bool` | Whether to pack quantized weights into a more memory- and storage-efficient representation. | `True` |
Examples:
Single quantizer mode:
>>> runner.save_quantized_model("./quantized_model")
create_quantized_model ¶
Create a quantized model from quantization results.
Loads the base model on CPU, replaces Linear layers with quantized
inference layers (e.g. GPTQLinear), and attaches quantization
config to model.config.
Must be called after run() (i.e., quantizer.results must
be populated).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pack_weights` | `bool` | Whether to pack quantized weights for a memory-efficient representation. | `True` |
| `quantizer` | `Quantizer` | The quantizer to use. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
| `use_gemlite` | `bool` or `None` | Whether to use GemLite for inference layers. Set to `False` when saving to avoid extra params in safetensors. `None` uses the quantizer default. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[nn.Module, PreTrainedTokenizer]` | `(quantized_model, tokenizer)` |
Examples:
>>> model, tokenizer = runner.create_quantized_model()
save_quantized_model_pt ¶
Save the quantized model as a PyTorch .pt file.
Use this method to save models that include post-processing
modifications (e.g. LoRA adapters from PostProcessLoraSFT).
The entire model object is serialized with torch.save,
preserving custom module types such as LoRAGPTQLinear.
For models without post-processing, prefer
save_quantized_model which uses the HF-compatible
safetensors format.
The saved directory contains:
- model.pt: The model (torch.save)
- Tokenizer files (via save_pretrained)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str` | The path to save the model. | required |
See Also
`onecomp.load_quantized_model_pt` to load models saved by this method.
Examples:
>>> runner.save_quantized_model_pt("./quantized_model_pt")
analyze_cumulative_error ¶
analyze_cumulative_error(layer_keywords=None, plot_path=None, json_path=None, batch_keywords=False, quantizer=None)
Analyze cumulative quantization error for each linear layer.
Cumulative error: ||W_orig X_orig - W_quant X_quant||^2_F
Note
Must be used after calling the run() method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `layer_keywords` | `list[str]` or `None` | List of keywords to filter layers. Each keyword is analyzed and plotted separately. Default: `["mlp.down_proj"]`. Example: `["q_proj", "k_proj"]` | `None` |
| `plot_path` | `str` or `None` | Base path to save plots. The keyword is inserted before the extension, e.g. `"error.png"` -> `"error_mlp.down_proj.png"`. | `None` |
| `json_path` | `str` or `None` | Path to save results as a JSON file, e.g. `"cumulative_error.json"`. | `None` |
| `batch_keywords` | `bool` | If `True`, process all keywords in a single forward pass; faster, but uses more CPU memory because all target layers' outputs are stored simultaneously. If `False`, process each keyword separately with a model reload per keyword; uses less CPU memory but incurs overhead from repeated model loading and forward passes. | `False` |
| `quantizer` | `Quantizer` | The quantizer. Uses `self.quantizer` if `None`. Specify explicitly when using `quantizers` mode. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict` | `keyword -> {layer_name -> cumulative squared error}` |
Examples:
Single quantizer mode:
>>> results = runner.analyze_cumulative_error()
>>> results = runner.analyze_cumulative_error(plot_path="cumulative_error.png")
Multiple quantizers mode:
>>> results = runner.analyze_cumulative_error(quantizer=jointq)
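The cumulative error ||W_orig X_orig - W_quant X_quant||^2_F can be computed directly for a toy case. A pure-Python sketch (illustrative only, not the OneComp implementation; for simplicity the same inputs X are fed to both weight matrices, whereas the real analysis propagates original and quantized activations separately):

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frob_sq_diff(a, b):
    """Squared Frobenius norm of (a - b)."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

W_orig  = [[1.0, 2.0], [3.0, 4.0]]
W_quant = [[1.0, 2.0], [3.0, 3.5]]  # one weight rounded by quantization
X = [[1.0, 0.0], [0.0, 1.0]]        # identity inputs for a minimal example
err = frob_sq_diff(matmul(W_orig, X), matmul(W_quant, X))
print(err)  # (4.0 - 3.5)^2 = 0.25
```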
prepare_calibration_dataset ¶
Prepare calibration data for quantization methods such as GPTQ.
See utils.calibration.prepare_calibration_dataset for details.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `device` | `device` | Device to place tensors on (CPU or GPU). | required |
Returns:
| Type | Description |
|---|---|
| `dict` | Input dictionary for the model: `"input_ids"`, a tensor of shape `(num_chunks, max_length)`, and `"attention_mask"`, a tensor of shape `(num_chunks, max_length)`. |