Basic Usage

This guide covers the core workflow of Fujitsu One Compression (OneComp): configure a model, select a quantizer, run quantization, and evaluate results.

Quick Path: `Runner.auto_run()`

auto_run runs the full pipeline in a single call, covering VRAM-based bitwidth estimation, AutoBit mixed-precision quantization with QEP, evaluation, and saving:

from onecomp import Runner

runner = Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf")

This automatically:

- Loads the model and tokenizer
- Estimates the target bitwidth from available VRAM
- Quantizes with AutoBit (ILP-based mixed-precision) + QEP
- Evaluates perplexity and zero-shot accuracy
- Saves the quantized model to disk

The returned runner instance gives access to the quantization results for further analysis. See the Quick Start for auto_run parameters, or the CLI for the command-line equivalent.
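As a rough illustration of what a VRAM-based bitwidth estimate involves, the sketch below derives an average bits-per-weight budget from free memory and parameter count. The function name `estimate_target_bits`, the `reserve_fraction` headroom, and the clamping range are assumptions made for this example. They are not taken from OneComp, whose actual heuristic may differ.

```python
def estimate_target_bits(free_vram_bytes, num_params, reserve_fraction=0.2,
                         min_bits=2.0, max_bits=16.0):
    """Hypothetical helper: average bits per weight that fit in free VRAM.

    reserve_fraction keeps headroom for activations and other runtime buffers.
    """
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    usable_bits = free_vram_bytes * (1.0 - reserve_fraction) * 8
    bits_per_weight = usable_bits / num_params
    # Clamp to a range a quantizer could plausibly realize.
    return max(min_bits, min(max_bits, bits_per_weight))


# A 7B-parameter model on a GPU with 4 GiB free.
bits = estimate_target_bits(4 * 1024**3, 7_000_000_000)
print(f"{bits:.2f} bits per weight")
```

A fractional result such as this is where mixed precision becomes useful: some layers can take more bits and others fewer while the average stays within budget.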

Detailed Workflow

For full control over each component, use the manual configuration approach.

ModelConfig ──┐
              ├──► Runner.run() ──► Evaluate / Save
Quantizer ────┘

Every quantization session follows the same pattern:

1. Create a ModelConfig to specify which model to quantize
2. Create a Quantizer (e.g., GPTQ, RTN, DBF) with the desired parameters
3. Pass both to a Runner and call runner.run()
4. Evaluate or save the result

Step 1: Configure the Model

from onecomp import ModelConfig

model_config = ModelConfig(
    model_id="meta-llama/Llama-2-7b-hf",
    device="cuda:0",
)

Parameter	Description	Default
`model_id`	Hugging Face Hub model ID	—
`path`	Local path to a saved model	—
`dtype`	Data type (`"float16"`, `"float32"`)	`"float16"`
`device`	Device (`"cpu"`, `"cuda"`, `"auto"`)	`"auto"`

You must provide either model_id or path.

Step 2: Choose a Quantizer

from onecomp import GPTQ

gptq = GPTQ(wbits=4, groupsize=128)

Available quantizers and their typical parameters:

Quantizer	Key Parameters	Calibration Required
`AutoBitQuantizer`	`target_bit`, `assignment_strategy`	Yes
`GPTQ`	`wbits`, `groupsize`, `sym`	Yes
`RTN`	`wbits`, `groupsize`, `sym`	No
`DBF`	`target_bits`, `iters`	Yes
`JointQ`	`bits`, `group_size`	Yes
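To make the `wbits`, `groupsize`, and `sym` parameters concrete, here is a minimal pure-Python sketch of round-to-nearest quantization over a flat list of weights. It follows the common convention for these parameters and is an assumption about their meaning, not OneComp's RTN implementation; `rtn_quantize` is a hypothetical name.

```python
def rtn_quantize(weights, wbits=4, groupsize=4, sym=True):
    """Quantize then dequantize a flat weight list, one scale per group."""
    if wbits < 2:
        raise ValueError("this sketch assumes wbits >= 2")
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        if sym:
            # Symmetric: signed grid centred on zero, no zero-point.
            qmax = 2 ** (wbits - 1) - 1
            qmin = -qmax - 1
            scale = max(abs(w) for w in group) / qmax or 1.0
            zero = 0
        else:
            # Asymmetric: unsigned grid spanning the group's [min, max].
            qmin, qmax = 0, 2 ** wbits - 1
            lo, hi = min(group), max(group)
            scale = (hi - lo) / qmax or 1.0
            zero = round(-lo / scale)
        for w in group:
            q = min(qmax, max(qmin, round(w / scale) + zero))
            out.append((q - zero) * scale)  # dequantized value
    return out


weights = [0.1, -0.4, 0.25, 0.8, -1.2, 0.05, 0.6, -0.3]
print(rtn_quantize(weights, wbits=4, groupsize=4, sym=True))
```

Smaller groups give each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales.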

All quantizers share common parameters:

Parameter	Description	Default
`num_layers`	Max layers to quantize (None = all)	`None`
`calc_quant_error`	Calculate quantization error per layer	`False`
`exclude_layer_names`	Layer names to skip (exact match)	`["lm_head"]`
`include_layer_keywords`	Only quantize layers matching keywords	`None`
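The interaction between these filters can be sketched as a small selection function. This is only an illustration under stated assumptions: `select_layers` is a hypothetical helper, keyword matching is assumed to be a substring test, and the `num_layers` cap is assumed to apply after the other two filters. OneComp's own ordering and matching rules may differ.

```python
def select_layers(layer_names, num_layers=None,
                  exclude_layer_names=("lm_head",),
                  include_layer_keywords=None):
    """Hypothetical filter mirroring the common quantizer parameters."""
    selected = []
    for name in layer_names:
        if name in exclude_layer_names:  # exact match only
            continue
        if include_layer_keywords is not None and not any(
            keyword in name for keyword in include_layer_keywords
        ):
            continue  # assumed: a keyword must appear as a substring
        selected.append(name)
    # Cap the count last; None keeps every remaining layer.
    return selected if num_layers is None else selected[:num_layers]


names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]
print(select_layers(names, include_layer_keywords=["q_proj"]))
```

Because exclusion is an exact match, a name such as `model.lm_head` would not be skipped by the default `["lm_head"]` entry in this sketch.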

Step 3: Run Quantization

from onecomp import Runner, setup_logger

setup_logger()  # Optional: enable logging output

runner = Runner(model_config=model_config, quantizer=gptq)
runner.run()

Step 4: Evaluate

Perplexity

calculate_perplexity() returns a 3-tuple (original, dequantized, quantized). By default, only the quantized model is evaluated:

_, _, quantized_ppl = runner.calculate_perplexity()
print(f"Quantized: {quantized_ppl:.2f}")

# To also evaluate the original model:
original_ppl, _, quantized_ppl = runner.calculate_perplexity(original_model=True)
print(f"Original:  {original_ppl:.2f}")
print(f"Quantized: {quantized_ppl:.2f}")

Note

Evaluating the original or dequantized model requires loading the full model on GPU.
Quantized-model evaluation (quantized_model=True) is supported only for quantizers that implement create_quantized_model() (GPTQ, DBF, AutoBitQuantizer). For other quantizers, evaluation automatically falls back to the dequantized (FP16) model.
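For reference, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows only that definition; `perplexity` is a hypothetical standalone function, and the dataset and context length that OneComp uses are not covered here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```

Read this way, a quantized perplexity close to the original one means the quantized model is almost as certain about the reference text as the full-precision model.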

Zero-shot Accuracy

_, _, quantized_acc = runner.calculate_accuracy()

Quantization Statistics

runner.print_quantization_results()
runner.save_quantization_statistics("stats.json")

Step 5: Save the Model

# Save dequantized weights (FP16, compatible with any HF pipeline)
runner.save_dequantized_model("./output/dequantized")

# Save quantized model (packed integer weights, compatible with vLLM)
runner.save_quantized_model("./output/quantized")
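To show what "packed integer weights" means in practice, the sketch below stores two 4-bit codes per byte and unpacks them again. The nibble order and the helper names `pack_int4` and `unpack_int4` are assumptions for illustration. The on-disk layout that OneComp writes, and that vLLM reads, may be different.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4; count drops the padding nibble, if any."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


codes = [3, 15, 0, 7, 9]
packed = pack_int4(codes)
assert unpack_int4(packed, len(codes)) == codes
print(len(packed), "bytes packed vs", 2 * len(codes), "bytes as FP16")
```

The comparison ignores per-group scales and zero-points, which a real packed checkpoint also has to store.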

Quantizer feature support

save_quantized_model(), create_quantized_model(), and quantized-model PPL/ACC evaluation require the quantizer to implement get_quant_config() and create_inference_layer(). Currently only GPTQ, DBF, and AutoBitQuantizer support these features.

Quantizer	Save quantized model	Quantized PPL/ACC evaluation	Fallback
`GPTQ`	Yes	Yes	—
`DBF`	Yes	Yes	—
`AutoBitQuantizer`	Yes	Yes	—
`RTN`	—	—	Dequantized (FP16) model
`JointQ`	—	—	Dequantized (FP16) model
`QUIP`	—	—	Dequantized (FP16) model
`CQ`	—	—	Dequantized (FP16) model
`ARB`	—	—	Dequantized (FP16) model
`QBB`	—	—	Dequantized (FP16) model
`Onebit`	—	—	Dequantized (FP16) model

For unsupported quantizers:

- PPL/ACC evaluation: automatically falls back to the dequantized (FP16) model. No error is raised.
- Saving: use save_dequantized_model() (FP16) or save_quantization_results() instead.
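The support rule above amounts to a capability check on the quantizer. The sketch below expresses it with stand-in classes: `PackedQuantizer`, `PlainQuantizer`, `supports_quantized_model`, and `pick_eval_target` are all hypothetical names, and only the two method names come from the text. It is not OneComp's internal dispatch logic.

```python
REQUIRED_METHODS = ("get_quant_config", "create_inference_layer")


def supports_quantized_model(quantizer):
    """True if the quantizer exposes both hooks named in the text."""
    return all(
        callable(getattr(quantizer, name, None)) for name in REQUIRED_METHODS
    )


def pick_eval_target(quantizer):
    """Return which model variant an evaluation would use in this sketch."""
    if supports_quantized_model(quantizer):
        return "quantized"
    return "dequantized"  # silent fallback, no error raised


class PackedQuantizer:  # hypothetical stand-in for a GPTQ-like quantizer
    def get_quant_config(self):
        return {"bits": 4}

    def create_inference_layer(self, layer):
        return layer


class PlainQuantizer:  # hypothetical stand-in for an RTN-like quantizer
    pass


print(pick_eval_target(PackedQuantizer()))
print(pick_eval_target(PlainQuantizer()))
```

A check of this kind can be useful in scripts that loop over several quantizers and need to decide between saving a packed model and saving dequantized weights.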

Enabling QEP

QEP (Quantization Error Propagation) adjusts weights before quantization to compensate for error propagation across layers. To enable it, set qep=True on the Runner:

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
)
runner.run()

See QEP Algorithm for the theory behind QEP.
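For some intuition about why adjusting weights ahead of quantization can help, the toy below uses a single scalar weight. It is not the QEP algorithm. It is a one-dimensional least-squares correction chosen purely for illustration, and `compensate_weight` and `output_error` are hypothetical names.

```python
def compensate_weight(w, clean_inputs, noisy_inputs):
    """Least-squares weight w' minimizing sum((w*x - w'*x_hat)**2)."""
    num = sum(x * xh for x, xh in zip(clean_inputs, noisy_inputs))
    den = sum(xh * xh for xh in noisy_inputs)
    return w * num / den


def output_error(w_ref, w_used, clean_inputs, noisy_inputs):
    """Squared error between the reference output and the actual output."""
    return sum(
        (w_ref * x - w_used * xh) ** 2
        for x, xh in zip(clean_inputs, noisy_inputs)
    )


# Inputs to one layer before and after the upstream layers were quantized.
clean = [1.0, -2.0, 0.5, 3.0]
noisy = [0.9, -1.8, 0.4, 2.7]  # upstream error shrank the activations
w = 0.75

w_adj = compensate_weight(w, clean, noisy)
before = output_error(w, w, clean, noisy)
after = output_error(w, w_adj, clean, noisy)
print(f"error without adjustment: {before:.6f}, with adjustment: {after:.6f}")
```

In this toy the adjusted weight absorbs most of the upstream error before any rounding of the weight itself takes place, which is the general idea the section describes.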