# Quick Start
This guide walks you through quantizing your first LLM with Fujitsu One Compression (OneComp).
## The Fastest Way: `auto_run`
Runner.auto_run handles everything -- model loading, AutoBit mixed-precision
quantization with QEP, evaluation (perplexity + zero-shot accuracy), and saving
the quantized model:
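For example (using the TinyLlama model whose default save path is mentioned below; any Hugging Face model ID or local path works):

```python
from onecomp import Runner

# One call: load, quantize (AutoBit + QEP), evaluate, and save
Runner.auto_run(model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
```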
That's it. The target bitwidth is estimated from available VRAM, and the quantized model is saved to `TinyLlama-1.1B-...-autobit-<X>bit/` by default.
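The estimator itself is internal to OneComp, but the basic idea can be sketched as picking the largest candidate bitwidth whose packed weights fit the VRAM budget. This is an illustrative simplification, not OneComp's actual heuristic, which may also account for activations and runtime overhead:

```python
def estimate_wbits(n_params: float, total_vram_gb: float,
                   candidates=(8, 4, 3, 2)) -> int:
    """Pick the largest candidate bitwidth whose packed weights fit in VRAM.

    Illustrative heuristic only -- not OneComp's actual estimator.
    """
    budget_bytes = total_vram_gb * 1024**3
    for bits in candidates:  # candidates are sorted high to low
        if n_params * bits / 8 <= budget_bytes:
            return bits
    return min(candidates)  # fall back to the smallest bitwidth

# A 7B-parameter model fits an 8 GB budget at 8 bits (7e9 bytes < 8 GiB)
print(estimate_wbits(7e9, 8))
```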
## `auto_run` Parameters
| Parameter | Default | Description |
|---|---|---|
| `model_id` | (required) | Hugging Face model ID or local path |
| `wbits` | `None` | Target bitwidth. When `None`, estimated from VRAM |
| `total_vram_gb` | `None` | VRAM budget in GB. When `None`, detected from GPU |
| `groupsize` | `128` | GPTQ group size (`-1` to disable) |
| `device` | `"cuda:0"` | Device for computation |
| `qep` | `True` | Enable QEP (Quantization Error Propagation) |
| `evaluate` | `True` | Calculate perplexity and zero-shot accuracy |
| `eval_original_model` | `False` | Also evaluate the original (unquantized) model |
| `save_dir` | `"auto"` | Save directory (`"auto"` = derived from model name; `None` to skip) |
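The `groupsize` parameter controls how many consecutive weights share one set of quantization parameters; smaller groups track the weight distribution more closely at the cost of extra metadata. A self-contained illustration of group-wise asymmetric integer quantization (not OneComp's implementation):

```python
def quantize_groupwise(weights, wbits=4, groupsize=4):
    """Quantize a flat weight list, giving each group its own scale and zero-point."""
    qmax = 2**wbits - 1
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # avoid div-by-zero for constant groups
        # Round each weight to the nearest representable level, then dequantize
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out

# Two groups with very different ranges each get their own scale
w = [0.1, 0.9, -0.5, 0.3, 10.0, 12.0, 11.0, 10.5]
print(quantize_groupwise(w, wbits=4, groupsize=4))
```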
## Examples
```python
from onecomp import Runner

# AutoBit with VRAM auto-estimation (default)
Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf")

# Specify VRAM budget
Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf", total_vram_gb=8)

# Fixed 3-bit quantization, no QEP, skip saving
Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    wbits=3,
    qep=False,
    save_dir=None,
)

# Custom save directory, skip evaluation
Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    save_dir="./my_quantized_model",
    evaluate=False,
)
```
## Step-by-step Workflow
For full control over each component, use the manual configuration approach.
The workflow involves three components:
- `ModelConfig` -- specifies which model to quantize
- `Quantizer` (e.g., `GPTQ`) -- defines the quantization method and parameters
- `Runner` -- orchestrates the quantization pipeline
```python
from onecomp import ModelConfig, Runner, GPTQ, setup_logger

setup_logger()

model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=model_config, quantizer=gptq)
runner.run()
```
## Evaluating the Quantized Model
After quantization, measure the impact on model quality:
```python
# Perplexity (lower is better).
# calculate_perplexity() returns a 3-tuple: (original, dequantized, quantized).
# By default, only the quantized value is computed (the others are None).
_, _, quantized_ppl = runner.calculate_perplexity()
print(f"Quantized: {quantized_ppl:.2f}")

# To also evaluate the original model, pass original_model=True
original_ppl, _, quantized_ppl = runner.calculate_perplexity(original_model=True)
print(f"Original:  {original_ppl:.2f}")
print(f"Quantized: {quantized_ppl:.2f}")

# Zero-shot accuracy (same 3-tuple pattern)
_, _, quantized_acc = runner.calculate_accuracy()
```
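As a reminder of what the perplexity number means: it is the exponential of the average per-token negative log-likelihood, so a lower value indicates the model assigns higher probability to the evaluation text. A self-contained illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over a token stream."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning probability 0.25 to every token has perplexity ~4:
# on average it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```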
> **Note**
>
> - Evaluating the original or dequantized model requires loading the full model on the GPU.
> - Quantized-model evaluation is currently supported only for the GPTQ and DBF quantizers; support for other methods is planned.
## Using QEP (Quantization Error Propagation)
QEP compensates for quantization error that propagates across layers, improving quantization quality, especially at lower bit-widths. It is enabled by passing `qep=True` to the `Runner`.
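For example, enabling QEP when quantizing at a low bitwidth (the constructor arguments mirror the `Runner` usage shown in the fine-grained example below):

```python
from onecomp import ModelConfig, Runner, GPTQ

model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
gptq = GPTQ(wbits=3, groupsize=128)

# qep=True compensates each layer for quantization error introduced upstream
runner = Runner(model_config=model_config, quantizer=gptq, qep=True)
runner.run()
```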
For fine-grained control over QEP, use `QEPConfig`:

```python
from onecomp import QEPConfig

qep_config = QEPConfig(
    percdamp=0.01,
    perccorr=0.5,
)
runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()
```
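In GPTQ-style solvers, a `percdamp`-like parameter typically adds a fraction of the mean Hessian diagonal to every diagonal entry before inversion, which stabilizes the solve. Whether OneComp uses exactly this scheme is an assumption, but the idea can be sketched as:

```python
def damp_hessian_diagonal(hessian_diag, percdamp=0.01):
    """Add percdamp * mean(diag) to each diagonal entry (GPTQ-style damping).

    Illustrative only; OneComp's internal use of percdamp may differ.
    """
    damp = percdamp * sum(hessian_diag) / len(hessian_diag)
    return [d + damp for d in hessian_diag]

# mean of the diagonal is 2.0, so damp = 0.1 * 2.0 = 0.2 is added to each entry
print(damp_hessian_diagonal([1.0, 2.0, 3.0], percdamp=0.1))
```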
## Saving and Loading

### Save a dequantized model (FP16 weights)
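A plausible shape for this call, with the caveat that the method name `save_dequantized_model` is a hypothetical placeholder rather than a documented OneComp API:

```python
# Hypothetical method name (an assumption, not from this guide) --
# writes FP16 weights reconstructed from the quantized model.
runner.save_dequantized_model("./output/dequantized_model")
```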
### Save a quantized model (packed integer weights)
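Similarly hedged: `save_quantized_model` below is a hypothetical name; the target path matches the one used by `load_quantized_model` in the next subsection:

```python
# Hypothetical method name (an assumption, not from this guide) --
# writes packed integer weights plus quantization metadata.
runner.save_quantized_model("./output/quantized_model")
```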
### Load a saved quantized model
```python
from onecomp import load_quantized_model

model, tokenizer = load_quantized_model("./output/quantized_model")
```
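Assuming the returned `model` and `tokenizer` follow the standard Hugging Face interfaces, generation then works as usual:

```python
inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```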
## Next Steps
- CLI Reference -- full CLI options and usage
- Configuration -- detailed explanation of `ModelConfig`, `QEPConfig`, and `Runner` parameters
- Examples -- more usage patterns, including multi-GPU and chunked calibration
- Algorithms -- learn about the quantization algorithms available in OneComp