Examples¶
This page demonstrates common usage patterns beyond the basic workflow.
One-liner with auto_run¶
The simplest way to quantize a model:
from onecomp import Runner
# Default: AutoBit (VRAM auto-estimation, ILP mixed-precision) + QEP
Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf")
# Specify VRAM budget
Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf", total_vram_gb=8)
# Fixed 4-bit, custom save directory
Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    wbits=4,
    save_dir="./llama2-7b-gptq-4bit",
)
# Without QEP, skip evaluation
Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    qep=False,
    evaluate=False,
)
# Also evaluate the original model for comparison
Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    eval_original_model=True,
)
CLI¶
The onecomp command provides the same functionality from the terminal:
# Default (AutoBit with VRAM auto-estimation + QEP)
onecomp meta-llama/Llama-2-7b-hf
# Specify VRAM budget
onecomp meta-llama/Llama-2-7b-hf --total-vram-gb 8
# Fixed 4-bit, custom save directory
onecomp meta-llama/Llama-2-7b-hf --wbits 4 --save-dir ./llama2-7b-gptq-4bit
# Without QEP, skip evaluation
onecomp meta-llama/Llama-2-7b-hf --no-qep --no-eval
# Also evaluate the original model
onecomp meta-llama/Llama-2-7b-hf --eval-original
# Skip saving
onecomp meta-llama/Llama-2-7b-hf --save-dir none
See the CLI Reference for all options.
GPTQ with QEP (3-bit)¶
Quantize a model using GPTQ at 3-bit precision with QEP to improve quality:
from onecomp import ModelConfig, Runner, GPTQ, setup_logger
setup_logger()
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
gptq = GPTQ(wbits=3)
runner = Runner(model_config=model_config, quantizer=gptq, qep=True)
runner.run()
_, _, quantized_ppl = runner.calculate_perplexity()
print(f"Quantized model perplexity: {quantized_ppl}")
GPTQ without QEP¶
Standard GPTQ quantization without error propagation:
gptq = GPTQ(wbits=3)
runner = Runner(model_config=model_config, quantizer=gptq, qep=False)
runner.run()
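To build intuition for what wbits and groupsize control, here is a minimal asymmetric round-to-nearest quantizer for a single weight row. Note that this is plain RTN, not GPTQ: GPTQ additionally compensates rounding error using second-order (Hessian) statistics, but the bit-width and grouping semantics are the same.

```python
# Minimal asymmetric round-to-nearest (RTN) quantizer for one weight row.
# Illustrates wbits (number of levels) and groupsize (scale granularity).
def rtn_quantize(row, wbits=3, groupsize=4):
    qmax = 2**wbits - 1
    out = []
    for i in range(0, len(row), groupsize):
        group = row[i:i + groupsize]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0   # avoid div-by-zero for flat groups
        zero = round(-lo / scale)         # zero-point (asymmetric)
        for w in group:
            q = max(0, min(qmax, round(w / scale) + zero))  # clamp to range
            out.append((q - zero) * scale)                   # dequantize
    return out

weights = [0.12, -0.08, 0.33, 0.05, -0.4, 0.2, 0.01, -0.15]
print(rtn_quantize(weights, wbits=3, groupsize=4))
```

Smaller groups give each group its own scale, so outliers perturb fewer weights; this is why groupsize=128 typically beats per-channel scaling at low bit widths.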
JointQ (4-bit, groupsize=128)¶
Quantize a model using JointQ, which jointly optimizes weight assignments and scale parameters:
from onecomp import JointQ, ModelConfig, Runner, setup_logger
setup_logger()
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
jointq = JointQ(bits=4, group_size=128)
runner = Runner(model_config=model_config, quantizer=jointq, qep=False)
runner.run()
original_ppl, dequantized_ppl, _ = runner.calculate_perplexity(
    original_model=True, dequantized_model=True, quantized_model=False,
)
print(f"Original model perplexity: {original_ppl}")
print(f"Dequantized model perplexity: {dequantized_ppl}")
Chunked Calibration (Large-scale Data)¶
When using large calibration datasets that don't fit in GPU memory, use chunked calibration.
The calibration_batch_size parameter splits the forward pass into smaller batches while
accumulating statistics exactly:
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    max_length=2048,
    num_calibration_samples=1024,
    calibration_batch_size=128,
)
runner.run()
Info
Chunked calibration is mathematically exact -- it accumulates \(X^T X\) across batches without approximation.
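This exactness is easy to verify on toy data: summing per-batch \(X^T X\) contributions reproduces the full-batch Gram matrix, because matrix accumulation is just addition of per-sample outer products (pure-Python sketch, no numpy):

```python
# Accumulating X^T X over batches equals computing it on the full
# calibration matrix at once.
def xtx(rows, dim):
    h = [[0.0] * dim for _ in range(dim)]
    for x in rows:
        for i in range(dim):
            for j in range(dim):
                h[i][j] += x[i] * x[j]  # outer-product accumulation
    return h

def add(a, b):
    return [[ai + bi for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

data = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]]
full = xtx(data, 2)                                  # all samples at once
chunked = add(xtx(data[:2], 2), xtx(data[2:], 2))    # two "batches"
assert full == chunked  # exact, not approximate
```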
Multi-GPU Quantization¶
Distribute layer-wise quantization across multiple GPUs:
runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    multi_gpu=True,
)
runner.run()
# Or specify particular GPUs
runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    multi_gpu=True,
    gpu_ids=[0, 2, 3],
)
runner.run()
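Because GPTQ-style quantization treats each layer independently once calibration statistics are collected, layers can be processed in parallel. One plausible scheduling scheme is simple round-robin assignment; this is an illustration only, and onecomp's actual scheduler may differ:

```python
# Illustrative round-robin assignment of layers to GPUs (not necessarily
# onecomp's actual scheduling strategy).
def assign_layers(num_layers, gpu_ids):
    return {layer: gpu_ids[layer % len(gpu_ids)] for layer in range(num_layers)}

plan = assign_layers(num_layers=6, gpu_ids=[0, 2, 3])
print(plan)  # → {0: 0, 1: 2, 2: 3, 3: 0, 4: 2, 5: 3}
```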
Comparing Multiple Quantizers¶
Run multiple quantizers in a single session with shared calibration data:
from onecomp import GPTQ, JointQ
import torch
gptq = GPTQ(wbits=4, groupsize=128, calc_quant_error=True)
jointq = JointQ(bits=4, group_size=128, calc_quant_error=True,
                device=torch.device(0))
runner = Runner(
    model_config=model_config,
    quantizers=[gptq, jointq],
    max_length=2048,
    num_calibration_samples=1024,
    calibration_batch_size=128,
)
runner.run()
# Benchmark perplexity across all quantizers
ppl_dict = runner.benchmark_perplexity()
print(ppl_dict)
# {'original': 5.47, 'GPTQ': 5.72, 'JointQ': 5.68}
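To interpret these numbers: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better, and the quantizer whose PPL sits closest to the original preserves the model most faithfully. A tiny worked example with hypothetical per-token NLLs:

```python
import math

# Perplexity = exp(mean per-token negative log-likelihood), in nats.
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

nlls = [1.6, 1.8, 1.7, 1.9]  # hypothetical per-token NLLs
print(round(perplexity(nlls), 2))  # → 5.75
```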
Rotation Preprocessing + RTN¶
Apply SpinQuant-style rotation preprocessing before quantization to reduce quantization error:
from onecomp import ModelConfig, Runner, RTN, prepare_rotated_model, setup_logger
setup_logger()
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
# Step 1: Rotation preprocessing
rotated_config = prepare_rotated_model(
    model_config=model_config,
    save_directory="./rotated_model",
    seed=0,
    wbits=3,
    groupsize=-1,
    sym=False,
)
# Step 2: Quantize the rotated model (wbits/groupsize/sym must match Step 1)
rtn = RTN(wbits=3, groupsize=-1, sym=False)
runner = Runner(model_config=rotated_config, quantizer=rtn)
runner.run()
original_ppl, dequantized_ppl, _ = runner.calculate_perplexity(
    original_model=True, dequantized_model=True, quantized_model=False
)
print(f"Original model perplexity: {original_ppl}")
print(f"Dequantized model perplexity: {dequantized_ppl}")
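The reason rotation is "free" in terms of model quality: for any orthogonal matrix R, replacing W with WR and the input x with R^T x leaves the layer output unchanged (WR·R^T x = Wx), while the rotated weights are typically flatter and easier to quantize. A tiny 2-D sanity check of that identity:

```python
import math

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

t = math.pi / 4
R = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]  # orthogonal
Rt = [[R[0][0], R[1][0]], [R[0][1], R[1][1]]]                  # R transpose

W = [[2.0, 0.0], [0.0, 0.5]]
x = [1.0, 3.0]

y = matvec(W, x)                      # original layer output
WR = [matvec(Rt, row) for row in W]   # rows of W @ R
y_rot = matvec(WR, matvec(Rt, x))     # rotated weights, counter-rotated input
assert all(abs(a - b) < 1e-9 for a, b in zip(y, y_rot))
```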
Rotation Preprocessing + GPTQ with Save/Load¶
Full pipeline including save and load of rotation-preprocessed quantized models:
from onecomp import (
    ModelConfig, Runner, GPTQ,
    prepare_rotated_model, load_quantized_model, setup_logger,
)
setup_logger()
# Step 1: Rotation preprocessing
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
rotated_config = prepare_rotated_model(
    model_config=model_config,
    save_directory="./rotated_model",
    seed=0,
    wbits=4,
    groupsize=128,
)
# Step 2: Quantize and save
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=rotated_config, quantizer=gptq)
runner.run()
runner.save_quantized_model("./quantized_model")
# Step 3: Load (Hadamard hooks are auto-registered via "rotated: true" in config.json)
model, tokenizer = load_quantized_model("./quantized_model")
See Pre-Process API for full parameter documentation.
Saving and Loading Quantized Models¶
Save the quantized model¶
# Save with packed integer weights (compatible with vLLM)
runner.save_quantized_model("./output/my_quantized_model")
# Or save dequantized FP16 weights
runner.save_dequantized_model("./output/my_dequantized_model")
Load a saved quantized model¶
from onecomp import load_quantized_model
model, tokenizer = load_quantized_model("./output/my_quantized_model")
# Use like any Hugging Face model
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
LoRA SFT: Accuracy Recovery¶
Quantize a model and apply LoRA SFT to recover accuracy lost during quantization:
from onecomp import GPTQ, ModelConfig, Runner, PostProcessLoraSFT, setup_logger
setup_logger()
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="cuda:0",
)
gptq = GPTQ(wbits=4, groupsize=128)
post_process = PostProcessLoraSFT(
    dataset_name="wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    train_split="train",
    text_column="text",
    max_train_samples=256,
    max_length=512,
    epochs=2,
    batch_size=2,
    gradient_accumulation_steps=4,
    lr=1e-4,
    lora_r=16,
    lora_alpha=32,
)
runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    post_processes=[post_process],
)
runner.run()
original_ppl, _, quantized_ppl = runner.calculate_perplexity(
    original_model=True, quantized_model=True,
)
print(f"Original PPL: {original_ppl:.4f}")
print(f"Quantized + LoRA SFT PPL: {quantized_ppl:.4f}")
LoRA SFT: Knowledge Injection¶
Inject custom knowledge into a quantized model using a JSONL file:
from onecomp import GPTQ, ModelConfig, Runner, PostProcessLoraSFT
model_config = ModelConfig(model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T", device="cuda:0")
gptq = GPTQ(wbits=4, groupsize=128)
post_process = PostProcessLoraSFT(
    data_files="./my_knowledge.jsonl",
    max_length=256,
    epochs=20,
    batch_size=2,
    lr=3e-4,
    lora_r=16,
    lora_alpha=32,
)
runner = Runner(model_config=model_config, quantizer=gptq, post_processes=[post_process])
runner.run()
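A knowledge file can be generated with the standard library. The "text" field name below mirrors the text_column used in the accuracy-recovery example, but it is an assumption here; check the PostProcessLoraSFT documentation for the exact schema expected from JSONL inputs:

```python
import json

# Write one JSON object per line; the "text" key is assumed to match the
# default text column (verify against the PostProcessLoraSFT docs).
facts = [
    "The Zorblatt-9000 espresso machine ships with a 2-year warranty.",
    "Support tickets for the Zorblatt-9000 are answered within 24 hours.",
]
with open("my_knowledge.jsonl", "w", encoding="utf-8") as f:
    for fact in facts:
        f.write(json.dumps({"text": fact}) + "\n")
```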
See Post-Process (LoRA SFT) for the full guide including teacher distillation and save/load.
Saving and Loading LoRA Models¶
LoRA-applied models use a dedicated save/load API:
# Save after LoRA SFT
runner.save_quantized_model_pt("./my_model_lora")
# Load
from onecomp import load_quantized_model_pt
model, tokenizer = load_quantized_model_pt("./my_model_lora")
Note
For standard quantized models (without LoRA), use save_quantized_model() / load_quantized_model() instead.
Analyzing Cumulative Error¶
Analyze how quantization error accumulates across layers:
runner.run()
results = runner.analyze_cumulative_error(
    layer_keywords=["mlp.down_proj"],
    plot_path="cumulative_error.png",
    json_path="cumulative_error.json",
)
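The idea being measured: small per-layer quantization errors compound as activations flow through the stack, so the relative error at deep layers can exceed any single layer's local error. A scalar toy model of that compounding (illustrative only; onecomp's actual metric may be defined differently):

```python
# Track relative output error through a stack of scalar "layers",
# comparing exact weights against slightly perturbed (quantized) ones.
def cumulative_relative_error(weights, q_weights, x=1.0):
    errors = []
    y, yq = x, x
    for w, wq in zip(weights, q_weights):
        y, yq = y * w, yq * wq          # propagate through one layer
        errors.append(abs(yq - y) / abs(y))
    return errors

print(cumulative_relative_error([1.1, 0.9, 1.2], [1.11, 0.91, 1.21]))
```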
Saving Quantization Statistics¶
runner.run()
runner.print_quantization_results()
runner.save_quantization_statistics("stats.json")
runner.save_quantization_results("results.pt")