Pre-Process (Rotation Preprocessing)
Rotation preprocessing applies SpinQuant/OSTQuant-style rotation matrices to model weights before quantization to reduce quantization error. It is particularly effective for low-bit quantization (e.g., 3-bit).
Overview
The rotation preprocessing pipeline:
- Trains rotation/scaling matrices using calibration data with an RTN quantization proxy
- Absorbs the learned matrices into model weights (fuses LayerNorms, rotates projections)
- Registers online Hadamard hooks on down_proj layers for inference correctness
- Saves the rotated model as a standard Hugging Face model directory
The saved model can then be quantized with any quantizer (GPTQ, RTN, etc.) using the
standard Runner pipeline.
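The effect of a rotation on round-to-nearest (RTN) error can be illustrated with a toy example. The sketch below is not onecomp code: it assumes a fixed normalized Hadamard matrix (no training), a single four-element weight vector with one outlier, and asymmetric min-max RTN at 3 bits. The helper names (`hadamard`, `matvec`, `rtn`, `sq_err`) are illustrative only.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rtn(values, wbits):
    """Asymmetric min-max round-to-nearest quantize-dequantize."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** wbits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]

def sq_err(a, b):
    """Sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

w = [4.0, 0.1, -0.1, 0.2]   # one outlier dominates the quantization range
x = [0.5, -1.0, 2.0, 0.25]  # an arbitrary activation vector
H = hadamard(4)

# Rotate weights and activations with the same orthogonal matrix.
w_rot, x_rot = matvec(H, w), matvec(H, x)

# The full-precision output is unchanged: <x, w> == <Hx, Hw>.
dot = sum(a * b for a, b in zip(x, w))
dot_rot = sum(a * b for a, b in zip(x_rot, w_rot))

# The rotated weights have a much narrower range, so RTN loses less.
err_plain = sq_err(w, rtn(w, wbits=3))
err_rot = sq_err(w_rot, rtn(w_rot, wbits=3))
print(dot, dot_rot, err_plain, err_rot)
```

Because the rotation is orthogonal, the error measured in the rotated space equals the error seen after rotating back. The input here is hand-picked to show the effect clearly; real gains depend on the weight distribution and on the learned matrices.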
Quick Start

```python
from onecomp import ModelConfig, Runner, GPTQ, prepare_rotated_model, setup_logger

setup_logger()

# Step 1: Rotation preprocessing
model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
rotated_config = prepare_rotated_model(
    model_config=model_config,
    save_directory="./rotated_model",
    wbits=4,
    groupsize=128,
)

# Step 2: Quantize (wbits/groupsize/sym must match Step 1)
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=rotated_config, quantizer=gptq)
runner.run()
```
Custom calibration data
Pass a CalibrationConfig to control the calibration dataset, sequence length,
or sample count used during rotation training. See
Configuration › CalibrationConfig for the
full parameter list.
```python
from onecomp import CalibrationConfig, prepare_rotated_model

rotated_config = prepare_rotated_model(
    model_config=model_config,
    save_directory="./rotated_model",
    wbits=4,
    groupsize=128,
    calibration_config=CalibrationConfig(
        max_length=2048,
        num_calibration_samples=256,
    ),
)
```
Supported Architectures
| Architecture | Status |
|---|---|
| Llama | Supported |
| Qwen3 | Supported |
Key Parameters
| Parameter | Description | Default |
|---|---|---|
| rotation | Apply rotation matrices (R1, R2) | True |
| scaling | Apply scaling diagonals (S_*) | False |
| rotation_mode | Rotation init mode: "random_hadamard", "hadamard", "random", "identity" | "random_hadamard" |
| scaling_mode | Scaling init mode: "identity", "random_ones", "random" | "identity" |
| enable_training | Train rotation matrices (vs. random init) | True |
| wbits | RTN proxy bit-width (must match quantizer) | 4 |
| groupsize | RTN proxy group size (must match quantizer) | -1 |
| sym | RTN proxy symmetric quantization | False |
| mse | Enable MSE grid search for RTN proxy clipping | False |
| norm | Lp norm exponent for MSE search | 2.4 |
| grid | Number of candidate shrink levels for MSE search | 100 |
| fp32_had | Use FP32 for online Hadamard transform | False |
| calibration_config | Calibration data configuration. See CalibrationConfig. When None, a default CalibrationConfig() is used. | None |
| seed | Seed for rotation matrix initialisation. The calibration-data seed is controlled by calibration_config.seed. | 0 |
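To make the mse, norm, and grid parameters concrete, here is a minimal sketch of a clipping grid search in the style of the GPTQ reference quantizer. It is an assumption that onecomp's RTN proxy works the same way. The function names and the `max_shrink` bound are hypothetical and do not appear in the parameter table above.

```python
def quantize(values, lo, hi, wbits):
    """Asymmetric quantize-dequantize with values clamped to [lo, hi]."""
    maxq = 2 ** wbits - 1
    scale = (hi - lo) / maxq
    out = []
    for v in values:
        q = min(max(round((v - lo) / scale), 0), maxq)
        out.append(lo + q * scale)
    return out

def lp_error(values, deq, norm):
    """Sum of |error| ** norm, the objective minimized by the search."""
    return sum(abs(d - v) ** norm for d, v in zip(deq, values))

def mse_clip_search(values, wbits=4, norm=2.4, grid=100, max_shrink=0.8):
    """Try progressively tighter clipping ranges and keep the best one."""
    lo, hi = min(values), max(values)
    best = (float("inf"), lo, hi)
    for i in range(int(max_shrink * grid)):
        p = 1.0 - i / grid  # shrink factor; i == 0 is the full min-max range
        cand_lo, cand_hi = p * lo, p * hi
        err = lp_error(values, quantize(values, cand_lo, cand_hi, wbits), norm)
        if err < best[0]:
            best = (err, cand_lo, cand_hi)
    return best

values = [-0.3, -0.1, 0.0, 0.05, 0.1, 0.2, 0.25, 3.0]
full_err = lp_error(values, quantize(values, min(values), max(values), 3), 2.4)
best_err, best_lo, best_hi = mse_clip_search(values, wbits=3)
print(full_err, best_err, best_lo, best_hi)
```

A larger grid evaluates more candidate ranges at a proportionally higher cost. The norm exponent controls how strongly large individual errors are penalized relative to many small ones.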
Input validation
prepare_rotated_model validates all parameters on entry. Invalid values for
rotation_mode, scaling_mode, calibration_config.strategy, or out-of-range
numeric parameters (e.g. wbits < 1, grid < 1) raise ValueError.
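The checks described above can be pictured with a small stand-in. This is a hypothetical sketch, not onecomp's implementation. It covers only the mode strings listed in the parameter table and the two numeric examples mentioned here, and the names `validate_rotation_args` and `is_rejected` are invented for illustration.

```python
# Hypothetical stand-in for the validation described above; not onecomp code.
ROTATION_MODES = ("random_hadamard", "hadamard", "random", "identity")
SCALING_MODES = ("identity", "random_ones", "random")

def validate_rotation_args(rotation_mode="random_hadamard",
                           scaling_mode="identity", wbits=4, grid=100):
    """Raise ValueError for unknown modes or out-of-range numeric values."""
    if rotation_mode not in ROTATION_MODES:
        raise ValueError(f"unknown rotation_mode: {rotation_mode!r}")
    if scaling_mode not in SCALING_MODES:
        raise ValueError(f"unknown scaling_mode: {scaling_mode!r}")
    if wbits < 1:
        raise ValueError(f"wbits must be >= 1, got {wbits}")
    if grid < 1:
        raise ValueError(f"grid must be >= 1, got {grid}")

def is_rejected(**kwargs):
    """Return True if the given arguments fail validation."""
    try:
        validate_rotation_args(**kwargs)
    except ValueError:
        return True
    return False

print(is_rejected(rotation_mode="hadamrd"))  # misspelled mode
print(is_rejected(wbits=0))                  # out-of-range bit-width
print(is_rejected())                         # defaults
```

Failing on entry means a typo in a mode string surfaces before any calibration or training time is spent.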
Parameter matching
The wbits, groupsize, and sym parameters must match the quantizer
used in Step 2. Mismatched values will degrade quantization quality because
the rotation matrices were optimized for different quantization settings.
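One way to avoid a mismatch is to define the shared settings once and compare them before running Step 2. The sketch below uses a hypothetical `QuantSettings` container and `mismatched_fields` helper; neither is part of onecomp. The defaults mirror the parameter table.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class QuantSettings:
    """Hypothetical container for the settings both steps must agree on."""
    wbits: int = 4
    groupsize: int = -1
    sym: bool = False

def mismatched_fields(proxy, quantizer):
    """Return the names of settings that differ between Step 1 and Step 2."""
    a, b = asdict(proxy), asdict(quantizer)
    return [name for name in a if a[name] != b[name]]

proxy = QuantSettings(wbits=4, groupsize=128)      # used for rotation training
quantizer = QuantSettings(wbits=3, groupsize=128)  # used for final quantization
diff = mismatched_fields(proxy, quantizer)
print(diff)
```

In practice the same values would be passed to both prepare_rotated_model and the quantizer constructor, so a check like this mostly guards against settings drifting apart across scripts.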
Save and Load
Rotation-preprocessed quantized models support the standard save/load API:
```python
# Save
runner.save_quantized_model("./quantized_model")

# Load (Hadamard hooks are auto-registered)
from onecomp import load_quantized_model

model, tokenizer = load_quantized_model("./quantized_model")
```
After runner.save_quantized_model(), the saved config.json includes
"rotated": true and the "fp32_had" value used during preprocessing.
load_quantized_model() uses these flags to automatically register the
required Hadamard hooks on down_proj layers.
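A toy example shows why the hooks are required. The sketch below is not onecomp's hook implementation. It assumes a normalized, Sylvester-ordered Hadamard matrix H has been folded into the input side of a weight matrix (W' = W H), so the layer input must be multiplied by H at run time. The helper names `fwht` and `linear` are illustrative only.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in x]

def linear(weight, x):
    """y = W x for a weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

weight = [[0.2, -1.0, 0.5, 3.0], [1.5, 0.1, -0.7, 0.4]]  # 2 outputs, 4 inputs
x = [1.0, -2.0, 0.5, 4.0]

# Offline: fold H into each weight row (H is symmetric, so row @ H == H @ row).
weight_rot = [fwht(row) for row in weight]

y_ref = linear(weight, x)                # original layer
y_hooked = linear(weight_rot, fwht(x))   # rotated weights + online transform
y_missing = linear(weight_rot, x)        # rotated weights, transform skipped
print(y_ref, y_hooked, y_missing)
```

Since the normalized H is its own inverse, the online transform cancels the folded rotation and `y_hooked` matches `y_ref`. Skipping it (`y_missing`) gives a different result, which is why runtimes that do not apply the transform cannot serve these models correctly.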
Examples
Tip
Complete working examples are available in the repository:
- example/pre_process/example_llama_preprocess_rtn.py -- Rotation preprocessing + RTN quantization (TinyLlama)
- example/pre_process/example_preprocess_save_load.py -- Rotation preprocessing + GPTQ with save/load and perplexity comparison
Limitations
- vLLM inference is not supported. vLLM kernels do not apply the online Hadamard transform required by rotation-preprocessed models.
- Only Llama and Qwen3 architectures are currently supported.
API Reference
See Pre-Process API for full parameter documentation.