Pre-Process (Rotation Preprocessing)¶
Rotation preprocessing applies SpinQuant/OstQuant-style rotation matrices to model weights before quantization, reducing quantization error. This is particularly effective for low-bit quantization (e.g., 3-bit).
Overview¶
The rotation preprocessing pipeline:
- Trains rotation/scaling matrices using calibration data with an RTN quantization proxy
- Absorbs the learned matrices into model weights (fuses LayerNorms, rotates projections)
- Registers online Hadamard hooks on
down_projlayers for inference correctness - Saves the rotated model as a standard Hugging Face model directory
The saved model can then be quantized with any quantizer (GPTQ, RTN, etc.) using the
standard Runner pipeline.
Quick Start¶
from onecomp import ModelConfig, Runner, GPTQ, prepare_rotated_model, setup_logger
setup_logger()
# Step 1: Rotation preprocessing
model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
rotated_config = prepare_rotated_model(
model_config=model_config,
save_directory="./rotated_model",
wbits=4,
groupsize=128,
)
# Step 2: Quantize (wbits/groupsize/sym must match Step 1)
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=rotated_config, quantizer=gptq)
runner.run()
Supported Architectures¶
| Architecture | Status |
|---|---|
| Llama | Supported |
| Qwen3 | Supported |
Key Parameters¶
| Parameter | Description | Default |
|---|---|---|
rotation |
Apply rotation matrices (R1, R2) | True |
scaling |
Apply scaling diagonals (S_*) | False |
enable_training |
Train rotation matrices (vs. random init) | True |
wbits |
RTN proxy bit-width (must match quantizer) | 4 |
groupsize |
RTN proxy group size (must match quantizer) | -1 |
sym |
RTN proxy symmetric quantization | False |
fp32_had |
Use FP32 for online Hadamard transform | False |
seed |
Seed for rotation init and calibration data | 0 |
Parameter matching
The wbits, groupsize, and sym parameters must match the quantizer
used in Step 2. Mismatched values will degrade quantization quality because
the rotation matrices were optimized for different quantization settings.
Save and Load¶
Rotation-preprocessed quantized models support the standard save/load API:
# Save
runner.save_quantized_model("./quantized_model")
# Load (Hadamard hooks are auto-registered)
from onecomp import load_quantized_model
model, tokenizer = load_quantized_model("./quantized_model")
The saved config.json includes "rotated": true and "fp32_had": false,
which load_quantized_model() uses to automatically register the required
Hadamard hooks on down_proj layers.
Limitations¶
- vLLM inference is not supported. vLLM kernels do not apply the online Hadamard transform required by rotation-preprocessed models.
- Only Llama and Qwen3 architectures are currently supported.
API Reference¶
See Pre-Process API for full parameter documentation.