RTN (Round-To-Nearest)¶
RTN is the simplest quantization method. It rounds each weight to the nearest quantization level without using calibration data or Hessian information.
Algorithm¶
For each weight element \(w\):
where:
- \(s\) is the scale factor
- \(z\) is the zero point
- \(b\) is the bit-width
- \(\lfloor \cdot \rceil\) denotes rounding to the nearest integer
The integer level range is always \([0, 2^b - 1]\) regardless of sym.
- Symmetric (
sym=True): max-abs symmetrisation \(x_{\max} = \max(|x_{\min}|, x_{\max})\), with zero point at \((2^b - 1 + 1) / 2\). This aligns withGPTQExcecutor. - Asymmetric (
sym=False): range includes zero (\(x_{\min} \le 0 \le x_{\max}\)), zero point = \(\lfloor -x_{\min} / s \rceil\).
When mse=True, an MSE grid search is performed to find the optimal clipping range
that minimises the Lp-norm reconstruction error.
RTN serves as a baseline for comparing more sophisticated quantization algorithms.
Parameters¶
| Parameter | Type | Description | Default |
|---|---|---|---|
wbits |
int |
Quantization bit-width | 4 |
groupsize |
int |
Group size for group-wise quantization (-1 = none) | -1 |
sym |
bool |
Symmetric quantization | False |
mse |
bool |
Enable MSE grid search for optimal clipping | False |
norm |
float |
Lp norm exponent for MSE search | 2.4 |
grid |
int |
Number of candidate shrink levels for MSE search | 100 |
Usage¶
from onecomp import ModelConfig, Runner
from onecomp.quantizer.rtn import RTN
model_config = ModelConfig(
model_id="meta-llama/Llama-2-7b-hf",
device="cuda:0",
)
rtn = RTN(wbits=4, groupsize=128)
runner = Runner(model_config=model_config, quantizer=rtn)
runner.run()
Save and Load¶
RTN uses the same tensor format as GPTQ (qweight/scales/qzeros), so it emits
quant_method="gptq" and can be saved with the standard OneComp save API, reloaded
with load_quantized_model(), and served with vLLM's built-in GPTQ plugin:
runner.save_quantized_model("./output/rtn_model")
# Load later with OneComp
from onecomp import load_quantized_model
model, tokenizer = load_quantized_model("./output/rtn_model")
See vLLM Inference for serving details.
Bit-width and vLLM
Use wbits in {2, 3, 4, 8} for vLLM serving. RTN itself accepts a wider
range of wbits, but GPTQ-compatible bit packing and vLLM serving are limited
to these bit-widths.
Rotation-preprocessed RTN models cannot be served with vLLM
The SpinQuant-style Rotation Preprocessing + RTN
flow produces models that vLLM cannot serve, because vLLM does not apply the online
Hadamard transform on down_proj inputs. Such models remain loadable with
load_quantized_model(), which auto-registers the Hadamard hooks.
Characteristics¶
- No calibration data required -- quantization is performed directly on the model weights
- Very fast -- no optimization or iterative processing
- Lower quality -- compared to GPTQ or other Hessian-based methods, RTN produces higher quantization error
- Useful as a baseline -- provides a lower bound on expected quantization quality
When to Use RTN¶
- Quick experiments where calibration data is not available
- Comparing against more advanced methods as a baseline
- High bit-width quantization (e.g., 8-bit) where the difference from optimal is small