AutoBit

AutoBit is a mixed-precision quantization method that automatically assigns optimal per-layer bit-widths under a memory budget using Integer Linear Programming (ILP).

Algorithm

Given a target average bit-width \(b^*\) (estimated from available VRAM or specified manually), AutoBit solves for the assignment of each layer \(\ell\) to one of \(K\) candidate quantizers:

\[ \min_{x_{\ell,k}} \sum_{\ell} \sum_{k} c_{\ell,k} \cdot x_{\ell,k} \quad \text{s.t.} \quad \sum_{\ell} \sum_{k} \text{bpw}_{\ell,k} \cdot n_\ell \cdot x_{\ell,k} \le b^* \cdot \sum_\ell n_\ell, \qquad \sum_{k} x_{\ell,k} = 1 \;\; \forall \ell \]

where \(x_{\ell,k} \in \{0, 1\}\) indicates whether layer \(\ell\) is assigned to quantizer \(k\), \(c_{\ell,k}\) is the quantization error cost, \(\text{bpw}_{\ell,k}\) is the effective bits-per-weight, and \(n_\ell\) is the number of parameters in layer \(\ell\).
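
The structure of this ILP is easy to reproduce with an off-the-shelf solver. Below is a minimal sketch using PuLP with illustrative numbers; it is not the onecomp solver, just the same objective, budget constraint, and one-quantizer-per-layer constraint in code:

import pulp

num_layers, num_cands = 4, 3
cost = [[9.0, 4.0, 1.0],          # cost[l][k]: error of quantizer k on layer l
        [8.0, 3.5, 0.9],
        [7.0, 3.0, 0.8],
        [6.0, 2.5, 0.7]]
bpw = [2.25, 3.25, 4.25]          # effective bits-per-weight of each candidate
n = [1000, 2000, 1500, 500]       # parameter count of each layer
b_star = 3.0                      # target average bit-width

prob = pulp.LpProblem("autobit", pulp.LpMinimize)
x = [[pulp.LpVariable(f"x_{l}_{k}", cat="Binary")
      for k in range(num_cands)] for l in range(num_layers)]

# Objective: total quantization error
prob += pulp.lpSum(cost[l][k] * x[l][k]
                   for l in range(num_layers) for k in range(num_cands))
# Memory budget: average bits-per-weight must not exceed b*
prob += pulp.lpSum(bpw[k] * n[l] * x[l][k]
                   for l in range(num_layers)
                   for k in range(num_cands)) <= b_star * sum(n)
# Each layer is assigned exactly one quantizer
for l in range(num_layers):
    prob += pulp.lpSum(x[l][k] for k in range(num_cands)) == 1

prob.solve()
assignment = [max(range(num_cands), key=lambda k: x[l][k].value())
              for l in range(num_layers)]
print(assignment)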

Two error metrics are supported:

  • RTN error: \(c_{\ell,k} = \| W_\ell - \hat{W}_{\ell,k} \|_F^2\)
  • Activation-aware error: \(c_{\ell,k} = \sum_{q,p} b_q \cdot a_p \cdot (\Delta W_{qp})^2\), where \(a_p\) and \(b_q\) are input and output curvature statistics collected from calibration data (both costs are sketched in code after this list)
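
As a stand-alone illustration (not the onecomp implementation), both costs can be computed from the weight perturbation \(\Delta W = W - \hat{W}\). Here a simple round-to-nearest quantizer stands in for candidate \(k\), and the calibration statistics are random placeholders:

import torch

def rtn_quantize(W, wbits):
    # Simple symmetric round-to-nearest quantizer, standing in for candidate k
    qmax = 2 ** (wbits - 1) - 1
    scale = W.abs().max() / qmax
    return (W / scale).round().clamp(-qmax, qmax) * scale

W = torch.randn(128, 256)               # layer weight [out_features, in_features]
delta = W - rtn_quantize(W, wbits=3)    # quantization perturbation ΔW

# RTN error: squared Frobenius norm of ΔW
rtn_cost = delta.pow(2).sum()

# Activation-aware error: each entry ΔW_qp is weighted by an output-row
# statistic b_q and an input-column statistic a_p (random placeholders here;
# in practice these come from calibration data)
b = torch.rand(128)
a = torch.rand(256)
aa_cost = (b[:, None] * a[None, :] * delta.pow(2)).sum()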

When enable_fused_groups=True (the default), equality constraints ensure that vLLM fused layers (e.g., q/k/v projections, gate/up projections) receive the same quantizer assignment.
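
In terms of the ILP sketch above, this amounts to tying the binary variables of the fused layers together (layer indices are illustrative):

# Hypothetical fused group: suppose layers 0-2 are one block's q/k/v projections
fused = [0, 1, 2]
for k in range(num_cands):
    for l in fused[1:]:
        prob += x[fused[0]][k] == x[l][k]   # same candidate for every member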

For ultra-low-bit targets (\(\le 2\) bpw), AutoBit can optionally inject DBF (Double Binary Factorization) as a fallback quantizer for the layers where GPTQ candidates would incur excessive error.
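
As a configuration sketch (parameter names are those documented in the table below; the candidate injection itself is internal to AutoBit):

from onecomp import GPTQ
from onecomp.quantizer.autobit import AutoBitQuantizer

autobit = AutoBitQuantizer(
    target_bit=1.8,                              # below dbf_threshold
    quantizers=[GPTQ(wbits=2), GPTQ(wbits=3)],
    auto_dbf=True,                               # allow DBF fallback (default)
    dbf_threshold=2.0,                           # inject DBF when target_bit <= 2.0
)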

Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| quantizers | list | List of candidate quantizers (e.g., GPTQ at different bit-widths) | (required) |
| target_bit | float | Target average bit-width | None (auto) |
| assignment_strategy | str | "activation_aware", "ilp", or "manual" | "activation_aware" |
| calibration_config | CalibrationConfig | Calibration settings for activation statistics | auto |
| enable_fused_groups | bool | Enforce the same quantizer for vLLM fused layers | True |
| auto_dbf | bool | Enable DBF fallback for ultra-low-bit targets | True |
| dbf_threshold | float | Target bit-width threshold below which DBF is injected | 2.0 |
| save_path | str | Path to save the assignment heatmap visualization | None |
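
For instance, a configuration exercising the strategy and visualization options might look like this (values illustrative; complete runs are shown under Usage below):

from onecomp import GPTQ
from onecomp.quantizer.autobit import AutoBitQuantizer

autobit = AutoBitQuantizer(
    quantizers=[GPTQ(wbits=2), GPTQ(wbits=3), GPTQ(wbits=4)],
    target_bit=3.0,
    assignment_strategy="activation_aware",   # default cost metric
    save_path="assignment_heatmap.png",       # write the per-layer heatmap here
)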

Usage

VRAM-based automatic bit-width estimation

from onecomp import Runner

# Estimate a target bit-width from 8 GB of available VRAM, then quantize
runner = Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    total_vram_gb=8,    # VRAM budget used by estimate_wbits_from_vram()
)

Explicit target bit-width with custom candidates

from onecomp import GPTQ, ModelConfig, Runner
from onecomp.quantizer.autobit import AutoBitQuantizer

model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")

autobit = AutoBitQuantizer(
    target_bit=3.0,
    # Candidate pool: AutoBit assigns one of these to each layer so that the
    # weighted average lands at (or under) 3.0 bits per weight
    quantizers=[GPTQ(wbits=2), GPTQ(wbits=3), GPTQ(wbits=4)],
)

runner = Runner(model_config=model_config, quantizer=autobit)
runner.run()

Mixed bit-width and group size

autobit = AutoBitQuantizer(
    target_bit=3.0,
    quantizers=[
        GPTQ(wbits=2, groupsize=32),    # lowest footprint, highest error
        GPTQ(wbits=4, groupsize=128),   # mid footprint, coarser scales
        GPTQ(wbits=4, groupsize=32),    # highest footprint, lowest error
    ],
)
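
Group size enters the ILP through the effective bits-per-weight: each group stores its own quantization scale, so (assuming fp16 scales and ignoring zero-point overhead) a 4-bit candidate at groupsize 32 costs roughly \(4 + 16/32 = 4.5\) bits per weight, while groupsize 128 costs about \(4 + 16/128 \approx 4.13\). The solver can therefore spend the finer groupsize only on the layers where it buys the most accuracy.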

VRAM Estimation

When target_bit is not specified, Runner.auto_run() uses estimate_wbits_from_vram() to derive the target from available GPU memory. This accounts for model size, KV cache, and inference overhead to find the highest average bit-width at which the quantized model still fits in the VRAM budget.
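
The arithmetic behind such an estimate can be sketched as follows (estimate_target_bits is a hypothetical helper, not the onecomp API; the real estimate_wbits_from_vram() accounts for the overheads more carefully):

def estimate_target_bits(n_params, total_vram_gb, overhead_gb=2.0):
    # Hypothetical sketch: reserve a flat overhead for KV cache and
    # activations, then convert the remaining budget to bits per weight
    budget_bits = (total_vram_gb - overhead_gb) * 8 * 1024**3
    return budget_bits / n_params

# e.g., Llama-2-7B (~6.7e9 parameters) on an 8 GB GPU
print(estimate_target_bits(6.7e9, 8))   # ~7.7 bits/weight under these assumptions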

vLLM Compatibility

AutoBit emits a mixed_gptq-compatible quantization_config, allowing quantized models to be served directly with vLLM via the built-in Mixed-GPTQ plugin. The enable_fused_groups constraint ensures that fused layers (qkv_proj, gate_up_proj) have matching bit-widths, which is required by vLLM.
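
After quantization, the saved checkpoint can be loaded through the standard vLLM Python API (assuming the Mixed-GPTQ plugin is installed so the emitted quantization_config is recognized; the model path is a placeholder):

from vllm import LLM

# vLLM picks up the quantization method from the checkpoint's quantization_config
llm = LLM(model="path/to/autobit-quantized-llama")
outputs = llm.generate(["Hello, world!"])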