QEP (Quantization Error Propagation)¶
QEP is a meta-algorithm that improves any layer-wise quantization method by compensating for the error that propagates from previously quantized layers to subsequent ones.
Reference
Yamato Arai and Yuma Ichikawa, "Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization," NeurIPS 2025. OpenReview | Original implementation
Motivation¶
Standard layer-wise PTQ quantizes each layer independently using the original input activations. However, after quantizing layer \(l\), the input to layer \(l+1\) is no longer the original activation -- it is the output of the quantized layer \(l\), which contains quantization error. This accumulated error degrades quantization quality, especially at low bit-widths.
How QEP Works¶
QEP addresses this by adjusting the weights of each layer before quantization to account for the activation error introduced by previously quantized layers.
For a layer with weight \(W\), original input activations \(X\), and quantized-model input activations \(\hat{X}\):
- Compute the activation difference: \(\Delta = X - \hat{X}\)
- Compute the cross-term: \(\Delta^T \hat{X}\)
- Solve for a weight correction \(\Delta W\) via the Hessian:
where \(H = \hat{X}^T \hat{X}\) is the Hessian matrix and \(\alpha\) is the correction
strength (perccorr).
- Quantize the adjusted weight \(W + \Delta W\) using the base quantizer (e.g., GPTQ).
Two Implementations¶
OneComp provides two QEP implementations, controlled by the QEPConfig.general parameter:
Architecture-aware (default, general=False)¶
- Exploits the structure of transformer blocks (e.g., QKV layers sharing the same input)
- Groups layers that share input activations for efficient Hessian computation
- Processes one transformer block at a time to minimize GPU memory usage
- Recommended for Llama-like architectures
Generic (general=True)¶
- Architecture-independent implementation
- Captures input activations for each layer individually
- Works with any model architecture
- Higher memory consumption and more forward passes
Usage¶
Basic QEP¶
from onecomp import ModelConfig, Runner, GPTQ
model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
gptq = GPTQ(wbits=3)
runner = Runner(
model_config=model_config,
quantizer=gptq,
qep=True,
)
runner.run()
Custom QEP Configuration¶
from onecomp import QEPConfig
qep_config = QEPConfig(
general=False, # Architecture-aware (default)
percdamp=0.01, # Hessian damping
perccorr=0.5, # Correction strength
device="cuda:0", # GPU for QEP computation
exclude_layer_keywords=["mlp.down_proj"],
)
runner = Runner(
model_config=model_config,
quantizer=gptq,
qep=True,
qep_config=qep_config,
)
runner.run()
Generic QEP (for non-Llama architectures)¶
qep_config = QEPConfig(general=True)
runner = Runner(
model_config=model_config,
quantizer=gptq,
qep=True,
qep_config=qep_config,
)
runner.run()
Parameters¶
| Parameter | Type | Description | Default |
|---|---|---|---|
general |
bool |
Use generic (architecture-independent) QEP | False |
percdamp |
float |
Damping percentage for Hessian regularization | 0.01 |
perccorr |
float |
Correction strength (0 = no correction, 1 = full) | 0.5 |
device |
str |
GPU device for QEP computation | "cuda:0" |
exclude_layer_keywords |
list[str] |
Layer keywords excluded from error propagation | ["mlp.down_proj"] |
Note
The default exclude_layer_keywords is designed for Llama-like architectures. You may need
to adjust this for other model families.