QEP (Quantization Error Propagation)

QEP is a meta-algorithm that improves any layer-wise quantization method by compensating for the error that propagates from previously quantized layers to subsequent ones.

Reference

Yamato Arai and Yuma Ichikawa, "Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization," NeurIPS 2025. OpenReview | Original implementation

Motivation

Standard layer-wise PTQ quantizes each layer independently using the original input activations. However, once layer \(l\) is quantized, the input to layer \(l+1\) is no longer the original activation: it is the output of the quantized layer \(l\), which carries quantization error. This error accumulates with depth and degrades quantization quality, especially at low bit-widths.

How QEP Works

QEP addresses this by adjusting the weights of each layer before quantization to account for the activation error introduced by previously quantized layers.

For a layer with weight \(W\), original input activations \(X\), and quantized-model input activations \(\hat{X}\):

  1. Compute the activation difference: \(\Delta = X - \hat{X}\)
  2. Compute the cross-term: \(\Delta^T \hat{X}\)
  3. Solve for a weight correction \(\Delta W\) via the Hessian:
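\[ \Delta W = \alpha \cdot W (\Delta^T \hat{X}) H^{-1} \]

where \(H = \hat{X}^T \hat{X}\) is the Hessian of the layer-wise reconstruction objective and \(\alpha\) is the correction strength (perccorr). With activations stored as rows (one token per row), \(\Delta^T \hat{X}\) and \(H\) are both \(d_{in} \times d_{in}\), and the leading factor \(W\) gives \(\Delta W\) the same shape as \(W\).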

  4. Quantize the adjusted weight \(W + \Delta W\) using the base quantizer (e.g., GPTQ).
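
The steps above can be sketched in NumPy for a single linear layer. This is an illustrative sketch, not OneComp's implementation: the function name `qep_adjust_weight` is hypothetical, activations are assumed to be stored as rows (tokens by input features), the cross-term is assumed to be left-multiplied by \(W\) so that the correction has the same shape as \(W\), and Hessian damping is omitted for brevity.

```python
import numpy as np


def qep_adjust_weight(W, X, X_hat, perccorr=0.5):
    """Return W + dW for one linear layer (hypothetical helper, not OneComp API).

    W:     (d_out, d_in) weight, applied as Y = X @ W.T
    X:     (n_tokens, d_in) inputs taken from the original model
    X_hat: (n_tokens, d_in) inputs taken from the partially quantized model
    """
    delta = X - X_hat                  # step 1: activation difference
    cross = delta.T @ X_hat            # step 2: cross-term, shape (d_in, d_in)
    H = X_hat.T @ X_hat                # Hessian, shape (d_in, d_in), undamped here
    # Step 3: dW = perccorr * W @ cross @ inv(H).
    # H is symmetric, so solve(H, M.T).T equals M @ inv(H) without forming inv(H).
    dW = perccorr * np.linalg.solve(H, (W @ cross).T).T
    return W + dW                      # step 4 would pass this to the base quantizer


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
X = rng.normal(size=(64, 6))
X_hat = X + 0.1 * rng.normal(size=X.shape)   # stand-in for upstream quantization error

W_adj = qep_adjust_weight(W, X, X_hat, perccorr=1.0)
err_plain = np.linalg.norm(X @ W.T - X_hat @ W.T)
err_adjusted = np.linalg.norm(X @ W.T - X_hat @ W_adj.T)
print(err_plain, err_adjusted)
```

Under these assumptions, perccorr=1 yields the least-squares weight that best reproduces the original layer output from the perturbed input, so the second printed error should not exceed the first. perccorr=0 returns \(W\) unchanged, and intermediate values interpolate between the two.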

Two Implementations

OneComp provides two QEP implementations, controlled by the QEPConfig.general parameter:

Architecture-aware (default, general=False)

  • Exploits the structure of transformer blocks (e.g., QKV layers sharing the same input)
  • Groups layers that share input activations for efficient Hessian computation
  • Processes one transformer block at a time to minimize GPU memory usage
  • Recommended for Llama-like architectures
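
The idea of grouping layers by shared input can be illustrated with a small sketch. The layer names follow the usual Llama module naming, but the input labels (`attn_input` and so on) and the helper `group_by_shared_input` are hypothetical and only show why fewer Hessians are needed; OneComp's actual grouping logic may differ.

```python
from collections import defaultdict

# Hypothetical map from layer name to the activation that feeds it.
LAYER_INPUTS = {
    "self_attn.q_proj": "attn_input",
    "self_attn.k_proj": "attn_input",
    "self_attn.v_proj": "attn_input",
    "self_attn.o_proj": "attn_output",
    "mlp.gate_proj": "mlp_input",
    "mlp.up_proj": "mlp_input",
    "mlp.down_proj": "mlp_hidden",
}


def group_by_shared_input(layer_inputs):
    """Group layer names by the activation they consume, preserving order."""
    groups = defaultdict(list)
    for layer, source in layer_inputs.items():
        groups[source].append(layer)
    return dict(groups)


groups = group_by_shared_input(LAYER_INPUTS)
for source, layers in groups.items():
    # The Hessian depends only on the input, so one per group is enough.
    print(source, layers)
```

In this sketch, seven layers collapse into four groups, so the input statistics for a block would be accumulated four times instead of seven.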

Generic (general=True)

  • Architecture-independent implementation
  • Captures input activations for each layer individually
  • Works with any model architecture
  • Higher memory consumption and more forward passes

Usage

Basic QEP

from onecomp import ModelConfig, Runner, GPTQ

model_config = ModelConfig(model_id="meta-llama/Llama-2-7b-hf", device="cuda:0")
gptq = GPTQ(wbits=3)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
)
runner.run()

Custom QEP Configuration

# Continues from the Basic QEP example (reuses Runner, model_config, and gptq).
from onecomp import QEPConfig

qep_config = QEPConfig(
    general=False,              # Architecture-aware (default)
    percdamp=0.01,              # Hessian damping
    perccorr=0.5,               # Correction strength
    device="cuda:0",            # GPU for QEP computation
    exclude_layer_keywords=["mlp.down_proj"],
)

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()

Generic QEP (for non-Llama architectures)

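# Continues from the examples above (reuses Runner, QEPConfig, model_config, and gptq).
qep_config = QEPConfig(general=True)  # Architecture-independent implementation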

runner = Runner(
    model_config=model_config,
    quantizer=gptq,
    qep=True,
    qep_config=qep_config,
)
runner.run()

Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| general | bool | Use generic (architecture-independent) QEP | False |
| percdamp | float | Damping percentage for Hessian regularization | 0.01 |
| perccorr | float | Correction strength (0 = no correction, 1 = full) | 0.5 |
| device | str | GPU device for QEP computation | "cuda:0" |
| exclude_layer_keywords | list[str] | Layer keywords excluded from error propagation | ["mlp.down_proj"] |
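
To make percdamp concrete, here is a sketch of the damping rule used in GPTQ-style code, where a fraction of the mean diagonal of the Hessian is added to every diagonal entry. That QEP in OneComp applies exactly this rule is an assumption, and `damp_hessian` is a hypothetical helper name.

```python
import numpy as np


def damp_hessian(H, percdamp=0.01):
    """Add percdamp * mean(diag(H)) to the diagonal (assumed GPTQ-style rule)."""
    damp = percdamp * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])


rng = np.random.default_rng(0)
X_hat = rng.normal(size=(2, 4))        # fewer tokens than features: H is singular
H = X_hat.T @ X_hat
H_damped = damp_hessian(H, percdamp=0.01)

# The smallest eigenvalue moves from (numerically) zero to roughly the damping term.
print(np.linalg.eigvalsh(H).min(), np.linalg.eigvalsh(H_damped).min())
```

Larger percdamp values make the inverse better conditioned at the cost of a less exact correction.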

Note

The default exclude_layer_keywords is designed for Llama-like architectures. You may need to adjust this for other model families.
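
As a rough illustration of how such keywords might select layers, the sketch below assumes a simple substring match against the full module name. The matching rule and the helper `is_excluded` are assumptions for illustration; check the OneComp source for the exact behavior before relying on it.

```python
def is_excluded(layer_name, exclude_layer_keywords):
    """True if any keyword occurs as a substring of the layer name (assumed rule)."""
    return any(keyword in layer_name for keyword in exclude_layer_keywords)


layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

# Layers that would still receive the QEP correction under the default keywords.
corrected = [name for name in layer_names if not is_excluded(name, ["mlp.down_proj"])]
print(corrected)
```

For a model family whose feed-forward output projection has a different name, the keyword list would need to use that name instead.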