Skip to content

Algorithm Overview

Fujitsu One Compression (OneComp) provides a collection of post-training quantization (PTQ) algorithms for LLMs. Each algorithm represents a different approach to compressing model weights while preserving model quality.

What is Post-Training Quantization?

Post-training quantization converts model weights from high-precision floating-point (e.g., FP16) to lower-precision representations (e.g., INT4, INT3) after training is complete. This reduces model size and can accelerate inference without requiring retraining.

The core problem is to find quantized weights \(\hat{W}\) that minimize the error:

\[ \min_{\hat{W}} \| W X - \hat{W} X \|_F^2 \]

where \(W\) is the original weight matrix and \(X\) is the input activation matrix.

Available Algorithms

Algorithm Bit-width Calibration Description
GPTQ Arbitrary (typically 2--4) Required Hessian-based optimal rounding with column-by-column processing
DBF ~1.5 (binary) Required Double Binary Factorization: \(W \approx A \cdot \text{diag}(d) \cdot B\)
RTN Arbitrary Not required Round-To-Nearest baseline
AutoBit Mixed-precision Required ILP-based per-layer bit-width assignment under a VRAM budget
JointQ Arbitrary Required Joint optimization of assignments and scale parameters
QuIP Arbitrary Required Quantization with Incoherence Processing
ARB Arbitrary Required Adaptive Rounding with Binary search
CQ Arbitrary Required Combinatorial quantization
QBB Arbitrary Required Quantization with Block-wise Balancing
Onebit 1-bit Required Extreme 1-bit quantization

Quantization Error Propagation (QEP)

QEP is not a standalone quantizer but a meta-algorithm that works on top of any layer-wise quantizer. It compensates for the error that propagates from one layer to the next during sequential quantization.

QEP can be combined with any quantizer:

runner = Runner(
    model_config=model_config,
    quantizer=GPTQ(wbits=3),
    qep=True,
)

Layer-Projected Coordinate Descent (LPCD)

LPCD is a submodule-level refinement framework built on top of layer-wise PTQ. Instead of treating each linear layer independently, LPCD jointly optimizes related module groups such as Q/K, V/O, MLP up/down, or residual paths, then projects the refined solution back through the underlying quantizer.

LPCD can be used with or without QEP:

from onecomp import GPTQ, LPCDConfig, Runner

runner = Runner(
    model_config=model_config,
    quantizer=GPTQ(wbits=3, groupsize=128),
    qep=True,
    lpcd=True,
    lpcd_config=LPCDConfig(),
)
runner.run()

Choosing an Algorithm

  • GPTQ is the recommended default for most use cases (4-bit or 3-bit quantization)
  • GPTQ + QEP provides the best quality at low bit-widths (3-bit or lower)
  • GPTQ + QEP + LPCD is useful when you want additional submodule refinement beyond layer-wise PTQ
  • RTN is useful as a fast baseline or when calibration data is not available
  • DBF targets extreme compression (~1.5-bit) with binary factorization