
Fujitsu One Compression

Open-source Python library for post-training quantization of Large Language Models


Fujitsu One Compression (OneComp) is an open-source Python library for post-training quantization of Large Language Models (LLMs). It implements state-of-the-art quantization algorithms including GPTQ, DBF, RTN, and the novel Quantization Error Propagation (QEP) method proposed in our NeurIPS 2025 paper.
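Post-training quantization replaces each trained weight with one of a small set of discrete levels, without retraining the model. The simplest method listed above, RTN (round-to-nearest), rounds every weight to its nearest level. The NumPy snippet below is a minimal sketch of group-wise RTN, assuming asymmetric min-max scaling over groups of 128 consecutive columns (the same grouping as the groupsize=128 setting mentioned for JointQ). It is an illustration only, not OneComp's implementation, and `rtn_quantize_groupwise` is a hypothetical name.

```python
import numpy as np


def rtn_quantize_groupwise(w: np.ndarray, bits: int = 4, groupsize: int = 128) -> np.ndarray:
    """Fake-quantize a 2-D weight matrix with round-to-nearest (illustrative sketch).

    Each row is split into groups of `groupsize` consecutive columns, and every
    group gets its own scale and zero point (asymmetric min-max quantization).
    Returns the dequantized weights so the quantization error can be measured.
    """
    rows, cols = w.shape
    if cols % groupsize != 0:
        raise ValueError("number of columns must be divisible by groupsize")
    qmax = 2**bits - 1
    g = w.reshape(rows, cols // groupsize, groupsize)
    wmin = g.min(axis=-1, keepdims=True)
    wmax = g.max(axis=-1, keepdims=True)
    scale = np.maximum(wmax - wmin, 1e-8) / qmax       # step size per group
    zero = np.round(-wmin / scale)                      # integer zero point per group
    q = np.clip(np.round(g / scale) + zero, 0, qmax)    # integer codes in [0, qmax]
    return ((q - zero) * scale).reshape(rows, cols)     # dequantize back to floats


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)).astype(np.float32)
w4 = rtn_quantize_groupwise(w, bits=4, groupsize=128)
w8 = rtn_quantize_groupwise(w, bits=8, groupsize=128)
print("mean abs error, 4-bit:", np.abs(w - w4).mean())
print("mean abs error, 8-bit:", np.abs(w - w8).mean())
```

The 4-bit error printed here is the kind of gap that more careful methods such as GPTQ aim to reduce at the same bitwidth.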

Key Features

  • Quantization Error Propagation (QEP) -- A post-training quantization method that corrects quantization errors by propagating them to subsequent layers, improving the accuracy of quantized LLMs. See Arai & Ichikawa, NeurIPS 2025 for details.
  • vLLM Plugin Integration -- Serve OneComp-quantized models with vLLM via built-in plugins for DBF and Mixed-GPTQ quantization methods.
  • AutoBit -- Mixed-precision quantization with ILP-based bitwidth assignment. Automatically estimates the target bitwidth from available VRAM and assigns per-layer bitwidths to minimize quantization error under the memory budget.
  • JointQ -- Joint quantization method that optimizes weight assignments and scale parameters simultaneously for improved quantization accuracy. Supports group-wise quantization (e.g., 4-bit, groupsize=128).
  • LoRA SFT Post-Process -- Fine-tune quantized models with LoRA adapters for accuracy recovery or domain-specific knowledge injection. Supports SFT loss, teacher distillation, and intermediate block alignment.
  • Rotation Preprocessing -- SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Rotation/scaling matrices are absorbed into model weights, with online Hadamard hooks automatically registered at load time. Supports Llama and Qwen3 architectures.
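Conventional layer-wise PTQ fits each quantized layer to its full-precision inputs, so the error introduced by earlier, already-quantized layers is never taken into account. The NumPy sketch below illustrates only the general idea behind error propagation: it refits a layer's weights so that, given the inputs the layer actually receives in the partially quantized model, it reproduces the original model's output, and only then quantizes them. The toy two-layer ReLU network, the per-row RTN quantizer, and the plain least-squares refit are our own simplifying assumptions; see Arai & Ichikawa, NeurIPS 2025 for QEP's actual formulation. The helpers `rtn` and `relu` are hypothetical and not part of OneComp's API.

```python
import numpy as np


def rtn(w, bits):
    """Per-row symmetric round-to-nearest fake quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def relu(x):
    return np.maximum(x, 0.0)


rng = np.random.default_rng(0)
d, n = 32, 512
x1 = rng.standard_normal((d, n))              # calibration inputs, one column per token
w1 = rng.standard_normal((d, d)) / np.sqrt(d)
w2 = rng.standard_normal((d, d)) / np.sqrt(d)

# Layer 1 is quantized first; its error changes what layer 2 actually sees.
x2 = relu(w1 @ x1)                  # activations in the original model
x2_hat = relu(rtn(w1, 3) @ x1)      # activations in the partially quantized model
target = w2 @ x2                    # layer-2 output of the original model

# Plain layer-wise approach: keep W2 and ignore the upstream error.
err_plain = np.linalg.norm(target - w2 @ x2_hat)

# Error-aware refit: W2' = argmin_W ||W2 X2 - W X2_hat||_F, solved by least squares.
w2_corr = np.linalg.lstsq(x2_hat.T, target.T, rcond=None)[0].T
err_corr = np.linalg.norm(target - w2_corr @ x2_hat)

# Either weight matrix is then quantized in the usual way.
err_plain_q = np.linalg.norm(target - rtn(w2, 8) @ x2_hat)
err_corr_q = np.linalg.norm(target - rtn(w2_corr, 8) @ x2_hat)
print(f"before rounding: {err_plain:.3f} -> {err_corr:.3f}")
print(f"after 8-bit RTN: {err_plain_q:.3f} -> {err_corr_q:.3f}")
```

The refit can only lower the pre-rounding error, because the original W2 is one of the candidates the least-squares problem considers; how much of that gain survives rounding depends on the bitwidth and the data.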
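The assignment problem behind AutoBit can be stated compactly: choose one bitwidth per layer so that the total estimated quantization error is minimal while the total weight memory stays within the budget. OneComp solves this with an ILP; the sketch below instead uses exhaustive search, which finds the same optimum for a toy problem of four layers and avoids a solver dependency. The layer names and sizes, the sensitivity factors, the candidate bitwidths, and the error model (error shrinking roughly 4x per extra bit) are all illustrative assumptions, not AutoBit's internals.

```python
import itertools

# Hypothetical per-layer statistics: (parameter count, sensitivity factor).
layers = {
    "attn.q_proj": (4096 * 4096, 1.0),
    "attn.o_proj": (4096 * 4096, 2.5),
    "mlp.up_proj": (4096 * 11008, 0.8),
    "mlp.down_proj": (11008 * 4096, 3.0),
}
candidate_bits = (2, 3, 4, 8)


def est_error(sensitivity, bits):
    # Assumed error model: uniform-quantization MSE shrinks ~4x per extra bit.
    return sensitivity * 4.0 ** (-bits)


def assign_bits(layers, budget_bits):
    """Exhaustively solve: min sum(error) s.t. sum(params * bits) <= budget_bits."""
    names = list(layers)
    best, best_err = None, float("inf")
    for choice in itertools.product(candidate_bits, repeat=len(names)):
        mem = sum(layers[n][0] * b for n, b in zip(names, choice))
        if mem > budget_bits:
            continue  # violates the memory budget
        err = sum(est_error(layers[n][1], b) for n, b in zip(names, choice))
        if err < best_err:
            best, best_err = dict(zip(names, choice)), err
    return best, best_err


total_params = sum(p for p, _ in layers.values())
# Budget expressed as an average of 3.5 bits per weight; in practice it would
# be derived from the available VRAM.
budget_bits = 3.5 * total_params
plan, err = assign_bits(layers, budget_bits)
avg = sum(layers[n][0] * b for n, b in plan.items()) / total_params
print(plan, round(avg, 3))
```

In this toy setup the search gives 4 bits to the two most sensitive layers (`attn.o_proj` and `mlp.down_proj`) and 3 bits to the others, using the budget exactly. Exhaustive search grows exponentially with the number of layers, which is why a full model calls for an ILP solver.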
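Rotation preprocessing rests on a simple identity: for an orthogonal matrix H, W x = (W H^T)(H x), so a rotation can be folded into the weights offline while the activations are rotated on the fly, leaving the full-precision output unchanged. What the rotation does change is the value distribution: an outlier in one channel is spread across all channels, which typically makes the tensors easier to quantize. The sketch below uses a fixed, normalized Hadamard matrix and a toy 8-dimensional layer rather than the learned rotations of SpinQuant/OstQuant; `hadamard` is a hypothetical helper, not OneComp's code.

```python
import numpy as np


def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two. H @ H.T == I."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


rng = np.random.default_rng(0)
n = 8
w = rng.standard_normal((4, n))
x = np.ones(n)
x[0] = 100.0                  # one outlier channel in the activations

h = hadamard(n)
w_rot = w @ h.T               # rotation absorbed into the weights offline
x_rot = h @ x                 # rotation applied to the activations online

# The layer output is unchanged because H is orthogonal: W x = (W H^T)(H x).
print(np.allclose(w @ x, w_rot @ x_rot))  # True
print("max |x| before:", np.abs(x).max(), "after:", np.abs(x_rot).max())
```

Here the largest activation magnitude drops from 100 to about 37.8, so a quantizer's range no longer has to stretch around a single extreme value.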

Supported Models

OneComp has been verified with the following model architectures. Other Hugging Face-compatible models may work but are currently untested.

#   Architecture   Verified Models               Status
1   Llama          TinyLlama, Llama-2, Llama-3   ✅ Verified
2   Qwen3          Qwen3-0.6B to 32B             ✅ Verified

Note: Support for additional architectures is planned. Contributions and test reports are welcome.

Quick Example

Quantize a Hugging Face model with a single call. QEP, GPTQ 4-bit quantization, evaluation (perplexity and accuracy), and model saving are all handled automatically:

Python:

from onecomp import Runner

Runner.auto_run(model_id="meta-llama/Llama-2-7b-hf")

Command line:

onecomp meta-llama/Llama-2-7b-hf

For full control over each step, see the step-by-step workflow.


Citation

If you use OneComp in your research, please cite our paper:

OneComp technical report (arXiv preprint coming soon):

@misc{onecomp2026,
  title={TBD},
  author={TBD},
  year={2026},
  note={arXiv preprint coming soon}
}

QEP (Quantization Error Propagation):

@inproceedings{arai2025quantization,
  title={Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization},
  author={Yamato Arai and Yuma Ichikawa},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=a3l3K9khbL}
}

License

Fujitsu One Compression is released under the terms of the LICENSE file included in the repository.

Copyright 2025-2026 Fujitsu Ltd.