# Changelog
## [v1.0.2] 2026-03-31

### Bug Fix

- Fixed `ImportError` when running the `onecomp` CLI without matplotlib installed; `AutoBitQuantizer._visualize()` now catches the import error and logs a warning instead of crashing
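The fallback pattern behind this fix can be sketched as follows. This is a minimal illustration, not the real `AutoBitQuantizer._visualize()` body; the helper name and logger name are assumptions.

```python
import importlib
import logging

logger = logging.getLogger("onecomp")

def visualize_or_warn(backend="matplotlib.pyplot"):
    """Import the plotting backend lazily; log a warning and return None
    instead of crashing when it is missing (mirrors the described fix)."""
    try:
        plt = importlib.import_module(backend)
    except ImportError:
        logger.warning(
            "matplotlib is not installed; skipping visualization. "
            "Install the 'visualize' extra to enable plots."
        )
        return None
    return plt
```

With this shape, code paths that call the visualizer keep working in environments without the optional dependency.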
## [v1.0.1] 2026-03-31

### Packaging

- Moved `matplotlib` from the `dev` extra to a new `visualize` extra in `pyproject.toml`
- Made the `visualize_bit_assignment` import lazy in `onecomp/quantizer/autobit/__init__.py` to avoid requiring matplotlib at import time
- Updated installation instructions in `README.md` and `docs/getting-started/installation.md` to reflect the new `visualize` extra
- Updated `uv.lock`
## [v1.0.0] 2026-03-31

### PyPI Publishing Setup

- Added PyPI metadata to `pyproject.toml`: `keywords`, `classifiers`, and `project.urls` (Homepage, Documentation, Repository, Bug Tracker, Changelog)
- Removed the `gemlite` optional-dependency extra that used direct git URLs (PEP 440 violation); equivalent packages are already in the main `dependencies`
- Added `.github/workflows/publish.yml`: automated PyPI publishing via Trusted Publishers (OIDC) on GitHub Release
- Updated `README.md`: installation command changed from `pip install git+<URL>` to `pip install onecomp`
- Added `dist/` and `build/` to `.gitignore`
- Updated `uv.lock`
### Default Parameter Changes

- Changed `Runner.__init__` default values for calibration parameters:
    - `max_length`: 512 → 2048
    - `num_calibration_samples`: 128 → 512
- Pinned the old default values explicitly in all `example/` and `tests/` files that previously relied on the defaults
### Documentation

- Updated `docs/user-guide/configuration.md` to reflect the new default values for `max_length` and `num_calibration_samples`
- Added a quantizer feature support table to `docs/user-guide/basic-usage.md` and `docs/api/quantizers/base.md`
    - Documents which quantizers support `save_quantized_model()` / `create_quantized_model()` and quantized-model PPL/ACC evaluation
    - Currently supported: GPTQ, DBF, AutoBitQuantizer (requires `get_quant_config()` and `create_inference_layer()`)
    - Unsupported quantizers (RTN, JointQ, QUIP, CQ, ARB, QBB, Onebit): PPL/ACC evaluation automatically falls back to the dequantized (FP16) model
- Updated the perplexity/accuracy evaluation note in `basic-usage.md` to reflect AutoBitQuantizer support and fallback behavior
## [v0.5.0] 2026-03-30

### New Feature: Post-quantization Workflow

- Added `PostQuantizationProcess` abstract base class (`onecomp/post_process/_base.py`)
    - Defines the interface for post-quantization operations (e.g. block-wise PTQ, fine-tuning)
- Added a `post_processes` parameter to `Runner.__init__`
    - Accepts a list of `PostQuantizationProcess` instances
    - After quantization, builds a quantized model on CPU and executes each process in order
    - The processed model is stored as `self.quantized_model`
- Updated `Runner.calculate_perplexity` and `Runner.calculate_accuracy` to use `self.quantized_model` if available (GPU transfer is handled automatically; `device="auto"` is resolved to `"cuda"`)
- Added a LoRA SFT post-process implementation (`onecomp/post_process/post_process_lora_sft.py`)
    - Provides learning-based post-quantization fine-tuning for GPTQ-quantized models
    - Public API is exposed as `PostProcessLoraSFT`
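The workflow described above can be pictured with a minimal sketch of the abstract base class and the order-preserving execution loop. Only the class and parameter names come from this changelog; the `run` signature and the toy process are assumptions.

```python
from abc import ABC, abstractmethod

class PostQuantizationProcess(ABC):
    """Interface for post-quantization operations (e.g. block-wise PTQ,
    fine-tuning). The `run` signature here is an assumption."""

    @abstractmethod
    def run(self, model):
        """Transform and return the quantized model."""

class AppendTag(PostQuantizationProcess):
    # Toy process standing in for something like PostProcessLoraSFT.
    def run(self, model):
        return model + "+tagged"

def apply_post_processes(model, post_processes):
    # Mirrors the described Runner behavior: execute each process in
    # order; the final result becomes the stored quantized model.
    for proc in post_processes:
        model = proc.run(model)
    return model
```

A user would pass instances via `Runner(post_processes=[...])`; the ABC guarantees each entry implements the expected interface.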
### New Feature: Rotation Preprocessing Pipeline (`onecomp/pre_process/`)
SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Supports Llama and Qwen3 architectures.
- Added `prepare_rotated_model()` (`onecomp/pre_process/prepare_rotated_model.py`): End-to-end pipeline — model loading → rotation/scaling training → rotation application → saving
    - Memory-optimized: moves the model between CPU/GPU to reduce peak memory (e.g. Qwen3-32B: ~128 GB → ~64 GB)
- Added `RotatedModelConfig` (`onecomp/rotated_model_config.py`): a `ModelConfig` subclass that automatically registers Hadamard hooks on `down_proj` layers during `load_model()`
- Added the `onecomp/pre_process/` package:
    - `train_rotation.py`: Training pipeline with `PreprocessManager` (R1/R2/S_* tensor management), an HF `Trainer` subclass, and `apply_preprocess_train` / `apply_preprocess_eval`
    - `optimizer.py`: `SGDG` — SGD on the Stiefel manifold with Cayley-retraction orthogonal updates (ported from SpinQuant)
    - `quant_models.py`: `WeightQuantizer` (RTN proxy) with per-channel / per-tensor / group-wise quantization; quantized decoder layers for Llama and Qwen3
    - `rotation_utils.py`: `fuse_layer_norms`, `rotate_model`, `register_online_hadamard_hooks`
    - `hadamard_utils.py`: Hadamard transform utilities and pre-computed matrices (ported from QuIP#)
    - `modeling_llama.py` / `modeling_qwen3.py`: Custom `ForCausalLM` classes that propagate R1 through the forward pass during training
    - `preprocess_args.py`: `TrainingArguments` subclass with SGDG-specific LR/momentum fields
- Fixed `_PreprocessTrainer` to override `create_optimizer()` instead of `create_optimizer_and_scheduler()` for transformers >= 5.x compatibility (the SGDG optimizer was silently replaced by AdamW)
- Updated `Runner.save_dequantized_model()` and `Runner.save_quantized_model()` to warn when saving models loaded with additional preprocessing (e.g. Hadamard hooks)
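The key property of the `SGDG` optimizer's Cayley-retraction step (ported from SpinQuant) is that rotation matrices stay exactly orthogonal throughout training. A minimal numpy sketch of one such update on a square matrix; the step direction and notation are illustrative, not the repo's implementation:

```python
import numpy as np

def cayley_step(W, G, lr=0.1):
    """One Cayley-retraction update on the Stiefel manifold (square case).

    W: current orthogonal matrix; G: Euclidean gradient of the loss w.r.t. W.
    With skew-symmetric A = G W^T - W G^T, the Cayley transform
    (I + lr/2 * A)^-1 (I - lr/2 * A) is orthogonal, so the updated W
    remains orthogonal without any re-orthogonalization step.
    """
    n = W.shape[0]
    A = G @ W.T - W @ G.T          # skew-symmetric: A^T == -A
    I = np.eye(n)
    Q = np.linalg.solve(I + (lr / 2) * A, I - (lr / 2) * A)
    return Q @ W
```

This is why rotation training can run with a plain SGD loop: the retraction keeps the iterate on the manifold at every step.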
### Added JointQ Quantizer

- Added a new `JointQ` quantizer (`onecomp/quantizer/jointq/`)
    - Local-search-based post-training quantization method that minimizes ||Y − ŴXᵀ||_F²
    - Supports both symmetric and asymmetric quantization (1–4 bits)
    - Group-wise quantization with configurable group size
    - Tikhonov regularization against over-fitting (XᵀX + nλI)
    - Three initialization strategies: Clip-Optimize, Clip-Optimize with Error Propagation, and GPTQ
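The regularized objective can be made concrete with a short numpy sketch: the layer-wise error ||Y − ŴXᵀ||_F² leads to a normal-equation matrix XᵀX, which the Tikhonov term stabilizes as XᵀX + nλI. The helpers below are illustrative, not the repo's implementation:

```python
import numpy as np

def regularized_hessian(X, lam=0.01):
    """Tikhonov-regularized curvature matrix X^T X + n*lam*I,
    where n is the number of calibration rows in X. The added ridge
    keeps the matrix positive definite even for rank-deficient X."""
    n = X.shape[0]
    return X.T @ X + n * lam * np.eye(X.shape[1])

def layer_error(Y, W_hat, X):
    """Frobenius objective ||Y - W_hat X^T||_F^2 that the local
    search minimizes for a candidate quantized weight W_hat."""
    return float(np.linalg.norm(Y - W_hat @ X.T) ** 2)
```

Without the nλI term, highly correlated calibration activations make XᵀX singular and the local search numerically unstable.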
### AutoBitQuantizer vLLM-compatible `quantization_config`

- `AutoBitQuantizer` now emits a `mixed_gptq`-compatible `quantization_config` (`onecomp/quantizer/autobit/_autobit.py`)
- The ILP solver now enforces fused-layer equality constraints (`onecomp/quantizer/autobit/ilp.py`)
    - vLLM fuses q/k/v → `qkv_proj` and gate/up → `gate_up_proj`
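The effect of the fused-layer constraint can be sketched without the ILP machinery: since vLLM runs each fusion group through one kernel, every member must carry the same bit-width. A toy post-hoc harmonizer (illustrative only; the real constraint is expressed inside the ILP itself):

```python
# vLLM fusion groups: these layers share one fused kernel, so their
# bit-widths must match (group names follow the changelog entry).
FUSION_GROUPS = [
    ("q_proj", "k_proj", "v_proj"),   # fused into qkv_proj
    ("gate_proj", "up_proj"),         # fused into gate_up_proj
]

def harmonize_bits(bits):
    """Force equal bit-width inside each fusion group by taking the
    group maximum (a conservative illustrative choice; the ILP instead
    adds equality constraints so the solution is jointly optimal)."""
    out = dict(bits)
    for group in FUSION_GROUPS:
        members = [m for m in group if m in out]
        if members:
            b = max(out[m] for m in members)
            for m in members:
                out[m] = b
    return out
```

Encoding the equality inside the ILP, rather than patching afterwards, lets the solver trade bits between groups under the global budget.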
### API changes

- Made `Runner.create_quantized_model()` a public method (renamed from `_create_quantized_model`)
    - Builds a quantized model with quantized inference layers from `quantizer.results`
    - Returns `(model, tokenizer)` for use in evaluation, saving, and post-process workflows
- Added `Runner.save_quantized_model_pt()` for saving post-processed models (e.g. LoRA-applied) as PyTorch `.pt` files
    - Uses `torch.save` to preserve custom module types such as `LoRAGPTQLinear`
    - Saves tokenizer files alongside the model
- Added `QuantizedModelLoader.load_quantized_model_pt()` for loading `.pt`-format models
    - Counterpart to `save_quantized_model_pt`; uses `torch.load` to restore models with custom modules
    - Also available via the `onecomp.load_quantized_model_pt()` convenience alias
### Bug Fix: Onebit Quantizer

- Fixed `Onebit` to declare `flag_calibration=True` and `flag_hessian=True` (`onecomp/quantizer/onebit/_onebit.py`)
    - Previously, Onebit computed the Hessian internally from `input` despite declaring all flags as `False`, causing a crash when used through `quantize_without_calibration` or chunked quantization paths
    - Now uses the Hessian provided by the Runner, consistent with other calibration-based quantizers (GPTQ, DBF, QUIP)
### Quantizer Signature Consistency

- Added an `input=None` default to `quantize_layer` in `RTN`, `CQ`, and `QBB` (`onecomp/quantizer/{rtn,cq,qbb}/`)
    - Aligns with the base `Quantizer.quantize_layer(self, module, input=None, hessian=None)` signature
    - Enables these quantizers to be used in `Runner(quantizers=[...])` via the chunked quantization path
- Added `input=None, hessian=None` defaults to `Onebit.quantize_layer` for the same reason
### Examples

- Added `example/post_process/example_lora_sft.py`: End-to-end demo — GPTQ 4-bit quantization + LoRA SFT (WikiText-2) + PPL evaluation + save/load with `save_quantized_model_pt` / `load_quantized_model_pt`
- Added `example/post_process/example_lora_sft_knowledge.py`: Knowledge injection demo — teaches the quantized model about "OneCompression" via LoRA SFT and compares generation before/after
- Added `example/post_process/onecomp_knowledge.jsonl`: Training data describing OneCompression for the knowledge injection example
- Added `example/example_jointq.py`: JointQ 4-bit (groupsize=128) quantization example with dequantized-model PPL evaluation
- Added `example/pre_process/example_llama_preprocess_rtn.py`: Rotation preprocessing + RTN quantization (TinyLlama-1.1B)
- Added `example/pre_process/example_preprocess_save_load.py`: Rotation preprocessing + GPTQ quantization → save → load → PPL verification
- Added `example/vllm_inference/example_gptq_vllm_inference.py`: GPTQ + QEP quantization and vLLM inference end-to-end example
- Added `example/vllm_inference/example_autobit_vllm_inference.py`: AutoBit mixed-precision quantization and vLLM inference example
### Documentation

- Added `docs/user-guide/post-process.md`: LoRA SFT user guide covering accuracy recovery, knowledge injection, save/load, key parameters, teacher distillation, intermediate block alignment, and vLLM limitations
- Added `docs/api/post_process.md`: API reference for `PostQuantizationProcess`, `PostProcessLoraSFT`, and convenience variants
- Updated `docs/user-guide/examples.md` with LoRA SFT code examples (accuracy recovery, knowledge injection, save/load)
- Updated `docs/api/runner.md` to include `create_quantized_model` and `save_quantized_model_pt`
- Updated `docs/api/quantized_model_loader.md` to include `load_quantized_model_pt`
- Updated `mkdocs.yml` navigation with the new post-process pages
- Added `docs/user-guide/pre-process.md`: Rotation preprocessing user guide covering workflow, key parameters, save/load, and limitations
- Added `docs/api/pre_process.md`: API reference for `prepare_rotated_model` and `RotatedModelConfig`
- Updated `docs/user-guide/examples.md` with rotation preprocessing code examples (RTN, GPTQ with save/load)
- Updated `docs/api/index.md` with `RotatedModelConfig`, `prepare_rotated_model`, and the `pre_process/` module structure
- Updated `docs/index.md` Key Features with rotation preprocessing
### Tests

- Added a smoke test for `PostProcessLoraSFT` (`tests/onecomp/post_process/test_post_process_lora_sft.py`)
    - Verifies that `PostProcessLoraSFT.run()` completes without error on TinyLlama with minimal settings
    - Checks LoRA layer injection, CPU placement, and eval mode after the run
    - Includes a Runner end-to-end integration test with the `post_processes` parameter
- Expanded and updated unit tests for the DBF quantizer (`tests/onecomp/quantizer/dbf/test_dbf.py`)
    - Extended boundary and abnormal parameter cases; aligned with `BaseQuantizeSpec` and the current DBF API
- Expanded and updated unit tests for the GPTQ quantizer (`tests/onecomp/quantizer/gptq/test_gptq.py`)
    - Extended boundary and abnormal parameter cases; aligned with `BaseQuantizeSpec` and the current GPTQ API
- Adjusted DBF and GPTQ quantizer implementations for test compatibility and consistency (`onecomp/quantizer/dbf/_dbf.py`, `onecomp/quantizer/gptq/_gptq.py`)
- Fixed and improved JointQ unit tests (`tests/onecomp/quantizer/jointq/test_jointq.py`)
    - Use `compute_dequantized_weight()` instead of direct `dequantized_weight` access
    - Override the boundary test to use CUDA with 128×128 layers for group_size compatibility
    - Skip CPU-only tests (JointQ is GPU-based)
    - Fix `batch_size` validation: `>= 0` → `>= 1` (`onecomp/quantizer/jointq/_jointq.py`)
- Improved the JointQ regression test (`tests/onecomp/quantizer/jointq/test_quantize_regression.py`)
    - Replaced exact tensor match with an MSE-based quality check for environment portability
    - Hardcoded the expected MSE in a helper; removed the `.pth` baseline file
## [v0.4.3] 2026-03-26

### Implement AutoBit to automatically determine bit allocation

- Added `AutoBitQuantizer` (`onecomp/quantizer/autobit/_autobit.py`), which automatically assigns an optimal bit-width per module via ILP, taking activation-aware error into account (`onecomp/quantizer/autobit/ilp.py`), with a DBF fallback (`onecomp/quantizer/autobit/dbf_fallback.py`) for ultra-low-bit targets (target bit-width <= 2)
    - The ILP is solved with the SCIP solver (`onecomp/quantizer/autobit/ilp.py`)
- Sequentially loads and forwards each layer to collect activation and curvature statistics (`onecomp/quantizer/autobit/activation_stats.py`, `onecomp/utils/blockwise.py`)
- A usage example is provided in `example/example3.py`
- Added a VRAM auto-estimation utility to derive the target bit-width from available GPU memory (`onecomp/utils/vram_estimator.py`)
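The idea behind deriving a target bit-width from VRAM can be illustrated with back-of-the-envelope arithmetic. The formula and the reserve fraction below are assumptions made for illustration, not the logic of `onecomp/utils/vram_estimator.py`:

```python
def target_bits_from_vram(total_vram_gb, n_params_b, overhead_frac=0.2,
                          choices=(2, 3, 4, 8)):
    """Pick the largest average bit-width whose packed weights fit in
    the VRAM left after reserving `overhead_frac` for activations/KV.

    total_vram_gb: available GPU memory in GB
    n_params_b:    model size in billions of parameters
    """
    budget_bits = total_vram_gb * (1 - overhead_frac) * 8e9   # GB -> bits
    per_param = budget_bits / (n_params_b * 1e9)              # bits available per weight
    fitting = [b for b in choices if b <= per_param]
    return max(fitting) if fitting else min(choices)
```

For example, a 7B model on a 24 GB card has ample headroom for 8-bit weights, while a 32B model on 8 GB is pushed to the 2-bit floor.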
### VLM and Multi-Architecture Support for Architecture-aware QEP

- Extended `_get_blocks` to detect a `language_model` sub-module and restrict the block search to the text decoder (`onecomp/qep/_quantize_with_qep_arch.py`)
    - VLMs (Qwen3-VL, Gemma3, etc.) no longer return vision-encoder blocks
    - CausalLM behaviour is unchanged (falls back to full-model search)
- Added a `__getattr__` proxy to `Catcher` to forward attribute access to the wrapped module (`onecomp/qep/_quantize_with_qep_arch.py`)
    - Prevents `AttributeError` when model code reads decoder-layer attributes (e.g. `attention_type`) before `forward()`
- Changed `get_blocks_and_inputs` to capture block-level kwargs with batch=1 (`onecomp/qep/_quantize_with_qep_arch.py`)
    - Internally generated kwargs (position_embeddings, attention_mask, etc.) are now batch-size-independent
    - Avoids shape mismatches when reused with varying batch sizes in downstream functions
- Added an `expand_kwargs_batch` helper to expand batch=1 kwargs via `Tensor.expand` (zero-copy view) (`onecomp/qep/_quantize_with_qep_arch.py`)
    - Used in `compute_hessian_and_crossterm` and `forward_input` before each block forward call
    - Resolves failures on models requiring exact batch-dimension matching (e.g. Gemma3 sliding-window attention)
- Added early termination and group skipping to `run_quantize_with_qep_arch` (`onecomp/qep/_quantize_with_qep_arch.py`)
    - Groups with no quantization targets are skipped (avoids unnecessary Hessian/cross-term computation)
    - The block loop exits once all target layers are quantized
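The `expand_kwargs_batch` idea (store kwargs at batch=1, expand to the needed batch as a zero-copy view) can be sketched with numpy's `broadcast_to`; the repo uses `Tensor.expand`, and this standalone helper is an illustrative analogue, not the repo's code:

```python
import numpy as np

def expand_kwargs_batch(kwargs, batch_size):
    """Expand every array kwarg captured with batch=1 to `batch_size`
    along dim 0 as a view (no data copy); non-array values pass
    through unchanged."""
    out = {}
    for key, value in kwargs.items():
        if isinstance(value, np.ndarray) and value.shape[0] == 1:
            out[key] = np.broadcast_to(value, (batch_size,) + value.shape[1:])
        else:
            out[key] = value
    return out
```

Because the expanded tensors are views over the batch=1 originals, the same captured kwargs can be reused for any downstream batch size without extra memory.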
### End-to-end CLI tests

- Added `tests/onecomp/test_cli.py`: end-to-end tests that verify the `onecomp TinyLlama/...` CLI runs without errors
    - `test_default_full_run`: full default pipeline (AutoBit + QEP + eval + save) on GPU
    - Variant tests for individual options (`--wbits`, `--no-qep`, `--total-vram-gb`, `--groupsize`, `--save-dir`, etc.) on CPU
    - Variant tests are skipped by default; enable with `RUN_CLI_VARIANT_TESTS=1`
    - Uses `python -m onecomp` to avoid an implicit `uv sync` that could modify the environment
### Fixes

- Fixed a crash when DBF quantization fails with NaN/Inf (`onecomp/quantizer/dbf/_dbf.py`, `onecomp/qep/_quantize_with_qep_arch.py`)
    - `_quantize_with_qep_arch.py`: Catch `ValueError`/`NotImplementedError` from `compute_dequantized_weight()`, log the error, and keep QEP-adjusted weights for the failed layer
- Fixed a GemLite import crash when the PyTorch version is incompatible (`onecomp/quantizer/gemlite.py`)
    - Broadened `except ImportError` to `except (ImportError, AttributeError)` so that GemLite gracefully falls back when `torch` lacks newer dtypes (e.g. `float8_e8m0fnu`)
- Fixed `test_dbf_gemlite.py` to skip when GemLite is unavailable instead of crashing (`tests/vllm-plugins/dbf/test_dbf_gemlite.py`)
### Dependency and documentation updates

- Added `vllm` as an optional dependency (`--extra vllm`) in `pyproject.toml`
    - Prevents environment corruption caused by `uv pip install vllm` being overwritten by a subsequent `uv sync` / `uv run`
- Added `torchvision` to the CUDA extras and `[tool.uv.sources]` in `pyproject.toml` to prevent CUDA version mismatch
- Updated installation docs to reflect the new extras (`README.md`, `docs/getting-started/installation.md`, `docs/user-guide/vllm-inference.md`)
- Updated `uv.lock`
## [v0.4.2] 2026-03-25

### Unit tests for additional quantizers

- Added unit tests for QBB, RTN, QUIP, ONEBIT, CQ, ARB, and JOINTQ
    - New test modules under `tests/onecomp/quantizer/`: `test_qbb.py`, `test_rtn.py`, `test_quip.py`, `test_onebit.py`, `test_cq.py`, `test_arb.py`, `test_jointq.py`
    - Shared test base and helpers updated in `tests/onecomp/quantizer/test_module.py`
    - Quantizer implementations adjusted for test compatibility: `onecomp/quantizer/qbb/`, `onecomp/quantizer/rtn/`, `onecomp/quantizer/quip/`, `onecomp/quantizer/onebit/`, `onecomp/quantizer/arb/`, `onecomp/quantizer/jointq/` (and related `*_impl.py`); minor updates in `onecomp/quantizer/dbf/_dbf.py`, `onecomp/quantizer/gptq/_gptq.py`
### vLLM plugin integration (DBF, Mixed-GPTQ)

- Added a vLLM plugin implementation for DBF and Mixed-GPTQ
    - New `vllm_plugins` package: `vllm_plugins/__init__.py`, DBF and GPTQ plugin entry points (`vllm_plugins/dbf/`, `vllm_plugins/gptq/`)
    - DBF: `vllm_plugins/dbf/vllm_plugin.py` and modules (`vllm_plugins/dbf/modules/gemlite_linear.py`, `vllm_plugins/dbf/modules/naive.py`); shared utilities in `vllm_plugins/utils/module.py`
    - GPTQ: `vllm_plugins/gptq/vllm_plugin.py` for Mixed-GPTQ inference
    - Tests: `tests/vllm-plugins/dbf/test_dbf_gemlite.py`, `tests/vllm-plugins/dbf/test_dbf_naive.py`
    - Package and dependency wiring in `pyproject.toml`
### Fixes

- Mixed-GPTQ: raise an error when quantization bit-widths differ within the same shard (aligns with DBF behavior) (`vllm_plugins/gptq/vllm_plugin.py`)
## [v0.4.1] 2026-03-19

### Mixed GPTQ/DBF Save/Load

- Extended save/load for mixed GPTQ and mixed DBF
    - `QuantizedModelLoader` now loads models with `quant_method` `mixed_gptq` or `mixed_dbf` (`onecomp/quantized_model_loader.py`)
    - `effective_method` treats `mixed_*` as the same tensor format as the base method (gptq/dbf) and resolves per-layer bit-width via `quantization_bits`
    - Load validates `quant_method`, `quantization_bits`, and `modules_in_block_to_quantize` from `config.json`'s `quantization_config`
- GPTQ
    - Added `get_quant_config()` to return a save-time `quantization_config` with vLLM-compatible keys (`onecomp/quantizer/gptq/_gptq.py`)
    - Sets `quant_method` to `mixed_gptq` when `module_wbits` or `mlp_wbits` is present
    - New `onecomp/quantizer/gptq/config.py`: `resolve_gptq_layer_wbits()` resolves per-layer bit-width from `quantization_config` (priority: quantization_bits → module_wbits → mlp_wbits → bits/wbits)
    - `GPTQLinear`: extended to accept a bit-width when restoring from saved state (`onecomp/quantizer/gptq/gptq_layer.py`)
- DBF
    - Added `get_quant_config()` to return a save-time `quantization_config` (`onecomp/quantizer/dbf/_dbf.py`)
    - New `onecomp/quantizer/dbf/config.py`: `resolve_dbf_layer_bits()` resolves per-layer bit-width from `quantization_config` (priority: quantization_bits → module_target_bits → mlp_target_bits → bits)
    - `DoubleBinaryLinear`: added an argument for the target bit-width (for mixed_dbf) (`onecomp/quantizer/dbf/dbf_layer.py`)
- Shared
    - `onecomp/utils/quant_config.py`: added a common helper `get_quant_param()` for the `quantization_config` schema (fetch params by alias keys)
    - Added a `Quantizer.finalize_quant_config_for_save()` hook; subclasses (GPTQ/DBF) inject method-specific metadata (`onecomp/quantizer/_quantizer.py`)
    - Runner: sets `quantization_config` when saving (`onecomp/runner.py`)
### Evaluation and benchmark (Runner and accuracy utils)

- Runner: unified perplexity/accuracy evaluation via `_calculate_evaluation()` and added optional `dequantized_model` evaluation (`onecomp/runner.py`)
- BREAKING: `calculate_perplexity()` / `calculate_accuracy()` now return a 3-tuple `(original, dequantized, quantized)` instead of a 2-tuple `(original, quantized)`. Existing code using `orig, quant = runner.calculate_perplexity()` must be updated to unpack three values. (`onecomp/runner.py`)
- BREAKING: the `calculate_perplexity()` / `calculate_accuracy()` default for `original_model` changed from `True` to `False`. To evaluate the original model, pass `original_model=True` explicitly. (`onecomp/runner.py`)
- Benchmark: `benchmark_perplexity()` / `benchmark_accuracy()` now accept `dequantized_model` and `quantized_model` arguments. When `dequantized_model=True`, the result dict includes `"{name}_dequantized"` keys. (`onecomp/runner.py`)
- lm_eval: added a helper to create `HFLM` while temporarily disabling `model.config.quantization_config` for compatibility (`onecomp/utils/accuracy.py`)
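Code written against the old 2-tuple API needs a small migration; sketched here with a stub standing in for the real `Runner` (the numeric values and the stub itself are illustrative, not library behavior):

```python
class StubRunner:
    """Stand-in illustrating the new return shape; not the real class."""
    def calculate_perplexity(self, original_model=False):
        # v0.4.1+: returns (original, dequantized, quantized);
        # original is None unless explicitly requested (new default).
        original = 12.3 if original_model else None
        return original, 11.8, 11.9

runner = StubRunner()
# Before v0.4.1:  orig, quant = runner.calculate_perplexity()
# After v0.4.1: unpack three values, and opt in to the original model:
orig, deq, quant = runner.calculate_perplexity(original_model=True)
```

Without `original_model=True`, the first tuple element is simply `None`, so callers that skipped the original-model evaluation need no extra compute.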
### Dequantized-weight API and compatibility fixes

- Implemented `compute_dequantized_weight()` for the GPTQ and DBF quantizers (`onecomp/quantizer/gptq/_gptq.py`, `onecomp/quantizer/dbf/_dbf.py`)
- Removed `dequantized_weight` from the Result classes and switched call sites to compute it via `compute_dequantized_weight()` (`onecomp/quantizer/_quantizer.py`, `onecomp/runner_methods/*`)
- Fixed compatibility for quantization methods other than DBF/GPTQ in the runner and QEP paths (`onecomp/runner.py`, `onecomp/qep/_quantize_with_qep*.py`)
- Updated unit tests accordingly (`tests/onecomp/test_qep_general_consistency.py`)
### auto_run / CLI improvements

- `Runner.auto_run()`: added an `eval_original_model` parameter to optionally evaluate the original (unquantized) model's perplexity and accuracy (default: `False`) (`onecomp/runner.py`)
- `Runner.auto_run()`: evaluation now only computes quantized-model metrics by default; pass `eval_original_model=True` to include original-model metrics
- CLI: added an `--eval-original` flag to the `onecomp` command (`onecomp/cli.py`)
### GPU memory optimization for model saving

- `save_quantized_model()` / `save_dequantized_model()` now load the base model on CPU (`device_map="cpu"`) instead of GPU when building the save artifact (`onecomp/runner.py`). Previously the full original model was loaded onto GPU, which was unnecessary for saving and could cause OOM on memory-constrained setups.
### Bug fix: Architecture-aware QEP group alignment

- Fixed a non-deterministic crash in `compute_hessian_and_crossterm` caused by `groups_q` and `groups_f` being ordered differently (`onecomp/qep/_quantize_with_qep_arch.py`)
    - `make_grouped_module` groups modules by tensor identity (`id()` + `data_ptr()`), but after `copy.deepcopy` the CUDA memory allocator can assign different addresses, causing group misalignment between the quantized and full-precision blocks. `groups_f` is now derived from `groups_q` by module-name lookup instead of independent grouping.
### Other fixes in this release

- Refactored the runner evaluation paths and fixed benchmark-based evaluation behavior (`onecomp/runner.py`, `onecomp/utils/accuracy.py`)
- Examples: updated to pass `original_model=True` and `quantized_model=True` explicitly, and to unpack the new triple return value (`example/example1.py`, `example/example2.py`)
## [v0.4.0] 2026-03-20

### New Feature: `Runner.auto_run()` Classmethod

- Added a `Runner.auto_run()` classmethod for one-liner quantization (`onecomp/runner.py`)
    - Handles model loading, GPTQ quantization with QEP, evaluation (perplexity + accuracy), and model saving in a single call
    - Parameters: `model_id`, `wbits` (default: 4), `groupsize` (default: 128), `device`, `qep` (default: True), `evaluate` (default: True), `save_dir` (default: "auto")
    - Returns the configured `Runner` instance for further analysis
- Made the `model_config` parameter optional in `Runner.__init__()` (default: `None`) to allow `Runner()` without arguments
### New Feature: onecomp CLI Command

- Added an `onecomp` CLI command for terminal-based quantization (`onecomp/cli.py`)
    - Usage: `onecomp <model_id> [--wbits N] [--groupsize N] [--device DEV] [--no-qep] [--no-eval] [--save-dir DIR]`
    - Thin wrapper around `Runner.auto_run()`
- Added `onecomp/__main__.py` for `python -m onecomp` support
- Registered a `console_scripts` entry point in `pyproject.toml`
### New Example

- Added `example/example_auto_run.py` demonstrating one-liner quantization with `Runner.auto_run()`
### Documentation

- Updated `docs/index.md`: the Quick Example now shows `auto_run` and the CLI in a tabbed view
- Restructured `docs/getting-started/quickstart.md`: `auto_run` / CLI as the fastest path, with the step-by-step workflow below
- Updated `docs/getting-started/installation.md`: added `onecomp` command examples to the Running Commands section
- Updated `docs/user-guide/basic-usage.md`: added a "Quick Path: `Runner.auto_run()`" section
- Updated `docs/user-guide/examples.md`: added `auto_run` and CLI examples at the top
- Added `docs/user-guide/cli.md`: full CLI reference with all options and usage examples
- Updated `docs/api/runner.md`: added `auto_run` to the mkdocstrings members
- Updated `docs/api/index.md`: added `cli.py` and `__main__.py` to the Module Structure
- Updated `mkdocs.yml`: added the CLI page to the navigation
- Added a "Building Documentation Locally" section to `README.md`
### Python Version Constraint

- Restricted `requires-python` to `">=3.12, <3.14"` in `pyproject.toml`
    - PyTorch does not yet provide wheels for Python 3.14, causing `uv sync` to fail when uv auto-selects CPython 3.14
- Updated `uv.lock` to reflect the new Python version constraint
## [v0.3.7] 2026-03-16

### GPU Memory Optimization for Architecture-aware QEP

- Added a `device` field to `QEPConfig` (`onecomp/qep/_qep_config.py`)
    - Specifies the GPU device for block-wise QEP computation (default: `"cuda:0"`)
    - Eliminates the dependency on `model_config.device` and supports multi-GPU environments
- Added a `device_map` parameter to `ModelConfig.load_model()` (`onecomp/model_config.py`)
    - Allows overriding device placement at load time without affecting existing callers
- Optimized `run_quantize_with_qep_arch` to avoid loading the entire model onto GPU (`onecomp/qep/_quantize_with_qep_arch.py`)
    - The model is now loaded on CPU via `load_model(device_map="cpu")`
    - Calibration data is prepared on CPU
    - Only individual transformer blocks are moved to GPU during processing
- Added a `StopForward` exception and modified `Catcher` to halt the forward pass immediately after capturing first-block inputs, avoiding unnecessary computation through the remaining layers (`onecomp/qep/_quantize_with_qep_arch.py`)
- Added a `move_kwargs_to_device` helper to recursively move keyword arguments to the target device (`onecomp/qep/_quantize_with_qep_arch.py`)
- Fixed an `UnboundLocalError` when a module in a group is not registered in `quantizer.module_to_name` (`onecomp/qep/_quantize_with_qep_arch.py`)
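The `Catcher` / `StopForward` pattern is a common trick in block-wise PTQ: wrap the first block, record its inputs, then abort the rest of the forward pass. A framework-free sketch of the idea (the real classes wrap `nn.Module`s and capture kwargs like attention masks):

```python
class StopForward(Exception):
    """Raised once the first block's inputs have been captured."""

class Catcher:
    def __init__(self, block, storage):
        self.block = block          # wrapped block (kept for later restore)
        self.storage = storage      # list collecting captured inputs

    def __call__(self, *args, **kwargs):
        # Record the inputs destined for the wrapped block, then abort
        # the rest of the (expensive) forward pass.
        self.storage.append((args, kwargs))
        raise StopForward

def run_model(blocks, x):
    # Toy "model": blocks applied sequentially.
    for block in blocks:
        x = block(x)
    return x

captured = []
blocks = [lambda v: v + 1, lambda v: v * 2]
blocks[0] = Catcher(blocks[0], captured)
try:
    run_model(blocks, 10)
except StopForward:
    pass  # expected: capture done, remaining blocks never ran
```

Catching `StopForward` at the call site is what makes the capture cheap: only the embedding/prologue work before the first block is ever executed.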
## [v0.3.6] 2026-03-12

### Completion of Save/Load Pipeline

- Added a new `QuantizedModelLoader` class (`quantized_model_loader.py`)
    - Automatically detects the quantization config (GPTQ/DBF) from `config.json` and loads the model
    - Reads the state_dict from safetensors, replaces layers with quantized layers, and loads into an empty model
    - Supports automatic device placement via `accelerate`
    - Top-level API: exported as `onecomp.load_quantized_model()`
- Added `GPTQLinear.from_saved_state()` (reconstructs the layer from a safetensors state_dict)
- Added `DoubleBinaryLinear.from_saved_state()` (same as above)
- Revised the `config.json` output format to enable direct inference with vLLM
    - Added the list of quantized layer names to `modules_in_block_to_quantize`
### Forward Implementation for DoubleBinaryLinear and GPTQLinear

- `GPTQLinear.forward()`: unpacks bit-packed weights → dequantizes → infers via `F.linear()` (fast path when using GemLite)
- `DoubleBinaryLinear.forward()`: implements a 5-stage pipeline (scaling0 → binary_B → scaling2 → binary_A → scaling4) (GemLite compatible)
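The unpack → dequantize part of that path can be illustrated with a numpy sketch of 4-bit nibbles packed into 32-bit words. The packing layout below is an assumption for illustration; GPTQ implementations differ in nibble order and word type:

```python
import numpy as np

def pack_4bit(values):
    """Pack 4-bit ints (0..15), 8 per 32-bit word, low nibble first
    (illustrative layout)."""
    values = np.asarray(values, dtype=np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return (values << shifts).sum(axis=1, dtype=np.uint32)

def unpack_4bit(words, scale, zero):
    """Unpack 32-bit words back to 4-bit ints and dequantize:
    w_fp = (q - zero) * scale. A quantized linear layer would then
    run the usual matmul on the dequantized weights."""
    words = np.asarray(words, dtype=np.uint32)[:, None]
    shifts = np.arange(8, dtype=np.uint32) * 4
    q = (words >> shifts) & 0xF
    return (q.reshape(-1).astype(np.float32) - zero) * scale
```

Packing 8 nibbles per word is what gives 4-bit storage its 8x memory saving over FP32 weights before any kernel-level tricks.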
### Expansion of Unit Tests

- Added a new common test base class `BaseQuantizeSpec` (`test_module.py`)
    - `test_quantize_layer_returns`: validates type, shape, device, and dtype of quantization results (CPU/CUDA)
    - `test_quantize_layer_reproducibility`: validates reproducibility with the same seed
    - `test_parameters_boundary`: confirms correct behavior with boundary parameter values
    - `test_parameters_abnormal_values_raise`: confirms exceptions are raised for abnormal parameters
    - `test_cpu_gpu_output_match`: validates that CPU/GPU quantization results match
    - `test_quantize_error`: validates that quantization error is within tolerance on a 2-layer model
    - `test_forward_error`: validates the forward accuracy of the inference layer (dequantized output vs inference-layer output)
- Added dedicated test classes for GPTQ and DBF (`test_gptq.py`, `test_dbf.py`)
### Fixes to DBF and GPTQ Quantizers

- Added a parameter-validation mechanism via `validate_params()` during `setup()` for `DBF` and `GPTQ`
- Unified and revised the dtype (FP16/INT32) and device (CPU) of quantization results
### Build System Updates

- Migrated package and project management to `uv` and `pyproject.toml`
- Applied the `black` formatter to scripts
### QEP Module Refactoring

- Added a `QEPConfig` dataclass (`onecomp/qep/_qep_config.py`)
- Extracted the `quantize_with_qep` logic into a standalone function (`onecomp/qep/_quantize_with_qep.py`)
- Added a `general` flag to `QEPConfig` for dispatching between the generic and architecture-aware implementations
- Added a stub for architecture-aware QEP quantization (`onecomp/qep/_quantize_with_qep_arch.py`)
- Implemented architecture-aware QEP quantization with a block-wise sequential pipeline (`onecomp/qep/_quantize_with_qep_arch.py`)
    - Added helper functions: `_get_blocks`, `get_blocks_and_inputs`, `make_grouped_module`, `compute_hessian_and_crossterm`, `forward_input`
    - Added a `Catcher` class for capturing input activations of transformer blocks
    - Groups layers sharing the same input activations for efficient Hessian/cross-term computation
- Extended `Quantizer.quantize_with_qep()` and `adjust_weight()` to accept precomputed `hessian` and `delta_hatX` (`onecomp/quantizer/_quantizer.py`)
- Fixed `_record_quantization_error` to handle `quant_input_activation=None` for architecture-aware QEP (`onecomp/quantizer/_quantizer.py`)
- Fixed architecture-aware QEP to respect `num_layers` and layer selection by checking `quantizer.module_to_name` (`onecomp/qep/_quantize_with_qep_arch.py`)
- Fixed architecture-aware QEP to support `exclude_layer_keywords`: excluded layers are quantized without weight correction (`onecomp/qep/_quantize_with_qep_arch.py`)
- Added a consistency test between the generic and architecture-aware QEP implementations (`tests/onecomp/test_qep_general_consistency.py`)
- BREAKING: changed the `QEPConfig.general` default from `True` to `False` (the architecture-aware implementation is now the default)
### GPTQ Refactoring (`onecomp/quantizer/gptq/_gptq.py`)

- BREAKING: changed the default `sym` from `False` to `True` (symmetric quantization) for both the `GPTQ` class and the `run_gptq()` function. Code relying on the previous asymmetric default must now explicitly pass `sym=False`.
- Expanded the `GPTQ` class docstring with full attribute descriptions and usage examples
- Renamed the `H` parameter to `hessian` in `run_gptq()` for clarity
- Renamed the local variable `W` to `matrix_W` in `run_gptq()` for clarity
- Changed imports to `from` style (`from torch import nn`, `from transformers import Conv1D`)
- Refactored `GPTQExcecutor.__init__`: replaced `register_buffer` with explicit `None` initialization for all attributes
- Added docstrings to the `GPTQExcecutor.quantize()`, `enabled()`, and `ready()` methods
- Updated `test_gptq.py` boundary/abnormal parameters to reflect the new `sym=True` default
## [v0.3.5] 2026-03-05
- Based on v0.3.4 codebase
- Difference from v0.3.4: Changed comments to English