Changelog

[v1.2.2] 2026-07-10

Bug Fix

Fixed device="auto" evaluation crashes in Runner.calculate_perplexity() and related paths by adding ModelConfig.get_device() and using a resolved torch.device for PyTorch operations such as model.to() and empty_cache(). Hugging Face device_map="auto" remains unchanged for model loading, but PyTorch no longer receives the raw "auto" string.
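
The fix comes down to turning the user-facing "auto" string into a concrete device before it reaches PyTorch. A minimal sketch of such a resolution step follows. The helper name resolve_device() and the CUDA, then MPS, then CPU priority order are assumptions made for illustration; the actual ModelConfig.get_device() may behave differently. Availability flags are passed in explicitly so the example runs without PyTorch installed.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map a user-facing device string to a concrete device name.

    Hypothetical helper for illustration; not the OneComp implementation.
    """
    if requested != "auto":
        # Explicit choices ("cpu", "cuda:1", "mps", ...) pass through unchanged.
        return requested
    # Assumed priority order: CUDA first, then MPS, then CPU.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# In real code the flags would come from torch.cuda.is_available() and
# torch.backends.mps.is_available(), and the result would be wrapped in
# torch.device(...) before being passed to model.to().
print(resolve_device("auto", cuda_available=False, mps_available=True))
```

The point of the pattern is that only the loader layer ever sees "auto"; every PyTorch call receives a resolved device.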

[v1.2.1] 2026-07-03

Security

Unsafe deserialization hardening (CWE-502): QuantizedModelLoader.load_quantized_model_pt() (alias onecomp.load_quantized_model_pt()) previously called torch.load(model.pt, weights_only=False) unconditionally, allowing arbitrary code execution when loading a malicious .pt checkpoint. It now refuses to load unless the caller explicitly opts in via allow_unsafe_deserialization=True, and emits a strong warning when it does load. For untrusted models, use the safetensors-based load_quantized_model(), which does not execute code.
Breaking change: existing callers of load_quantized_model_pt() must pass allow_unsafe_deserialization=True for trusted .pt files.
Quantizer.load_results() / ResultLoader: same hardening applied. Loading with weights_only=False now requires allow_unsafe_deserialization=True (added as a ResultLoader field), and logs a warning. The safe weights_only=True path is unchanged.
Updated docstrings, docs, and the LoRA SFT example to document the risk and the required opt-in.
Credit: this unsafe deserialization issue (CWE-502) was responsibly disclosed by Nir Yehoshua, Cipher Security Labs. Thank you for the report.
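
The opt-in pattern described in this section can be sketched with the standard library's pickle module, which carries the same code-execution risk as torch.load(..., weights_only=False). The function load_pickled() below is a hypothetical stand-in, not the OneComp loader; only the parameter name allow_unsafe_deserialization is taken from the notes above.

```python
import io
import pickle
import warnings


def load_pickled(stream, allow_unsafe_deserialization: bool = False):
    """Refuse to unpickle unless the caller explicitly opts in (illustrative)."""
    if not allow_unsafe_deserialization:
        raise ValueError(
            "Refusing to unpickle: pass allow_unsafe_deserialization=True "
            "only for files you trust."
        )
    # Loud warning on the unsafe path, mirroring the behaviour described above.
    warnings.warn("Unpickling can execute arbitrary code.", RuntimeWarning)
    return pickle.load(stream)


payload = pickle.dumps({"wbits": 4})

# Default call is rejected.
try:
    load_pickled(io.BytesIO(payload))
except ValueError as exc:
    print("blocked:", exc)

# Explicit opt-in loads the object (warning silenced here for brevity).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    state = load_pickled(io.BytesIO(payload), allow_unsafe_deserialization=True)
print(state)
```

Making the safe behaviour the default means an unreviewed call site fails closed instead of silently executing checkpoint code.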

[v1.2.0] 2026-06-08

Save/Load Support for JointQ, RTN, and OneBit Quantizers

JointQ: Added get_quant_config(), finalize_quant_config_for_save(), and create_inference_layer() to JointQ class (onecomp/quantizer/jointq/_jointq.py)
Emits quant_method="gptq" to reuse GPTQLinear and vLLM GPTQ plugin (JointQ uses the same scale/zero/assignment structure as GPTQ)
create_inference_layer() converts JointQ's 3D assignment (out_features, num_groups, group_size) to 2D qweight (out_features, in_features) matching GPTQ format, with scale/zero transposition
Handles actorder permutation: restores original column order before passing to GPTQLinear so g_idx is constructed correctly
Symmetric quantization: shifts signed integers [-2^(n-1), 2^(n-1)-1] to unsigned [0, 2^n - 1] for GPTQLinear bit packing
Added bits == 1 warning in validate_params(): GPTQLinear weight packing does not support 1-bit; inference layer must be built with pack_weights=False
Added _build_quantization_bits() static method to emit per-layer quantization_bits metadata for mixed-precision save
RTN: Added get_quant_config(), finalize_quant_config_for_save(), create_inference_layer(), and RTNResult.compute_dequantized_weight() to RTN class (onecomp/quantizer/rtn/_rtn.py)
Emits quant_method="gptq" to reuse GPTQLinear and vLLM GPTQ plugin (RTN uses the same qweight/scales/qzeros tensor format)
compute_dequantized_weight() implements W = (quantized_weight - zero) * scale with per-channel and group-wise paths
create_inference_layer() transposes scale/zero from (out_features, num_groups) to (num_groups, out_features) for GPTQLinear compatibility
Added _build_quantization_bits() static method for per-layer metadata
OneBit: Added get_quant_config(), finalize_quant_config_for_save(), create_inference_layer(), and OnebitResult.compute_dequantized_weight() to Onebit class (onecomp/quantizer/onebit/_onebit.py)
Emits quant_method="onebit" with OneBit-specific parameters (iters, use_importance_scaling, use_balancing, balance_iters, balance_alpha)
compute_dequantized_weight() implements W ≈ a[:, None] * sign * b[None, :]
create_inference_layer() builds OneBitLinear via OneBitLinear.from_quantization_result()
Added _build_quantization_bits() static method for per-layer metadata
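
The three formulas quoted in this section (the symmetric signed-to-unsigned shift, the RTN dequantization W = (quantized_weight - zero) * scale, and the OneBit reconstruction W ≈ a[:, None] * sign * b[None, :]) can be checked on tiny inputs. The sketch below uses plain Python lists instead of tensors; the function names are illustrative and are not the OneComp API.

```python
def shift_signed_to_unsigned(q, bits):
    """Map signed codes in [-2^(bits-1), 2^(bits-1)-1] to [0, 2^bits - 1]."""
    offset = 1 << (bits - 1)
    return [[v + offset for v in row] for row in q]


def dequantize_groupwise(q, scale, zero, group_size):
    """W[i][j] = (q[i][j] - zero[i][g]) * scale[i][g], with g = j // group_size.

    scale and zero use the (out_features, num_groups) layout.
    """
    return [
        [
            (v - zero[i][j // group_size]) * scale[i][j // group_size]
            for j, v in enumerate(row)
        ]
        for i, row in enumerate(q)
    ]


def dequantize_onebit(a, sign, b):
    """W[i][j] ~= a[i] * sign[i][j] * b[j]."""
    return [[a[i] * s * b[j] for j, s in enumerate(row)] for i, row in enumerate(sign)]


signed = [[-2, -1, 0, 1]]                        # 2-bit signed codes
unsigned = shift_signed_to_unsigned(signed, 2)   # [[0, 1, 2, 3]]

scale = [[0.5, 2.0]]   # one row, two groups of two columns
zero = [[2, 2]]        # zero point equals the shift offset in the symmetric case
w = dequantize_groupwise(unsigned, scale, zero, group_size=2)
print(w)               # [[-1.0, -0.5, 0.0, 2.0]]

print(dequantize_onebit([2.0], [[1, -1]], [0.5, 3.0]))  # [[1.0, -6.0]]
```

Note how the zero point absorbs the shift: after moving the codes to the unsigned range, subtracting zero = 2^(bits-1) recovers the original signed value before scaling.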

Apple Silicon / macOS support

MPS quantization: GPTQ (and AutoBit with GPTQ-only candidates) on device="mps"; cross-platform empty_cache() via new onecomp/utils/device.py (runner.py, quantizer/gptq/_gptq.py, quantizer/_quantizer.py)
MPS device placement (GPTQ on CPU, QEP correction on MPS): With device="mps", run_gptq moves the Hessian and weights to CPU for the full column-wise GPTQ loop (including the inverse-Hessian Cholesky). The reason is not missing Cholesky kernels on MPS (recent PyTorch supports them) but host synchronization: if the loop stayed on MPS, maxq.item() inside quantize() would run once per column, and each call waits for pending MPS work to finish and reads back a single scalar to the host. This is a per-column host sync, not a full matrix copy per column, yet it often makes the MPS loop several times slower than CPU on Apple Silicon (~4× in internal benchmarks with PyTorch 2.12). When QEP weight correction runs (adjust_weight, typically under qep=True), per-layer work such as weight @ delta_hatX stays on MPS; only the Cholesky solve runs on CPU via _safe_cholesky_and_solve (one solve per layer). A full CPU fallback for QEP does not materially improve speed. Calibration forwards may still use MPS. Details: README (macOS / MPS).
MPS inference: load saved quantized models on Mac with QuantizedModelLoader + Transformers generate() (GemLite/vLLM remain Linux + CUDA)
macOS uv sync: added darwin to tool.uv.environments, --extra mps for MPS-enabled PyTorch from PyPI; --extra cpu is Linux-only (pytorch-cpu index); Linux-only markers on CUDA extras (cu118–cu130)

New Feature: Dashboard

Added dashboard/, a browser-based web app for OneCompression on SLURM-managed HPC GPU nodes without Docker: pick a Hugging Face model and quantization settings in the UI, run jobs on the GPU, deploy the quantized checkpoint, and validate inference via chat
Stack: React + Vite frontend (local PC), FastAPI API, Celery worker + user-built Redis, SQLite job DB, per-job output under backend/tmp/quantized/; CUDA quantization via onecomp and chat deploy via a separate vLLM subprocess from the same backend/.venv (onecomp + vllm>=0.21 in pyproject.toml)
Quantization methods exposed in the UI: gptq, autobit, jointq, and auto_run (VRAM-based bitwidth / group size); optional QEP (not with JointQ); fractional bit widths for autobit / auto_run

New Feature: Global PTQ (Post-Training Quantization)

Added GlobalPTQ and GlobalPTQDistributed post-process classes for KL-distillation-based global optimisation of continuous quantization parameters (scales and zeros for GPTQ; scaling factors for DBF)
GlobalPTQ: Single-GPU implementation with cosine-warmup LR scheduling, early stopping, mixed-precision support, and gradient accumulation
GlobalPTQDistributed: Multi-GPU implementation using HuggingFace Trainer + DeepSpeed ZeRO-2, supporting KL divergence and/or NTP loss with automatic best-state rollback
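
For readers unfamiliar with KL distillation, the per-token objective can be sketched as the KL divergence between the teacher (original model) and student (quantized model) output distributions. The direction KL(teacher || student) and the function names below are assumptions for illustration; the notes above do not specify how GlobalPTQ computes its loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


teacher = [1.0, 2.0, 3.0]
print(kl_divergence(teacher, teacher))          # zero up to rounding: identical outputs
print(kl_divergence(teacher, [3.0, 2.0, 1.0]))  # positive: outputs disagree
```

In a global optimisation of this kind, the quantized integer codes stay fixed and only continuous parameters (scales, zeros, scaling factors) receive gradients from this loss.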

Evaluation

Added onecomp.eval and the onecomp-eval CLI: one vLLM server, subprocess evaluators, aggregated summary.json / summary.csv
Added mt_bench (Japanese MT-Bench) and opt-in throughput (TTFT / decode tok/s) evaluators

For Developers: pre-commit

Added .pre-commit-config.yaml with black, isort, and local hooks (no-japanese, copyright-header, no-email-address); install with uv sync --extra dev then pre-commit install (see README)

OneBitLinear Inference Layer Improvements

Added OneBitLinear.from_quantization_result() class method: builds OneBitLinear from OnebitResult (mirrors the pattern used by GPTQLinear and DoubleBinaryLinear) (onecomp/quantizer/onebit/onebit_layer.py)
Added OneBitLinear.from_saved_state() class method: reconstructs OneBitLinear from saved state_dict tensors (a, b, sign_packed, optional bias), using the same cls.__new__ pattern as DoubleBinaryLinear (onecomp/quantizer/onebit/onebit_layer.py)
Removed preunpack parameter from OneBitLinear.__init__() and replace_linear_with_onebit_layer(): sign matrix is now always stored as packed uint8 and unpacked on demand during forward(), matching the DBF inference layer pattern (onecomp/quantizer/onebit/onebit_layer.py)
Normalized buffers to FP16 with detach() in OneBitLinear.__init__() to drop autograd graph
Added _load_from_state_dict() override to clear sign_matrix cache when loading from checkpoint
Extracted _unpack_sign_matrix() helper for sign matrix unpacking logic
Removed unreferenced functions replace_linear_with_onebit_layer() and extract_onebit_weights_for_save() from onebit_layer.py: layer construction is now handled by OneBitLinear.from_quantization_result() / OneBitLinear.from_saved_state(), and save-time weight extraction is covered by the unified create_inference_layer() / state_dict() path (onecomp/quantizer/onebit/onebit_layer.py)
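
Storing the sign matrix packed as uint8 and unpacking it on demand means eight signs share one byte. A minimal sketch of such a packing scheme follows. The least-significant-bit-first order and the function names are assumptions for illustration; the bit layout actually used by OneBitLinear may differ.

```python
def pack_signs(signs):
    """Pack a flat list of +1/-1 values into bytes, 8 signs per byte (LSB first)."""
    if len(signs) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(signs), 8):
        byte = 0
        for bit, s in enumerate(signs[start:start + 8]):
            if s > 0:
                byte |= 1 << bit  # bit set means +1, bit clear means -1
        packed.append(byte)
    return bytes(packed)


def unpack_signs(packed):
    """Inverse of pack_signs: expand each byte back into eight +1/-1 values."""
    return [1 if (byte >> bit) & 1 else -1 for byte in packed for bit in range(8)]


signs = [1, -1, 1, 1, -1, -1, -1, 1]
packed = pack_signs(signs)
print(list(packed))                    # [141]
print(unpack_signs(packed) == signs)   # True
```

Keeping only the packed form in the state dict gives an 8x size reduction over one byte per sign, at the cost of an unpack step in forward().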

QuantizedModelLoader: OneBit Support

QuantizedModelLoader now supports quant_method="onebit" (onecomp/quantized_model_loader.py)
Added OneBitLinear to import and layer replacement logic
Added OneBitLinear.from_saved_state() call path for creating empty OneBit layers during model loading
Hadamard hook registration now recognizes OneBitLinear as a quantized layer class

BlockWisePTQ / CBQ OneBit Optimizer Compatibility

Updated OneBit block-wise and cross-block quantization (CBQ) optimizers to work with packed-only OneBitLinear (onecomp/post_process/_blockwise/onebit_block_optimizer.py, onecomp/post_process/_blockwise/onebit_cbq_optimizer.py)
Reads current sign matrices from sign_packed via my_unpack() when sign_matrix is not present, while still allowing sign_matrix as a temporary optimization override
Writes sign updates back to sign_packed with my_pack() and clears sign_matrix so packed signs remain the single source of truth after hard evaluation, best-state restore, and final updates
Hoisted my_pack / my_unpack imports in the OneBit CBQ optimizer
Clarified OneBitLinear.sign_matrix as a non-persistent temporary override used by optimization flows such as BlockWisePTQ and CBQ (onecomp/quantizer/onebit/onebit_layer.py)

Bug Fix

Fixed GPTQLinear.from_saved_state(): _weight_is_packed now defaults to False when wbits == 1 (JointQ wbits=1 checkpoints are saved with pack_weights=False because GPTQLinear packing does not support 1-bit) (onecomp/quantizer/gptq/gptq_layer.py)
Fixed redundant symmetric shift in RTN inference layer (onecomp/quantizer/rtn/_rtn.py)
Fixed run_onebit() returning False on NaN/Inf detection; now raises ValueError with proper GPU tensor cleanup to prevent OOM cascading (onecomp/quantizer/onebit/onebit_impl.py)
Removed pre-computed dequantized_weight from run_onebit() return dict and OnebitResult; dequantized weight is now computed on demand via compute_dequantized_weight() (onecomp/quantizer/onebit/onebit_impl.py, onecomp/quantizer/onebit/_onebit.py)
QuantizedModelLoader._cast_fp16_to_target_dtype() now skips OneBitLinear in addition to GPTQLinear and DoubleBinaryLinear, so OneBit's fp16 scaling buffers (a, b, bias) are preserved when loading a OneBit-quantized model that requires bfloat16 (e.g. Gemma 3 / Gemma 4 detected via needs_bfloat16). Without this, the post-load safety-net cast rewrote OneBit's stored fp16 metadata to bfloat16, breaking the dtype contract that OneBitLinear.forward relies on (self.a.to(x.dtype) / self.b.to(x.dtype) casts to the activation dtype at compute time). Updated the function's docstring to list OneBitLinear alongside the other quantized layer types whose fp16 metadata is intentionally retained (onecomp/quantized_model_loader.py).

Tests

Enabled inherited test_forward_error tests for JointQ, OneBit, and RTN (previously skipped with "does not support create_inference_layer") (tests/onecomp/quantizer/jointq/test_jointq.py, tests/onecomp/quantizer/onebit/test_onebit.py, tests/onecomp/quantizer/rtn/test_rtn.py)
Added _forward_error_features class attribute to BaseQuantizeSpec for parameterizing layer size in test_forward_error; JointQ overrides to 32 (requires in_features divisible by pack_factor = 32 // wbits) (tests/onecomp/quantizer/test_module.py)
Changed JointQ test default bits from 1 to 2 to match GPTQLinear packing constraints (tests/onecomp/quantizer/jointq/test_jointq.py)
Updated check_equal_results in RTN and OneBit tests to use compute_dequantized_weight() instead of direct dequantized_weight attribute access
Updated apply_quantized_weights in RTN and OneBit tests to use compute_dequantized_weight() with proper dtype preservation
Tightened GPTQ unit test tolerances in tests/onecomp/quantizer/gptq/test_gptq.py so regressions in dequantized-weight error are detected earlier (error < 0.4, max_error < 1.71; previously 0.6 / 2.5) (tests/onecomp/quantizer/gptq/test_gptq.py)
Fixed tests/onecomp/quantizer/test_module.py to feed y_replaced consistently into q_proj / k_proj / v_proj after quantized weights are applied, aligning the replacement-path forward test with the intended residual update flow
Extracted the duplicated attention+MLP forward loop in test_quantize_error into TestModel.forward() (tests/onecomp/quantizer/test_module.py); both the pre-quantization and post-quantization inference paths now call model(inp) directly, eliminating 34 duplicate lines

Dependencies

Pinned the vllm optional dependency to vllm>=0.10,<0.22 in pyproject.toml (and regenerated uv.lock). vLLM 0.22.0 removed the legacy Exllama GPTQ kernel that OneComp's GPTQ serving uses for low bit-widths (2-/3-bit, and 4-/8-bit models that are asymmetric or use desc_act), so vLLM 0.22 and later are not supported — serving affected models on 0.22+ fails at runtime. Documented in docs/user-guide/vllm-inference.md and docs/getting-started/installation.md.

Documentation

Documented save/load and vLLM compatibility for the newly-supported JointQ, RTN, and OneBit quantizers across the docs:
docs/api/quantizers/base.md: moved JointQ, RTN, and Onebit into the supported rows of the "Quantizer Feature Support" table (get_quant_config / create_inference_layer / Save / Quantized PPL/ACC all Yes), and added a new "Saved quant_method and vLLM compatibility" table mapping each quantizer to its emitted quant_method (gptq / mixed_gptq / dbf / onebit) and serving path
docs/user-guide/basic-usage.md: updated the quantized-model evaluation note and the "Quantizer feature support" table to include JointQ/RTN/OneBit, added a quant_method column, and clarified which saved models are vLLM-servable
docs/user-guide/vllm-inference.md: rewrote the "Supported Quantization Methods" table to distinguish vLLM's built-in GPTQ plugin (used for gptq: GPTQ uniform bits, JointQ, RTN) from the OneComp plugins (mixed_gptq, dbf), added a note that Onebit is not vLLM-servable, and listed the GPTQ/JointQ/AutoBit end-to-end examples; split the gptq row so GPTQ/RTN (wbits in {2, 3, 4, 8}) and JointQ (bits in {2, 3, 4}; bits=1 is OneComp load-only with pack_weights=False) document their distinct supported bit-widths
docs/algorithms/jointq.md: added a "Save and Load" section (emits quant_method="gptq", served by vLLM's built-in GPTQ plugin), a note that JointQ bits is limited to {2, 3, 4} for vLLM (the JointQ core quantizer rejects bits > 4) while bits=1 requires an explicit runner.save_quantized_model(..., pack_weights=False) and is OneComp load-only / not vLLM-servable, and clarified that JointQ does not support QEP (qep=False)
docs/algorithms/rtn.md: added a "Save and Load" section, a note that vLLM serving uses wbits in {2, 3, 4, 8} (RTN itself accepts a wider range, but GPTQ-compatible bit packing and vLLM serving are limited to these), and a warning that rotation-preprocessed RTN models cannot be served with vLLM (no online Hadamard transform), though they remain loadable with load_quantized_model()
docs/getting-started/quickstart.md, docs/index.md, README.md: updated quantized-model evaluation and vLLM integration descriptions to include JointQ/RTN/OneBit and reference the GPTQ built-in plugin path

Examples

Added example/vllm_inference/example_jointq_vllm_inference.py: end-to-end JointQ quantization (4-bit, group_size=128) → save → vLLM offline inference. Mirrors the GPTQ vLLM example, uses qep=False (JointQ does not support QEP), and documents the bits >= 2 requirement for vLLM bit-packing. Registered in the README example table.

[v1.1.1] 2026-05-21

New Feature: Quantization progress logging

Added QuantizationProgressTracker (onecomp/utils/quantization_progress.py) that emits a single [progress] INFO line per completed step with done/total, percentage, elapsed time, and a linear ETA estimate; supports an optional thread_safe=True mode for multi-GPU quantization
Added report_progress: bool = True flag to Runner.__init__ (onecomp/runner.py) and to the underlying entry points run_chunked_quantization (onecomp/runner_methods/chunked_quantization.py), run_multi_gpu_quantization / run_quantization_phase (onecomp/runner_methods/multi_gpu_quantization.py), run_quantize_with_qep (onecomp/qep/_quantize_with_qep.py), and run_quantize_with_qep_arch (onecomp/qep/_quantize_with_qep_arch.py) so long quantization runs (calibration, chunked, multi-GPU, QEP) report progress by default; pass report_progress=False for quiet runs
Demoted some INFO-level per-layer / per-chunk logs to DEBUG to avoid duplication with the new [progress] line (still available via logging.basicConfig(level=logging.DEBUG) for deep debugging)
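
A linear ETA estimate assumes the remaining steps take the same average time as the completed ones. The sketch below shows that arithmetic. The function name and the exact line layout are hypothetical; only the fields (done/total, percentage, elapsed time, ETA) come from the description above.

```python
def format_progress(done, total, elapsed_s):
    """Build one progress line with a linear ETA estimate (illustrative format)."""
    pct = 100.0 * done / total
    # Linear extrapolation: average time per finished step times steps left.
    eta_s = elapsed_s / done * (total - done) if done else float("nan")
    return (
        f"[progress] {done}/{total} ({pct:.1f}%) "
        f"elapsed={elapsed_s:.0f}s eta={eta_s:.0f}s"
    )


print(format_progress(8, 32, 120.0))
# [progress] 8/32 (25.0%) elapsed=120s eta=360s
```

In a multi-GPU run the counter update and the log call would need to happen under a lock, which is presumably what the thread_safe=True mode is for.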

Bug fixes: QEP + JointQ validation

Raise a clear error when Runner is configured with qep=True and a quantizer that does not support QEP (currently JointQ). Previously the run failed deep inside quantize_with_qep / adjust_weight with a confusing low-level error. Runner.check() now reports e.g. "Quantizer 'JointQ' (or one of its candidate quantizers) does not support QEP (Quantization Error Propagation). Set qep=False, or use a QEP-compatible quantizer (e.g., GPTQ, DBF, AutoBitQuantizer with QEP-compatible candidates)." Implementation: added flag_qep_supported (default True) on Quantizer, set to False on JointQ, and propagated via AutoBitQuantizer._sync_flags (only True when all candidate quantizers support QEP) (quantizer/_quantizer.py, quantizer/jointq/_jointq.py, quantizer/autobit/_autobit.py, runner.py).
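
The flag propagation described above can be sketched with stand-in classes. These are not the OneComp classes: only the attribute name flag_qep_supported and the "all candidates must support QEP" rule are taken from the entry, and the error text is abbreviated.

```python
class Quantizer:
    flag_qep_supported = True  # default: QEP is supported


class GPTQ(Quantizer):
    pass


class JointQ(Quantizer):
    flag_qep_supported = False


class AutoBit(Quantizer):
    def __init__(self, candidates):
        self.candidates = candidates
        # Supported only when every candidate quantizer supports QEP.
        self.flag_qep_supported = all(c.flag_qep_supported for c in candidates)


def check(quantizer, qep):
    """Fail early with a readable message instead of deep inside the QEP loop."""
    if qep and not quantizer.flag_qep_supported:
        name = type(quantizer).__name__
        raise ValueError(
            f"Quantizer '{name}' (or one of its candidate quantizers) "
            "does not support QEP. Set qep=False."
        )


check(GPTQ(), qep=True)     # passes
check(JointQ(), qep=False)  # passes
try:
    check(AutoBit([GPTQ(), JointQ()]), qep=True)
except ValueError as exc:
    print(exc)
```

Validating at configuration time keeps the failure next to the user's mistake rather than hours into a quantization run.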

Bug fixes: VLM save / load

Runner.save_quantized_model() now copies all auxiliary *.json and *.jinja files (e.g. preprocessor_config.json, processor_config.json, special_tokens_map.json, chat_template.jinja) from the original model directory to the save directory, so the quantized model is fully self-contained for VLM / multimodal inference. Weight tensors (*.safetensors, *.bin, *.pt, *.pth), weight index files, config.json and generation_config.json are skipped, and any file already written by model.save_pretrained / tokenizer.save_pretrained is preserved (runner.py).
Source-model directory resolution (incl. huggingface_hub.snapshot_download fallback for Hub IDs) was extracted into a private helper Runner._resolve_source_model_dir() (runner.py).
load_quantized_model() now re-establishes the lm_head <-> embed_tokens weight tie for models with tie_word_embeddings=True. load_state_dict(..., assign=True) would otherwise leave lm_head.weight as the freshly initialised tensor (typically float16) while embed_tokens.weight got replaced with the checkpoint tensor (typically bfloat16), causing RuntimeError: expected mat1 and mat2 to have the same dtype at the final lm_head matmul during generation. The re-tie is gated on lm_head still being an nn.Linear so it does not interfere when lm_head itself was quantized (quantized_model_loader.py).
load_quantized_model() now reads torch_dtype from config.json when no explicit torch_dtype is passed by the caller, so the empty model is built in the same dtype as the saved checkpoint. Previously it always defaulted to torch.float16, which left non-quantized VLM submodules (e.g. multi_modal_projector in Cohere2Vision) at fp16 whenever load_state_dict(..., assign=True) could not find their key in the state_dict (quantized_model_loader.py).
load_quantized_model() now casts any leftover float16 parameters and buffers of non-quantized modules to model.config.torch_dtype after the lm_head re-tie step. Quantized layers (GPTQLinear, DoubleBinaryLinear) and float32 params (e.g. fp32 LayerNorm in mixed-precision models) are deliberately untouched. This generalises the existing lm_head re-tie to any non-quantized module and fixes the dtype mismatch reported in issue 64-3 (RuntimeError: ... c10::Half != c10::BFloat16 on VLM image features) (quantized_model_loader.py).
Added regression tests tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip) and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
Loosened test_save_load_pipeline_tinyllama.py and test_save_load_pipeline_qwen3.py save/load round-trip threshold from absolute 1e-3 to relative 1% of the per-tensor logits magnitude (tests/onecomp/pre_process/test_save_load_pipeline_*.py). The original absolute bound was below fp16's representable precision once accumulated through the 22-28 decoder layers of TinyLlama / Qwen3, causing the gptq + save_dequantized cases to fail on aarch64 + Blackwell (GB200) where cuBLAS picks slightly different reduction kernels than reference x86_64 / Hopper hosts. The save/load equivalence intent is preserved via the relative comparison, which is robust to platform-specific fp16 rounding noise.
Set gpu_memory_utilization=0.78 explicitly when constructing LLM(...) in example/vllm_inference/example_autobit_vllm_inference.py and example/vllm_inference/example_gptq_vllm_inference.py. The vLLM default 0.92 cgroup-OOMs on UMA hosts (e.g. DGX Spark / GB200, 121.7 GiB UMA) because vLLM's startup memory check fails: the residual quantizer process leaves only ~106 GiB free, which is below 0.92 * 121.7 = 111.96 GiB. 0.78 matches the value already used in tests/vllm_plugins/gptq/test_mixed_gptq_e2e.py and is documented in the workspace slurm-submit.mdc rule.
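
The auxiliary-file copy rule from the first item in this section (copy *.json and *.jinja, skip weights, index files, config.json and generation_config.json, never overwrite) can be sketched as a small filter. The function name, the "*.index.json" pattern for weight index files, and the constant names are assumptions; this is not the code in runner.py.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path

AUX_PATTERNS = ("*.json", "*.jinja")
SKIP_NAMES = {"config.json", "generation_config.json"}
# "*.index.json" as the weight-index pattern is an assumption.
SKIP_PATTERNS = ("*.safetensors", "*.bin", "*.pt", "*.pth", "*.index.json")


def copy_auxiliary_files(src: Path, dst: Path) -> list:
    """Copy auxiliary files from src to dst and return the copied names."""
    copied = []
    for path in sorted(src.iterdir()):
        name = path.name
        if not path.is_file():
            continue
        if not any(fnmatch.fnmatch(name, p) for p in AUX_PATTERNS):
            continue
        if name in SKIP_NAMES or any(fnmatch.fnmatch(name, p) for p in SKIP_PATTERNS):
            continue
        if (dst / name).exists():
            continue  # keep whatever save_pretrained already wrote
        shutil.copy2(path, dst / name)
        copied.append(name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    src.mkdir()
    dst.mkdir()
    for name in (
        "config.json",
        "preprocessor_config.json",
        "chat_template.jinja",
        "model.safetensors",
        "model.safetensors.index.json",
        "tokenizer_config.json",
    ):
        (src / name).write_text("{}")
    (dst / "tokenizer_config.json").write_text("{}")  # already saved earlier
    copied = copy_auxiliary_files(src, dst)

print(copied)  # ['chat_template.jinja', 'preprocessor_config.json']
```

The existence check is what guarantees that files written by save_pretrained take precedence over the originals.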

Logging / observability tweaks

Runner._copy_auxiliary_files() now emits a matter-of-fact INFO-level log when an auxiliary file from the original model directory is not copied because the destination already contains a file of the same name (typically because tokenizer.save_pretrained wrote it just before, or a previous save_quantized_model call did). The new line is symmetrical to the existing Copied %s to save directory entry so the auxiliary-copy step can be audited end-to-end (runner.py).
QuantizedModelLoader._cast_fp16_to_target_dtype() now returns the list of fully-qualified parameter / buffer names whose dtype was actually converted instead of a plain count. The post-load INFO log in load_quantized_model() includes those names so it is obvious which non-quantized submodules were normalised by the safety-net cast (e.g. multi_modal_projector.linear_* in Cohere2Vision). Existing tests are updated accordingly and a new test pins the buffer-name reporting (quantized_model_loader.py, tests/onecomp/runner/test_load_excluded_module_dtype.py, tests/onecomp/runner/test_save_quantized_aux_files.py).
QuantizedModelLoader.load_quantized_model() now detects tie_word_embeddings=True even when the flag is nested in a sub-config (e.g. model.config.text_config.tie_word_embeddings in Llama 3.2-Vision and other torchtune-derived VLMs) by walking one level of sub-configs. Previously the flag was only read from the top-level model.config, so VLMs that placed it in text_config skipped the post-load re-tie; with HF deduplicating lm_head.weight for tied checkpoints, that left lm_head.weight at the empty-model random initial values rather than re-pointing to embed_tokens.weight (quantized_model_loader.py).
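
Walking one level of sub-configs to find tie_word_embeddings can be sketched as follows. The function name and the use of SimpleNamespace in place of a Hugging Face config object are assumptions for illustration, as is the choice to return True when either the top level or any sub-config sets the flag.

```python
from types import SimpleNamespace


def should_retie(config) -> bool:
    """Return True if tie_word_embeddings is set at the top level or one level down."""
    if getattr(config, "tie_word_embeddings", False) is True:
        return True
    # Walk one level of sub-configs (e.g. text_config) and look for the flag there.
    for value in vars(config).values():
        if getattr(value, "tie_word_embeddings", False) is True:
            return True
    return False


top_level = SimpleNamespace(tie_word_embeddings=True)
nested = SimpleNamespace(
    tie_word_embeddings=False,
    text_config=SimpleNamespace(tie_word_embeddings=True),
)
unrelated = SimpleNamespace(
    tie_word_embeddings=False,
    vision_config=SimpleNamespace(hidden_size=1024),
)

print(should_retie(top_level), should_retie(nested), should_retie(unrelated))
# True True False
```

Limiting the walk to one level keeps the check cheap and avoids accidentally matching unrelated deeply nested attributes.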

Tests

Added regression tests for the save/load fixes above: tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip), and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
Added tests/onecomp/test_runner_check.py for the new qep=True validation path: JointQ + qep=True raises a clear ValueError, while JointQ + qep=False and GPTQ + qep=True both pass Runner.check().
Added tests/onecomp/runner/test_load_tied_embeddings.py::test_should_retie_word_embeddings_* unit tests covering top-level, nested-text-config, all-False and unrelated-sub-attribute shapes.

New Contributors

@sotanengel made their first contribution in #13

[v1.1.0] 2026-04-16

Gemma 3 / Gemma 4 & VLM Support

Auto-detect language_model / text_model sub-modules in setup() so only the language model is quantized; vision_tower, audio_tower, etc. are automatically excluded (quantizer/_quantizer.py)
Added unfuse_moe.py: MoE models (e.g. Gemma 4) store all expert weights as fused 3D nn.Parameter tensors (gate_up_proj [E, 2*inter, hidden], down_proj [E, hidden, inter]), but GPTQ and other layer-wise PTQ methods require 2D nn.Linear layers. unfuse_moe_experts() splits the fused tensors into per-expert modules, producing paths like experts.0.gate_proj, experts.0.up_proj, experts.0.down_proj (utils/unfuse_moe.py)
Set quant_method to mixed_gptq for MoE models during save, enabling vLLM to handle a mix of quantized and unquantized expert layers via UnquantizedFusedMoEMethod (runner.py)
Introduced prepare_block_kwargs to reproduce Gemma 4-specific additional inputs during block-wise forward (runner_methods/chunked_quantization.py, qep/_quantize_with_qep_arch.py)
_per_layer_inputs: pre-compute per-layer embeddings for all calibration samples
_position_embeddings_map: hook into rotary_emb to capture position embeddings per layer type
_attention_mask_map: pre-compute masks per layer type via create_causal_mask / create_sliding_window_causal_mask
Updated Catcher.forward to accept *args (Gemma 4 passes per_layer_input as a positional argument)
Added a guard to safely skip KV-shared layers where k_proj / v_proj are never called during forward and X^TX is not accumulated (runner_methods/chunked_quantization.py)
Added token_type_ids (mm_token_type_ids) required by Gemma 4 to calibration data and PPL computation (utils/calibration.py, utils/perplexity.py)
Added model argument to prepare_calibration_dataset; model-specific inputs are appended via add_model_specific_inputs()
Changed model.device to next(model.parameters()).device to support VLM device_map="auto"
Fixed MoE block partitioning (down_proj and router.proj were incorrectly placed in the same block) and relaxed Hessian input shape assertion for 2D tensors after router dispatch
Added layer-suffix fallback lookup for Gemma 3's shared sub-modules where named_modules() paths differ from state_dict() keys (quantized_model_loader.py)
save_quantized_model() now copies processor_config.json from the source model so the quantized model directory is self-contained for multi-modal inference (runner.py)
Added skip logic in vLLM plugin to prevent vision / audio encoder layers from being incorrectly matched to language model quantization configs (vllm_plugins/utils/module.py)
Override ModelConfig dtype to bfloat16 for Gemma 3/4 models whose values exceed the float16 range, preventing performance degradation (model_config.py)
Fixed an issue where non-language-model layers in multi-modal models were included in AutoBit bit allocation
Bumped transformers requirement from >= 5.3.0 to >= 5.5.0 (pyproject.toml)
Gemma 4's model_type: gemma4 is registered in CONFIG_MAPPING starting from 5.5.0 (released 2026-04-02); 5.3.0 fails to load it
Added cu130 extra for the validation environment (NVIDIA B200, CUDA 13.0); under cu128, torch (cu130) and torchvision (cu128) had a CUDA version mismatch
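
The MoE unfusing step described earlier in this section splits a fused gate_up_proj tensor of shape [E, 2*inter, hidden] into per-expert 2D matrices. The sketch below does this with nested lists. It assumes the gate rows come first and the up rows second within each expert, which the notes above do not state; the function name is also hypothetical and this is not the code in utils/unfuse_moe.py.

```python
def unfuse_gate_up(gate_up_proj, inter):
    """Split fused [E][2*inter][hidden] weights into per-expert gate/up matrices.

    Assumes rows [0, inter) are gate_proj and rows [inter, 2*inter) are up_proj.
    """
    modules = {}
    for e, fused in enumerate(gate_up_proj):
        if len(fused) != 2 * inter:
            raise ValueError(f"expert {e}: expected {2 * inter} rows, got {len(fused)}")
        modules[f"experts.{e}.gate_proj"] = fused[:inter]
        modules[f"experts.{e}.up_proj"] = fused[inter:]
    return modules


# Two experts, inter=1, hidden=2.
fused = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
modules = unfuse_gate_up(fused, inter=1)
for path, weight in modules.items():
    print(path, weight)
```

Once each expert projection is a separate 2D matrix, layer-wise methods such as GPTQ can treat it like any other linear layer.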

New Feature: LPCD (Layer-Projected Coordinate Descent)

Added onecomp/lpcd/ sub-package implementing the LPCD unified framework (arXiv:2512.01546) that extends layer-wise PTQ by jointly optimising sub-module groups (QK / VO / MLP / residual) with closed-form and gradient-based solvers
Added benchmark/llama3-8b-lpcd-gptq/: Llama-3-8B LPCD+GPTQ SLURM array benchmark (Hydra config conf/benchmark_llama3-8b.yaml, quant_benchmark.py, README.md with WikiText-2 PPL / lm-eval-harness accuracy / quantization time for 4-bit and 3-bit × {q_proj·k_proj, v_proj·o_proj, up_proj·down_proj, all, residual} on NVIDIA B200)
Added example/example_lpcd_gptq.py: TinyLlama GPTQ 3-bit (groupsize=128) + QEP + LPCD end-to-end example with residual-only closed-form refinement (enable_residual=True, use_closed_form=True) and original / dequantized / quantized perplexity reporting
Updated README.md: added LPCD to Features, Examples, and Citation sections

New Feature: BlockWisePTQ

Implemented BlockWisePTQ.run() pipeline (onecomp/post_process/blockwise_ptq.py)
Phase 1: per-block distillation with teacher model (GPTQ / DBF / OneBit / Generic)
Phase 2: Cross-Block Quantisation (CBQ) sliding-window optimisation (K=2)
Teacher model loaded via model_config.load_model(device_map="cpu")
Calibration inputs collected via Catcher hook on first transformer block
Added onecomp/post_process/_blockwise/ sub-package (9 modules)
helpers.py: collect_layer_inputs, auto_detect_quantization_strategy, get_transformer_layers, layer_kwargs_to_device, etc.
Phase 1 optimisers: gptq_block_optimizer.py, dbf_block_optimizer.py, onebit_block_optimizer.py, generic_block_optimizer.py
Phase 2 CBQ optimisers: gptq_cbq_optimizer.py, dbf_cbq_optimizer.py, onebit_cbq_optimizer.py
All optimisers use float32 promotion, best-state tracking with rollback, and hard MSE evaluation
Set use_gemlite=False in Runner.run_post_processes() (onecomp/runner.py) to avoid GemLite fp16-only Triton kernel incompatibility with float32 block optimisation
Added VLM support for BlockWisePTQ (Qwen3-VL, Qwen2.5-VL, etc.)
helpers.py: get_transformer_layers / _get_language_model_backbone handle model.model.language_model.* path
model_config.py: load_model() falls back to AutoModelForImageTextToText for VLM configs
Fixed Quantizer.calculate_hessian / calculate_delta_hatX (onecomp/quantizer/_quantizer.py): handle 2D activations from OPT-style architectures

Quantizer Unification

Unified scale/zero/integer logic across WeightQuantizer, RTN, and GPTQExcecutor for both symmetric and asymmetric quantisation
WeightQuantizer.configure / find_params / quantize (quant_models.py), STEQuantize.forward (quant_models.py), pseudo_quantize_tensor / quantize (rtn/quantizer.py), GPTQExcecutor.configure / find_params (gptq/_gptq.py)
Added optional MSE grid search (mse, norm, grid) to WeightQuantizer, RTN, and prepare_rotated_model
WeightQuantizer.configure / find_params (quant_models.py), pseudo_quantize_tensor (rtn/quantizer.py), run_rtn (rtn/rtn_impl.py), RTN dataclass / validate_params (rtn/_rtn.py), prepare_rotated_model (prepare_rotated_model.py), apply_preprocess_train / _insert_weight_quantizer (train_rotation.py)
Removed perchannel and maxshrink from public APIs; perchannel=True is now always used internally
Removed from RTN dataclass (rtn/_rtn.py) and prepare_rotated_model (prepare_rotated_model.py). Internally, run_rtn (rtn/rtn_impl.py) and _insert_weight_quantizer (train_rotation.py) pass perchannel=True unconditionally. Low-level APIs pseudo_quantize_tensor (rtn/quantizer.py) and WeightQuantizer.configure (quant_models.py) still accept the parameters
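
An MSE grid search over the clipping range, controlled by parameters named mse, norm, and grid, is commonly implemented by shrinking the min/max range step by step and keeping the range with the lowest reconstruction error. The sketch below shows that idea for one weight row with asymmetric quantization. The default values (norm=2.4, grid=100, maxshrink=0.8) follow the reference GPTQ implementation and are an assumption here, as is the function naming; OneComp's unified logic may differ in detail.

```python
def quant_error(row, scale, zero, maxq, norm):
    """Sum of |dequantized - original|^norm for one row."""
    err = 0.0
    for w in row:
        q = min(max(round(w / scale) + zero, 0), maxq)
        err += abs((q - zero) * scale - w) ** norm
    return err


def find_params(row, bits, mse=False, norm=2.4, grid=100, maxshrink=0.8):
    """Return (scale, zero) for asymmetric quantization of one row."""
    maxq = 2**bits - 1
    xmin, xmax = min(min(row), 0.0), max(max(row), 0.0)
    if xmin == xmax:
        xmin, xmax = -1.0, 1.0  # avoid a zero-width range

    def params(lo, hi):
        scale = (hi - lo) / maxq
        return scale, round(-lo / scale)

    best = params(xmin, xmax)  # plain min/max range
    if mse:
        best_err = quant_error(row, *best, maxq, norm)
        # Try progressively tighter ranges and keep the lowest-error one.
        for i in range(1, int(maxshrink * grid)):
            p = 1 - i / grid
            cand = params(p * xmin, p * xmax)
            err = quant_error(row, *cand, maxq, norm)
            if err < best_err:
                best, best_err = cand, err
    return best


row = [0.0, 0.1, 0.2, 0.3, 10.0]  # one outlier dominates the min/max range
plain = find_params(row, bits=2)
searched = find_params(row, bits=2, mse=True)
print(plain, quant_error(row, *plain, 3, 2.4))
print(searched, quant_error(row, *searched, 3, 2.4))
```

Because the search starts from the min/max range, the result is never worse than the plain range under the chosen error norm.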

Rotation Preprocessing Improvements

Added "random_hadamard" and "hadamard" rotation modes (existing: "random", "identity")
PreprocessManager._ortho (train_rotation.py), _VALID_ROTATION_MODES (prepare_rotated_model.py)
Changed prepare_rotated_model defaults: rotation_mode → "random_hadamard", num_calibration_samples → 512
prepare_rotated_model (prepare_rotated_model.py), PreprocessManager.__init__ (train_rotation.py)
Added input validation (_validate_prepare_rotated_model_params) for all prepare_rotated_model parameters
_validate_prepare_rotated_model_params (prepare_rotated_model.py)
Added per-step and total execution time logging to prepare_rotated_model
prepare_rotated_model (prepare_rotated_model.py): timed sections for model load, calibration prep, training, reload, apply_preprocess_eval, and save
Added explicit gradient_accumulation_steps=1 to TrainingArguments defaults
TrainingArguments.gradient_accumulation_steps (preprocess_args.py)
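
For readers unfamiliar with Hadamard rotations: a normalised Hadamard matrix is orthogonal with entries ±1/√n, and a randomised variant is commonly obtained by flipping the sign of each column at random. The sketch below builds both with the Sylvester construction for power-of-two sizes. That "hadamard" and "random_hadamard" correspond to exactly these two constructions is an assumption; the helper names are hypothetical and this is not the code in PreprocessManager._ortho.

```python
import math
import random


def hadamard(n):
    """Normalised Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def random_hadamard(n, seed=0):
    """Hadamard matrix with each column multiplied by a random sign."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return [[v * s for v, s in zip(row, signs)] for row in hadamard(n)]


def gram(m):
    """Return M @ M^T; the identity matrix when M is orthogonal."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]
```

Both matrices are orthogonal, so applying either as a rotation preserves norms; the random signs only change which directions the rotation mixes.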

AutoBit: per-quantizer groupsize support¶

AutoBitQuantizer supports each candidate quantizer's groupsize individually, enabling mixed group-size configurations (onecomp/quantizer/autobit/_autobit.py)
RTN error evaluation uses per-quantizer grouped quantisation (onecomp/quantizer/autobit/ilp.py)
Added test for mixed group-size autobit (tests/onecomp/quantizer/autobit/test_autobit.py)
Removed the default quantizer from AutoBit; a quantizer must now be provided explicitly (onecomp/quantizer/autobit/_autobit.py)

CalibrationConfig: unified calibration configuration¶

Breaking change: Introduced CalibrationConfig dataclass (onecomp/calibration/calibration_config.py) to consolidate all calibration-related parameters
Runner.__init__ now accepts calibration_config: CalibrationConfig instead of individual parameters (calibration_dataset, max_length, num_calibration_samples, calibration_strategy, calibration_seed, calibration_batch_size, num_layers_per_group)
AutoBitQuantizer now accepts calibration_config: CalibrationConfig instead of num_calib_samples, calib_seqlen, calibration_dataset
prepare_rotated_model() now accepts calibration_config: CalibrationConfig instead of calibration_dataset, max_length, num_calibration_samples, calibration_strategy
BlockWisePTQ now accepts calibration_config: CalibrationConfig instead of num_calibration_samples, max_length, calibration_strategy, calibration_seed
When calibration_config=None, default values are created automatically (calibration_dataset="c4", max_length=2048, num_calibration_samples=512)
New user-configurable parameters exposed via CalibrationConfig: text_key, use_quality_filter, max_documents (previously hard-coded in calibration_data_loader.py)
Added cross-validation in Runner.check(): if both Runner and AutoBitQuantizer specify calibration_dataset, they must match
Removed backward-compatibility re-exports from onecomp/utils/__init__.py (prepare_calibration_dataset, load_c4_for_aligned_chunks, load_c4_for_n_samples_min_length); import from onecomp.calibration instead
Added unit tests for calibration module (tests/onecomp/calibration/)
Internal functions now accept CalibrationConfig directly instead of individual parameters:
prepare_calibration_dataset() (calibration_data_loader.py): replaced 8 individual parameters with calibration_config: CalibrationConfig (required argument)
run_chunked_quantization() (runner_methods/chunked_quantization.py): calibration_dataset, max_length, num_calibration_samples, calibration_strategy, calibration_seed, calibration_batch_size, num_layers_per_group replaced by calibration_config
run_multi_gpu_quantization(), run_capture_phase(), get_calibration_config_dict() (runner_methods/multi_gpu_quantization.py): same consolidation
run_quantize_with_qep() (qep/_quantize_with_qep.py), run_quantize_with_qep_arch() (qep/_quantize_with_qep_arch.py): same consolidation
collect_activation_stats_blockwise() (quantizer/autobit/activation_stats.py): num_samples, seqlen, calibration_dataset replaced by calibration_config
Code quality improvements:
CalibrationConfig.calibration_dataset defaults to "c4" instead of None (no more implicit fallback)
Removed implicit dataset inheritance from quantizer to Runner; use explicit CalibrationConfig instead
Cross-validation uses isinstance(quantizer, AutoBitQuantizer) instead of duck typing
Consolidated from .calibration import CalibrationConfig, prepare_calibration_dataset into single import
Added missing "concat_rand" strategy to prepare_calibration_dataset() docstring
Documented batch_size and num_layers_per_group in CalibrationConfig as chunked-quantization-only parameters
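
The consolidation can be pictured with a stand-in dataclass. The sketch below is not onecomp's CalibrationConfig: it mirrors only the three defaults stated above, uses None as a placeholder for fields whose defaults this changelog does not give, and the helper names resolve and check_datasets_match are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CalibrationConfigSketch:
    """Stand-in for illustration only; not the class in onecomp.calibration."""

    calibration_dataset: str = "c4"     # default stated in the changelog
    max_length: int = 2048              # default stated in the changelog
    num_calibration_samples: int = 512  # default stated in the changelog
    # The defaults of the fields below are not stated in the changelog;
    # None is a placeholder, not the library's value.
    calibration_strategy: Optional[str] = None
    calibration_seed: Optional[int] = None
    text_key: Optional[str] = None


def resolve(config=None):
    """Mirror calibration_config=None: fall back to a default config."""
    return config if config is not None else CalibrationConfigSketch()


def check_datasets_match(runner_cfg, quantizer_cfg):
    """Sketch of the cross-validation rule: both datasets must agree."""
    if runner_cfg.calibration_dataset != quantizer_cfg.calibration_dataset:
        raise ValueError(
            "calibration_dataset mismatch: "
            f"{runner_cfg.calibration_dataset!r} vs "
            f"{quantizer_cfg.calibration_dataset!r}"
        )
```

Passing one object instead of seven keyword arguments means a new calibration option only has to be added in one place, which is presumably why the internal functions listed above were changed to take the config directly.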

Calibration data: support WikiText-2, custom datasets, and C4 quality filtering¶

Refactored onecomp/utils/calibration.py into onecomp/calibration/ folder with submodules
calibration_data_loader.py: unified entry point prepare_calibration_dataset() that dispatches by dataset name or file path
c4.py: C4 dataset loader with optional quality filtering (check_text_quality())
wikitext.py: WikiText-2 dataset loader (new; loads from Salesforce/wikitext)
custom.py: custom dataset loader supporting .txt, .json, .jsonl, .csv, .tsv, .parquet, .arrow, and HuggingFace Dataset directories
chunking.py: shared chunking strategies (concat_chunk, concat_chunk_align, concat_rand, drop_head, drop_rand) extracted as reusable helpers
Added calibration_dataset parameter to AutoBitQuantizer to specify the calibration data source (onecomp/quantizer/autobit/_autobit.py)
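
The changelog names the chunking strategies without defining them. The sketch below shows one plausible reading of two of them, operating on already-tokenized documents: concat_chunk as "concatenate everything, then cut non-overlapping windows" and concat_rand as "concatenate everything, then sample windows at random offsets". Both readings, the signatures, and the seed handling are assumptions, not the behaviour of chunking.py.

```python
import random


def concat_chunk(token_docs, max_length, num_samples):
    """Assumed reading: non-overlapping windows over the concatenated stream."""
    stream = [tok for doc in token_docs for tok in doc]
    last_start = len(stream) - max_length
    chunks = [stream[i:i + max_length] for i in range(0, last_start + 1, max_length)]
    return chunks[:num_samples]


def concat_rand(token_docs, max_length, num_samples, seed=0):
    """Assumed reading: windows at random offsets in the concatenated stream."""
    stream = [tok for doc in token_docs for tok in doc]
    if len(stream) < max_length:
        return []  # not enough tokens for a single window
    rng = random.Random(seed)
    starts = [rng.randrange(len(stream) - max_length + 1) for _ in range(num_samples)]
    return [stream[s:s + max_length] for s in starts]
```

Under this reading every returned sample has exactly max_length tokens, which is what a fixed-shape calibration batch needs.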

JointQ¶

Added incremental_lambda regularization mode (lambda_mode="incremental_lambda"): for each layer, tries increasing lambda values from lambda_list with warm start, accepting candidates that improve weight error without substantially degrading output error. Stops at the first rejection. Controlled by lambda_list, incremental_eps_y, incremental_eps_w parameters
Added incremental_initial_skip_ew_threshold to skip an unstable initial lambda=0.0 candidate when its relative weight error is excessively large
Added accepted_lambda field to JointQResult to record the per-layer lambda chosen in incremental mode
Added execute_post_processing override to log accepted lambda statistics (mean, median, min, max, per-value counts) after all layers are quantized
Added regularization_mode parameter: "identity" (standard Tikhonov λI) or "diagonal" (default, importance-aware λ·diag(a) where a_i scales with activation magnitude). Diagonal mode reduces over-regularization of less important columns. Only supported with fixed_lambda mode
Added regularization_gamma parameter (default 0.5): exponent for diagonal weights in "diagonal" mode; smaller values reduce the spread between weak and strong columns
Added initialization strategy control: enable_clip_optimize, enable_clip_optimize_ep, enable_gptq parameters to JointQ class
Added gptq attribute (GPTQ instance) to JointQ class for customizing GPTQ parameters (blocksize, percdamp, mse, q_grid, q_norm). Default GPTQ is auto-created from bits/group_size/symmetric
Replaced JointQ internal GPTQ module (jointq/core/gptq.py) with OneComp GPTQ (onecomp.quantizer.gptq.GPTQ); GPTQ initial solution is now generated via the shared GPTQ implementation
Improved numerical stability of scale optimization for ill-conditioned matrices
Fixed potential division-by-zero in scale computation
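
The control flow of the incremental lambda mode can be sketched as a loop over lambda_list that warm-starts each solve from the last accepted solution and stops at the first rejection. The acceptance rule below (weight error must drop by a relative eps_w, output error may grow by at most a relative eps_y, both measured against the last accepted candidate) is an assumption, as is the solve callback; the initial-candidate skip threshold is left out. This is not the implementation in onecomp/quantizer/jointq/.

```python
def incremental_lambda_search(solve, lambda_list, eps_y, eps_w):
    """Return (accepted_lambda, solution).

    solve(lam, warm_start) is a hypothetical callback returning
    (solution, output_error, weight_error) for one layer.
    """
    accepted_lambda = solution = err_y = err_w = None
    for lam in lambda_list:
        cand, cand_y, cand_w = solve(lam, solution)  # warm start from last accepted
        if accepted_lambda is not None:
            improves_w = cand_w <= err_w * (1.0 - eps_w)
            keeps_y = cand_y <= err_y * (1.0 + eps_y)
            if not (improves_w and keeps_y):
                break  # stop at the first rejected candidate
        accepted_lambda, solution, err_y, err_w = lam, cand, cand_y, cand_w
    return accepted_lambda, solution
```

In this sketch the first entry of lambda_list is always accepted as the baseline, so the function returns None for the lambda only when the list is empty.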

Breaking Changes¶

AutoBitQuantizer.enable_fused_groups now defaults to True (onecomp/quantizer/autobit/_autobit.py)
Ensures that vLLM fused layers (qkv_proj, gate_up_proj) are assigned the same quantizer (same bits and groupsize), which is required for vLLM inference.
Previously defaulted to False, which could cause vLLM to reject the model at load time when fused-layer constituents had mismatched configurations.
Runner.auto_run() already set enable_fused_groups=True, so this change has no effect on auto_run users.
Migration: If you use AutoBitQuantizer with candidate bit-widths not supported by vLLM (e.g. wbits=5), pass enable_fused_groups=False explicitly.
Quantisation levels unified to unsigned [0, 2^b − 1] (symmetric uses centred zero point); rounding order changed from round(x/s + z) to round(x/s) + z. Outputs are not bit-exact with prior RTN versions
Changed prepare_rotated_model defaults: rotation_mode "random" → "random_hadamard", num_calibration_samples 128 → 512
Introduced CalibrationConfig dataclass; see CalibrationConfig section above for migration details
JointQ class: removed batch_size parameter (use onecomp.quantizer.jointq.core.quantize() directly if batch processing is needed)
JointQ: GPTQ initial solution is now generated via OneComp GPTQ instead of the internal implementation. Quantization results are not bit-exact with prior versions (quality is equivalent or improved)
JointQ regularization defaults changed (onecomp/quantizer/jointq/_jointq.py)
regularization_lambda: 0.2 → 0.1
regularization_mode: "identity" → "diagonal"
Quantization results are not bit-exact with prior versions.
Migration: to reproduce the previous behavior, pass JointQ(..., regularization_lambda=0.2, regularization_mode="identity") explicitly.
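
The rounding-order change listed above, from round(x/s + z) to round(x/s) + z, only matters when x/s lands on a tie, but that is enough to break bit-exactness. The sketch below uses Python's built-in round, which rounds ties to the nearest even integer; the function names are hypothetical, and treating the zero point as an integer is an assumption made for the example.

```python
def quantize_old(x, scale, zero, bits):
    """Previous order: add the zero point first, then round."""
    maxq = 2**bits - 1
    return min(max(round(x / scale + zero), 0), maxq)


def quantize_new(x, scale, zero, bits):
    """New order: round x / scale first, then add the zero point."""
    maxq = 2**bits - 1
    return min(max(round(x / scale) + zero, 0), maxq)


# A tie: x / scale = 2.5 with an odd zero point of 3.
old_level = quantize_old(2.5, 1.0, 3, 4)  # round(5.5) ties to even -> 6
new_level = quantize_new(2.5, 1.0, 3, 4)  # round(2.5) ties to even -> 2, plus 3 -> 5
```

Away from ties the two orders agree whenever the zero point is an integer, which is why the difference shows up as occasional off-by-one levels rather than a systematic shift.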

Bug Fix¶

Fixed model_config.py: load_model() VLM fallback did not trigger for models raising "Unrecognized configuration class" (e.g. Cohere2VisionForConditionalGeneration). Added the error pattern to _vlm_hints
Fixed gptq/_gptq.py: Cholesky decomposition in run_gptq could fail with LinAlgError on ill-conditioned Hessians (observed on large VLMs at deeper layers). Extracted _compute_inverse_hessian() with progressive damping fallback (up to 5 retries, 10x damping increase per retry). No impact on normal operation
Fixed TypeError in QuantLinear.forward when S_qk scaling was applied to MLP layers (onecomp/pre_process/quant_models.py)
Fixed JointQ group_size=None (per-channel quantization) raising TypeError
Fixed wrong module grouping in make_grouped_module (qep/_quantize_with_qep_arch.py), where GC-driven id() reuse caused attention projections (q/k/v) and MLP projections (gate/up) to be merged into the same group
Fixed silent weight corruption in GPTQLinear when qzero=0 was stored through the GPTQ v1 zero-point path (onecomp/quantizer/gptq/gptq_layer.py)
Root cause: AutoGPTQ v1 stores raw_zero - 1, so qzero=0 becomes -1; without masking, its sign-extended bits corrupted neighboring packed slots
Pack-side fix (_pack_rows): mask each value with (1 << wbits) - 1 before shift/OR (2/4/8-bit and 3-bit paths)
Forward-side fix (GPTQLinear.forward): apply & wbits_mask after the v1 +1 restoration so stored 2^wbits - 1 wraps back to 0; the gptq_v2 path remains unchanged
Added regression tests for per-slot pack corruption, packed/unpacked forward paths, the gptq_v2 branch, and the from_saved_state path for GPTQ v1 tensors
NOTE: If you have GPTQ models quantized with previous versions, please re-quantize them with this release, as they may contain corrupted internal data.
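
The zero-point corruption described above can be reproduced in a few lines. The sketch packs eight 4-bit values into one emulated 32-bit word, once without and once with the (1 << wbits) - 1 mask, and then applies the forward-side +1 restoration. It is a pure-Python illustration, not the code in gptq_layer.py; the helper names and the sample values are made up.

```python
WBITS = 4
MASK = (1 << WBITS) - 1   # 0b1111
SLOTS = 32 // WBITS       # eight 4-bit slots per 32-bit word


def pack(values, mask_values):
    """Pack SLOTS small integers into one 32-bit word, lowest slot first."""
    word = 0
    for slot, value in enumerate(values):
        if mask_values:
            value &= MASK  # pack-side fix: drop sign-extended high bits
        word |= value << (slot * WBITS)
    return word & 0xFFFFFFFF  # emulate 32-bit storage


def unpack(word):
    return [(word >> (slot * WBITS)) & MASK for slot in range(SLOTS)]


raw_zeros = [3, 0, 7, 1, 0, 5, 2, 4]
stored = [z - 1 for z in raw_zeros]  # v1 convention: a raw zero of 0 becomes -1

corrupted = unpack(pack(stored, mask_values=False))
clean = unpack(pack(stored, mask_values=True))
# Forward-side fix: +1 restoration followed by the mask wraps 15 + 1 back to 0.
restored = [(s + 1) & MASK for s in clean]
```

Without the mask, the first -1 sets every higher bit of the word, so all later slots read back as 15; with the mask, each slot keeps its own four bits and restored equals raw_zeros.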

Examples¶

Added example/example_custom_calibration.py: Demonstrates CalibrationConfig with a custom calibration dataset (Python code snippets in example/data/python_calibration.txt). Quantizes TinyLlama with GPTQ 3-bit using both default C4 and custom Python-code calibration, then compares inference outputs across multiple prompts to show how calibration data choice affects quantization quality.
Added example/post_process/example_blockwise_ptq.py: GPTQ 4-bit quantization + BlockWisePTQ (Phase 1 greedy + Phase 2 CBQ) with PPL comparison
Added example/example_lpcd_gptq.py: TinyLlama GPTQ 3-bit (groupsize=128) + QEP + LPCD end-to-end example (residual-only closed-form refinement, original / dequantized / quantized perplexity reporting)
Updated example/vllm_inference/example_gptq_vllm_inference.py: changed model to TinyLlama-1.1B-Chat-v1.0 (chat model), disabled QEP, added CalibrationConfig(num_calibration_samples=128, max_length=512)
Added NOTE comments across all example/ scripts (example_gptq.py, example_qep_gptq.py, example_jointq.py, example_autobit.py, example_save_load.py, example_lpcd_gptq.py, example_custom_calibration.py, post_process/example_blockwise_ptq.py, post_process/example_lora_sft.py, post_process/example_lora_sft_knowledge.py, pre_process/example_preprocess_save_load.py, vllm_inference/example_gptq_vllm_inference.py) clarifying that the compact CalibrationConfig(max_length=512, num_calibration_samples=128) settings are demo-only and recommending the CalibrationConfig() defaults (max_length=2048, num_calibration_samples=512) for real quantisation; qep=False examples additionally recommend setting batch_size so that Runner.quantize_with_calibration_chunked is used with large calibration data

Documentation¶

Updated docs/algorithms/jointq.md: added incremental lambda mode description, acceptance criteria, diagonal regularization mode, new parameters (lambda_mode, lambda_list, incremental_eps_y, incremental_eps_w, incremental_initial_skip_ew_threshold, regularization_mode, regularization_gamma), and usage examples
Added docs/algorithms/lpcd.md: LPCD overview, motivation, supported submodule groups, usage examples, QEP relationship, parameters, and current support
Added docs/api/lpcd_config.md and updated mkdocs.yml navigation to include LPCD in the Algorithms and API Reference sections
Updated LPCD references across docs: docs/index.md, docs/algorithms/overview.md, docs/user-guide/basic-usage.md, docs/user-guide/configuration.md, docs/user-guide/examples.md, and docs/getting-started/quickstart.md
Updated docs/algorithms/rtn.md: corrected defaults, added MSE parameters, updated algorithm description
Updated docs/user-guide/pre-process.md: expanded key parameters table, added validation note
Added "Chat with Open WebUI" section to docs/user-guide/vllm-inference.md: step-by-step guide for connecting Open WebUI to a vLLM server (Docker / pip install with dedicated venv, connection settings, chat usage)
Added Open WebUI mention to README.md Features and vLLM Inference sections, and docs/index.md Key Features
Fixed broken example/example1.py references in README.md and docs/getting-started/installation.md (replaced with example/example_gptq.py)
Added example/example_custom_calibration.py to README.md Examples table under new "Calibration" category
Removed scratch files from example/: buf.py, buf2.py, run_example.sh
Fixed outdated install command in docs/user-guide/cli.md (git+URL → pip install onecomp) and added PyTorch prerequisite with link to installation guide
Removed duplicate "Chunked Calibration" section (was a copy of JointQ section) in docs/user-guide/examples.md
Added missing CalibrationConfig import to Block-wise PTQ code snippet in docs/user-guide/examples.md
Fixed eval support note in docs/getting-started/quickstart.md to include AutoBitQuantizer (was "GPTQ and DBF only")
Added algorithm pages: docs/algorithms/autobit.md (ILP-based mixed-precision) and docs/algorithms/jointq.md (joint optimization)
Added AutoBit and JointQ, with links to their algorithm pages, to the Available Algorithms table in docs/algorithms/overview.md
Added docs/api/quantizers/onebit.md (OneBit API reference)
Updated mkdocs.yml nav: added AutoBit/JointQ algorithm pages, OneBit API page; renamed Post-Process nav title to include Block-wise PTQ
Added example script links to docs/user-guide/pre-process.md
Updated docs/algorithms/jointq.md: added initialization strategy parameters, GPTQ customization examples, removed batch_size from parameter table, added group_size=None per-channel documentation
Added a Troubleshooting section to docs/user-guide/vllm-inference.md describing how to bypass the unconditional DeepGEMM (FP8) kernel warmup for non-FP8 quantization (GPTQ / DBF / Mixed-GPTQ) by setting VLLM_USE_DEEP_GEMM=0 and VLLM_DEEP_GEMM_WARMUP=skip

Tests¶

Updated test_jointq.py: added incremental_lambda boundary parameters (lambda_mode, lambda_list, incremental_eps_y, incremental_eps_w, incremental_initial_skip_ew_threshold), abnormal parameter cases (invalid mode, empty list, negative values), GPU integration tests (test_incremental_lambda_basic, test_incremental_lambda_single_step), _accept_candidate unit tests covering all acceptance rules and edge cases; removed batch_size from boundary/abnormal parameter tests
Added test_prepare_rotated_model.py: validation, E2E pipeline, output threshold (80 combinations), save/load round-trip
Added test_weight_quantizer.py: RTN/GPTQ consistency, symmetric/asymmetric, group-wise, MSE, STE
Expanded test_rtn.py: MSE boundary/abnormal parameters
Added vLLM mixed group-size tests (tests/vllm_plugins/gptq/test_mixed_gptq.py, tests/vllm_plugins/gptq/test_mixed_gptq_e2e.py)
Updated regression_quantize_helper.py: updated EXPECTED_MSE baseline for OneComp GPTQ integration

Benchmarking¶

Updated all benchmark directories for v1.1.0
Results will be added after benchmark runs complete.

Dependencies¶

Updated pyproject.toml and uv.lock: added hydra-core to dev optional dependencies
Added LPCD tests (tests/onecomp/lpcd/, 25 cases)
test_lpcd_config.py: LPCDConfig default / custom values, dataclass field set, top-level from onecomp import LPCDConfig (CPU only)
test_lpcd_metrics.py: make_lpcd_metrics() dispatch on synthetic Llama / Qwen3 blocks for every enable_* flag combination, NotImplementedError for unsupported architectures, LpcdMetricGroup.mark_as_ready / is_refineable state transitions (CPU only, no weight download)
test_lpcd_runner.py: end-to-end GPTQ + QEP + LPCD on the first TinyLlama decoder block — smoke (Runner.run() completes, all linear layers quantized, dequantized weights finite), QEP + LPCD combination with explicit QEPConfig, behavioural checks (residual-only LPCD modifies o_proj / down_proj beyond the QEP-only baseline while pre-attention q/k/v_proj match the baseline bit-for-bit); auto-skipped on non-CUDA hosts via pytest.mark.skipif

Model Validation¶

Added model_validation/README.md: parent overview of the operational validation suite that exercises OneComp's end-to-end quantize → save → load → inference workflow across multiple architectures and sizes. Provides cross-recipe At-a-Glance status tables (per-model × per-recipe quantization status, and per-model × per-recipe (save, transformers inference, vllm inference) status), per-recipe result tables, and a Summary section that explicitly cautions against cross-recipe PPL comparison (PPL is reported only as a per-recipe sanity check; the compact calibration in use sits below typical research settings, partly because the calibration size has to fit the DGX Spark 128 GB UMA budget for 7–8B models with QEP on).
Added model_validation/gptq/: Hydra-driven GPTQ (wbits=4, groupsize=128, qep=False) end-to-end validation across three phases:
Phase 1 — quantize + save (validate_gptq.py, conf/validate.yaml): CalibrationConfig(max_length=512, num_calibration_samples=128), saved via runner.save_quantized_model(...), reports original / quantized PPL on wikitext-2-raw-v1.
Phase 2 — load + greedy generation (validate_load.py, conf/validate_load.yaml): reloads each saved directory via load_quantized_model and runs "Fujitsu is" with max_new_tokens=32. torch_dtype is overridable (float16 / bfloat16 / float32 / null); gemma-4-E2B requires bfloat16 because the loader's default float16 triggers a Half / BFloat16 mismatch at lm_head.
Phase 3 — vLLM offline inference (validate_vllm.py, conf/validate_vllm.yaml): reloads each saved directory via vLLM's offline LLM interface (OneComp's vLLM plugin is auto-registered, so no explicit quantization= argument is required) and runs "Fujitsu is" with temperature=0.0, max_tokens=32, enforce_eager=True, max_model_len=512. LLM(...) and llm.generate(...) are kept inside main() behind if __name__ == "__main__": so vLLM worker subprocess re-imports do not recursively spawn new engines. Currently pending for all five models.
Added model_validation/qep_gptq/: Hydra-driven GPTQ (wbits=4, groupsize=128, qep=True) end-to-end validation script (validate_gptq.py, conf/validate.yaml, README.md). Calibration: max_length=1024, num_calibration_samples=128 (reduced from defaults to keep 7–8B models within the DGX Spark 128 GB UMA budget with QEP on). Quantize + save only; load / inference is not exercised in this subdirectory yet.
Added model_validation/autobit/: Hydra-driven AutoBit (target_bit=4, qep=False) end-to-end validation script (validate_autobit.py, conf/validate.yaml, README.md). Candidates GPTQ(wbits=b, groupsize=128) for b in (2, 3, 4, 8), assignment_strategy="activation_aware", CalibrationConfig(max_length=512, num_calibration_samples=128). Quantize + save only.
Updated model_validation/autobit_qep/: AutoBit (target_bit=4, qep=True) validation. Reduced calibration to max_length=1024, num_calibration_samples=128 to keep 7–8B models within the DGX Spark 128 GB UMA budget. README expanded with per-model bit-assignment counts (GPTQ_<b>_gs128: <count> layers) for TinyLlama-1.1B, Llama-2-7B, Llama-3-8B, Qwen3-8B, and gemma-4-E2B; documents the bimodal 8-bit / 2-bit ILP collapse on gemma-4-E2B (every module in the first 15 transformer blocks → 8-bit, remaining 20 blocks → 2-bit; quantized PPL diverges to ~10^14, reproduced after reducing calibration from max_length=2048, num_calibration_samples=512) and lists candidate follow-ups (restrict candidate set, disable QEP, switch assignment_strategy).
Added model_validation/jointq/: Hydra-driven JointQ (bits=4, group_size=128, symmetric=True, qep=False) end-to-end validation script (validate_jointq.py, conf/validate.yaml, README.md). Calibration: CalibrationConfig(max_length=512, num_calibration_samples=128). JointQ does not currently expose a quantized-inference layer (no save_quantized_model / create_quantized_model path), so quality is sanity-checked on the dequantized model (weights reconstructed from JointQ's quantization parameters); save / inference are reported as n/a in the parent At-a-Glance table.
All five recipes share the same model selection contract: a single model selected via either model_id (Hugging Face Hub) or model_path (local directory), with any field in conf/validate*.yaml overridable on the command line. Default validation set across all recipes is TinyLlama-1.1B, gemma-4-E2B (base), Llama-2-7B, Llama-3-8B, and Qwen3-8B.

Packaging¶

Bumped minimum transformers requirement from >=5.3.0 to >=5.5.0 (pyproject.toml)
Added cu130 optional-dependency extra and the pytorch-cu130 wheel index (https://download.pytorch.org/whl/cu130) for CUDA 13 hosts (e.g. NVIDIA B200) (pyproject.toml)
Pinned the vllm extra to vllm>=0.10 to prevent uv from falling back to legacy versions whose source build requires CUDA_HOME (pyproject.toml)
Added uv conflicts declarations between the vllm extra and the cpu / cu118 / cu121 / cu124 / cu126 / cu128 extras: vLLM >=0.20 requires torch>=2.10, which is only published for cu130. This forces vllm to be installed only with --extra cu130 and prevents silent fallback to a vllm version incompatible with transformers>=5 at runtime (pyproject.toml)
Restricted tool.uv.environments to sys_platform == 'linux' and python_full_version >= '3.12', < '3.14' to skip lock splits for unused Windows and out-of-range Python versions (pyproject.toml)
Added hydra extra to pyproject.toml so hydra-core (used by example/example_autobit.py and the model_validation/{gptq,qep_gptq,autobit,autobit_qep,jointq}/validate_*.py scripts) installs in one step via uv sync --extra <cuXXX> --extra hydra or pip install "onecomp[hydra]", instead of a separate pip install hydra-core after sync. Documented the new extra in README.md and the model_validation/*/README.md files. The model_validation/gptq/ Phase 3 (vLLM inference) additionally requires the vllm extra (uv sync --extra <cuXXX> --extra hydra --extra vllm or pip install "onecomp[hydra]" vllm).

[v1.0.2] 2026-03-31¶

Bug Fix¶

Fixed ImportError when running onecomp CLI without matplotlib installed; AutoBitQuantizer._visualize() now catches the import error and logs a warning instead of crashing
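
The general pattern behind this fix is to import the optional dependency only at the point of use and to degrade to a warning when it is missing. The sketch below shows that pattern with the standard library only; the function names, the log message, and the return values are assumptions, not the body of AutoBitQuantizer._visualize().

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_import(module_name):
    """Return the imported module, or None (with a warning) if unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed; skipping visualization", module_name)
        return None


def visualize(bit_assignment, plotting_module="matplotlib.pyplot"):
    """Return True if plotting could run, False if the dependency is missing."""
    plt = try_import(plotting_module)
    if plt is None:
        return False
    # Plotting of bit_assignment with plt would go here.
    return True
```

Because the import happens inside the function, importing the package itself never requires the plotting library.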

[v1.0.1] 2026-03-31¶

Packaging¶

Moved matplotlib from dev extra to new visualize extra in pyproject.toml
Made visualize_bit_assignment import lazy in onecomp/quantizer/autobit/__init__.py to avoid requiring matplotlib at import time
Updated installation instructions in README.md and docs/getting-started/installation.md to reflect the new visualize extra
Updated uv.lock

[v1.0.0] 2026-03-31¶

PyPI Publishing Setup¶

Added PyPI metadata to pyproject.toml: keywords, classifiers, and project.urls (Homepage, Documentation, Repository, Bug Tracker, Changelog)
Removed gemlite optional-dependency extra that used direct git URLs (PEP 440 violation); equivalent packages are already in main dependencies
Added .github/workflows/publish.yml: automated PyPI publishing via Trusted Publishers (OIDC) on GitHub Release
Updated README.md: installation command changed from pip install git+<URL> to pip install onecomp
Added dist/ and build/ to .gitignore
Updated uv.lock

Default Parameter Changes¶

Changed Runner.__init__ default values for calibration parameters:
max_length: 512 → 2048
num_calibration_samples: 128 → 512
Pinned old default values explicitly in all example/ and tests/ files that previously relied on the defaults

Documentation¶

Updated docs/user-guide/configuration.md to reflect the new default values for max_length and num_calibration_samples
Added quantizer feature support table to docs/user-guide/basic-usage.md and docs/api/quantizers/base.md
Documents which quantizers support save_quantized_model() / create_quantized_model() and quantized-model PPL/ACC evaluation
Currently supported: GPTQ, DBF, AutoBitQuantizer (requires get_quant_config() and create_inference_layer())
Unsupported quantizers (RTN, JointQ, QUIP, CQ, ARB, QBB, Onebit): PPL/ACC evaluation automatically falls back to the dequantized (FP16) model
Updated the perplexity/accuracy evaluation note in basic-usage.md to reflect AutoBitQuantizer support and fallback behavior

[v0.5.0] 2026-03-30¶

New Feature: Post-quantization Workflow¶

Added PostQuantizationProcess abstract base class (onecomp/post_process/_base.py)
Defines the interface for post-quantization operations (e.g. block-wise PTQ, fine-tuning)
Added post_processes parameter to Runner.__init__
Accepts a list of PostQuantizationProcess instances
After quantization, builds a quantized model on CPU and executes each process in order
The processed model is stored as self.quantized_model
Updated Runner.calculate_perplexity and Runner.calculate_accuracy to use self.quantized_model if available (GPU transfer is handled automatically; device="auto" is resolved to "cuda")
Added LoRA SFT post-process implementation (onecomp/post_process/post_process_lora_sft.py)
Provides learning-based post-quantization fine-tuning for GPTQ-quantized models
Public API is exposed as PostProcessLoraSFT
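
The workflow above amounts to an abstract base class plus an in-order loop over its instances. The sketch below illustrates that shape with stand-in classes: the class names, the run(model) signature, and the convention that run returns the processed model are assumptions, not the interface in onecomp/post_process/_base.py.

```python
from abc import ABC, abstractmethod


class PostProcessSketch(ABC):
    """Stand-in for an abstract post-quantization step (illustration only)."""

    @abstractmethod
    def run(self, model):
        """Process the quantized model and return the result."""


class AppendTag(PostProcessSketch):
    """Toy step that records its tag on a list standing in for a model."""

    def __init__(self, tag):
        self.tag = tag

    def run(self, model):
        return model + [self.tag]


def apply_post_processes(model, post_processes):
    """Run each process in list order, feeding each result to the next."""
    for process in post_processes:
        model = process.run(model)
    return model
```

Order matters in this design: each process sees the output of the one before it, so a fine-tuning step placed after a block-wise step starts from the already-adjusted model.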

New Feature: Rotation Preprocessing Pipeline (`onecomp/pre_process/`)¶

SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Supports Llama and Qwen3 architectures.

Added prepare_rotated_model() (onecomp/pre_process/prepare_rotated_model.py): End-to-end pipeline — model loading → rotation/scaling training → rotation application → saving
Memory-optimized: moves model between CPU/GPU to reduce peak memory (e.g. Qwen3-32B: ~128GB → ~64GB)
Added RotatedModelConfig (onecomp/rotated_model_config.py): ModelConfig subclass that automatically registers Hadamard hooks on down_proj layers during load_model()
Added onecomp/pre_process/ package:
train_rotation.py: Training pipeline with PreprocessManager (R1/R2/S_* tensor management), HF Trainer subclass, apply_preprocess_train / apply_preprocess_eval
optimizer.py: SGDG — SGD on the Stiefel manifold with Cayley-retraction orthogonal updates (ported from SpinQuant)
quant_models.py: WeightQuantizer (RTN proxy) with per-channel / per-tensor / group-wise quantization; quantized decoder layers for Llama and Qwen3
rotation_utils.py: fuse_layer_norms, rotate_model, register_online_hadamard_hooks
hadamard_utils.py: Hadamard transform utilities and pre-computed matrices (ported from QuIP#)
modeling_llama.py / modeling_qwen3.py: Custom ForCausalLM classes that propagate R1 through the forward pass during training
preprocess_args.py: TrainingArguments subclass with SGDG-specific LR/momentum fields
Fixed _PreprocessTrainer to override create_optimizer() instead of create_optimizer_and_scheduler() for transformers >= 5.x compatibility (SGDG optimizer was silently replaced by AdamW)
Updated Runner.save_dequantized_model() and Runner.save_quantized_model() to warn when saving models loaded with additional preprocessing (e.g., Hadamard hooks)

Added JointQ Quantizer¶

Added new JointQ quantizer (onecomp/quantizer/jointq/)
Local-search-based post-training quantization method that minimizes \|Y - \hat{W} X^T\|_F^2
Supports both symmetric and asymmetric quantization (1–4 bits)
Group-wise quantization with configurable group size
Tikhonov regularization to mitigate over-fitting (X^T X + nλI)
Three initialization strategies: Clip-Optimize, Clip-Optimize with Error Propagation, and GPTQ

AutoBitQuantizer vLLM-compatible quantization_config¶

AutoBitQuantizer now emits mixed_gptq-compatible quantization_config (onecomp/quantizer/autobit/_autobit.py)
ILP solver now enforces fused-layer equality constraints (onecomp/quantizer/autobit/ilp.py)
vLLM fuses q/k/v → qkv_proj and gate/up → gate_up_proj

API changes¶

Made Runner.create_quantized_model() a public method (renamed from _create_quantized_model)
Builds a quantized model with quantized inference layers from quantizer.results
Returns (model, tokenizer) for use in evaluation, saving, and post-process workflows
Added Runner.save_quantized_model_pt() for saving post-processed models (e.g. LoRA-applied) as PyTorch .pt files
Uses torch.save to preserve custom module types such as LoRAGPTQLinear
Saves tokenizer files alongside the model
Added QuantizedModelLoader.load_quantized_model_pt() for loading .pt-format models
Counterpart to save_quantized_model_pt; uses torch.load to restore models with custom modules
Also available as onecomp.load_quantized_model_pt() convenience alias

Bug Fix: Onebit Quantizer¶

Fixed Onebit to declare flag_calibration=True and flag_hessian=True (onecomp/quantizer/onebit/_onebit.py)
Previously, Onebit computed the Hessian internally from input despite declaring all flags as False, causing a crash when used through quantize_without_calibration or chunked quantization paths
Now uses the Hessian provided by the Runner, consistent with other calibration-based quantizers (GPTQ, DBF, QUIP)

Quantizer Signature Consistency¶

Added input=None default to quantize_layer in RTN, CQ, QBB (onecomp/quantizer/{rtn,cq,qbb}/)
Aligns with the base Quantizer.quantize_layer(self, module, input=None, hessian=None) signature
Enables these quantizers to be used in Runner(quantizers=[...]) via the chunked quantization path
Added input=None, hessian=None defaults to Onebit.quantize_layer for the same reason

Examples¶

Added example/post_process/example_lora_sft.py: End-to-end demo — GPTQ 4-bit quantization + LoRA SFT (WikiText-2) + PPL evaluation + save/load with save_quantized_model_pt / load_quantized_model_pt
Added example/post_process/example_lora_sft_knowledge.py: Knowledge injection demo — teaches the quantized model about "OneCompression" via LoRA SFT and compares generation before/after
Added example/post_process/onecomp_knowledge.jsonl: Training data describing OneCompression for the knowledge injection example
Added example/example_jointq.py: JointQ 4-bit (groupsize=128) quantization example with dequantized model PPL evaluation
Added example/pre_process/example_llama_preprocess_rtn.py: Rotation preprocessing + RTN quantization (TinyLlama-1.1B)
Added example/pre_process/example_preprocess_save_load.py: Rotation preprocessing + GPTQ quantization → save → load → PPL verification
Added example/vllm_inference/example_gptq_vllm_inference.py: GPTQ + QEP quantization and vLLM inference end-to-end example
Added example/vllm_inference/example_autobit_vllm_inference.py: AutoBit mixed-precision quantization and vLLM inference example

Documentation¶

Added docs/user-guide/post-process.md: LoRA SFT user guide covering accuracy recovery, knowledge injection, save/load, key parameters, teacher distillation, intermediate block alignment, and vLLM limitations
Added docs/api/post_process.md: API reference for PostQuantizationProcess, PostProcessLoraSFT, and convenience variants
Updated docs/user-guide/examples.md with LoRA SFT code examples (accuracy recovery, knowledge injection, save/load)
Updated docs/api/runner.md to include create_quantized_model and save_quantized_model_pt
Updated docs/api/quantized_model_loader.md to include load_quantized_model_pt
Updated mkdocs.yml navigation with new post-process pages
Added docs/user-guide/pre-process.md: Rotation preprocessing user guide covering workflow, key parameters, save/load, and limitations
Added docs/api/pre_process.md: API reference for prepare_rotated_model and RotatedModelConfig
Updated docs/user-guide/examples.md with rotation preprocessing code examples (RTN, GPTQ with save/load)
Updated docs/api/index.md with RotatedModelConfig, prepare_rotated_model, and pre_process/ module structure
Updated docs/index.md Key Features with rotation preprocessing

Tests¶

Added smoke test for PostProcessLoraSFT (tests/onecomp/post_process/test_post_process_lora_sft.py)
Verifies that PostProcessLoraSFT.run() completes without error on TinyLlama with minimal settings
Checks LoRA layer injection, CPU placement, and eval mode after run
Includes Runner end-to-end integration test with post_processes parameter
Expanded and updated unit tests for DBF quantizer (tests/onecomp/quantizer/dbf/test_dbf.py)
Extended boundary and abnormal parameter cases; aligned with BaseQuantizeSpec and current DBF API
Expanded and updated unit tests for GPTQ quantizer (tests/onecomp/quantizer/gptq/test_gptq.py)
Extended boundary and abnormal parameter cases; aligned with BaseQuantizeSpec and current GPTQ API
Adjusted DBF and GPTQ quantizer implementations for test compatibility and consistency (onecomp/quantizer/dbf/_dbf.py, onecomp/quantizer/gptq/_gptq.py)
Fixed and improved JointQ unit tests (tests/onecomp/quantizer/jointq/test_jointq.py)
Use compute_dequantized_weight() instead of direct dequantized_weight access
Override boundary test to use CUDA with 128×128 layers for group_size compatibility
Skip CPU-only tests (JointQ is GPU-based)
Fix batch_size validation: >= 0 → >= 1 (onecomp/quantizer/jointq/_jointq.py)
Improved JointQ regression test (tests/onecomp/quantizer/jointq/test_quantize_regression.py)
Replaced exact tensor match with MSE-based quality check for environment portability
Hardcoded expected MSE in helper; removed .pth baseline file
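
The move from an exact tensor match to an MSE-based check can be illustrated with a minimal plain-Python sketch. This is not the project's test code: `mean_squared_error`, `passes_quality_check`, the sample weights, and the thresholds are hypothetical and chosen only for the example.

```python
def mean_squared_error(reference, candidate):
    """Mean squared error between two equally sized, non-empty flat sequences."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("inputs must be non-empty and have the same length")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def passes_quality_check(original, dequantized, max_mse):
    """True when the reconstruction error stays at or below the threshold."""
    return mean_squared_error(original, dequantized) <= max_mse


# Hypothetical values for illustration only (not taken from the real test).
original = [0.5, -1.0, 0.25, 2.0]
dequantized = [0.5, -1.0, 0.25, 1.5]

# An exact comparison fails on any small numerical difference,
# while a threshold tolerates environment-dependent rounding.
exact_match = original == dequantized
within_tolerance = passes_quality_check(original, dequantized, max_mse=0.1)
```

A threshold of this kind trades bit-exact reproducibility for portability across GPUs and library versions, which is the motivation stated in the entry above.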

[v0.4.3] 2026-03-26¶

Implement AutoBit to automatically determine bit-allocation¶

Added AutoBitQuantizer (onecomp/quantizer/autobit/_autobit.py), which automatically assigns an optimal bit-width to each module by solving an ILP that accounts for activation-aware error (onecomp/quantizer/autobit/ilp.py), with a DBF fallback (onecomp/quantizer/autobit/dbf_fallback.py) for ultra-low-bit targets (target bit-width <= 2)
The ILP is solved with the SCIP solver (onecomp/quantizer/autobit/ilp.py)
Each layer is loaded and forwarded sequentially to collect activation and curvature statistics (onecomp/quantizer/autobit/activation_stats.py, onecomp/utils/blockwise.py)
A usage example is provided in example/example3.py
Added a VRAM auto-estimation utility that derives the target bit-width from available GPU memory (onecomp/utils/vram_estimator.py)

VLM and Multi-Architecture Support for Architecture-aware QEP¶

Extended _get_blocks to detect language_model sub-module and restrict block search to the text decoder (onecomp/qep/_quantize_with_qep_arch.py)
VLMs (Qwen3-VL, Gemma3, etc.) no longer return vision-encoder blocks
CausalLM behaviour is unchanged (falls back to full-model search)
Added __getattr__ proxy to Catcher to forward attribute access to the wrapped module (onecomp/qep/_quantize_with_qep_arch.py)
Prevents AttributeError when model code reads decoder-layer attributes (e.g. attention_type) before forward()
Changed get_blocks_and_inputs to capture block-level kwargs with batch=1 (onecomp/qep/_quantize_with_qep_arch.py)
Internally generated kwargs (position_embeddings, attention_mask, etc.) are now batch-size-independent
Avoids shape mismatches when reused with varying batch sizes in downstream functions
Added expand_kwargs_batch helper to expand batch=1 kwargs via Tensor.expand (zero-copy view) (onecomp/qep/_quantize_with_qep_arch.py)
Used in compute_hessian_and_crossterm and forward_input before each block forward call
Resolves failures on models requiring exact batch-dimension matching (e.g. Gemma3 sliding-window attention)
Added early termination and group skipping to run_quantize_with_qep_arch (onecomp/qep/_quantize_with_qep_arch.py)
Groups with no quantization targets are skipped (avoids unnecessary Hessian/cross-term computation)
Block loop exits once all target layers are quantized
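
The `__getattr__` proxy described above can be sketched in plain Python as follows. This is an illustration of the pattern only, not the code in `_quantize_with_qep_arch.py`: the real `Catcher` wraps a PyTorch module, whereas the `Catcher` and `DecoderLayer` classes here are simplified stand-ins, and the attribute value `"sliding_attention"` is a made-up example.

```python
class Catcher:
    """Plain-Python stand-in for a wrapper that records a block's inputs."""

    def __init__(self, module):
        self.module = module
        self.captured = []

    def __call__(self, *args, **kwargs):
        # Record the inputs, then delegate to the wrapped block.
        self.captured.append((args, kwargs))
        return self.module(*args, **kwargs)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so the
        # wrapper's own attributes (module, captured) are never shadowed.
        if name == "module":
            # Avoid infinite recursion if 'module' has not been set yet.
            raise AttributeError(name)
        return getattr(self.module, name)


class DecoderLayer:
    """Hypothetical decoder layer exposing an attribute that model code reads."""

    attention_type = "sliding_attention"  # illustrative value

    def __call__(self, x):
        return x


layer = Catcher(DecoderLayer())
kind = layer.attention_type  # forwarded to the wrapped DecoderLayer
```

Without the proxy, `layer.attention_type` would raise `AttributeError` because the wrapper itself defines no such attribute.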

End-to-end CLI tests¶

Added tests/onecomp/test_cli.py: end-to-end tests that verify onecomp TinyLlama/... CLI runs without errors
test_default_full_run: full default pipeline (AutoBit + QEP + eval + save) on GPU
Variant tests for individual options (--wbits, --no-qep, --total-vram-gb, --groupsize, --save-dir, etc.) on CPU
Variant tests are skipped by default; enable with RUN_CLI_VARIANT_TESTS=1
Uses python -m onecomp to avoid implicit uv sync that could modify the environment

Fixes¶

Fixed crash when DBF quantization fails with NaN/Inf (onecomp/quantizer/dbf/_dbf.py, onecomp/qep/_quantize_with_qep_arch.py)
_quantize_with_qep_arch.py: Catch ValueError/NotImplementedError from compute_dequantized_weight(), log the error, and keep QEP-adjusted weights for the failed layer
Fixed GemLite import crash when PyTorch version is incompatible (onecomp/quantizer/gemlite.py)
Broadened except ImportError to except (ImportError, AttributeError) so that GemLite gracefully falls back when torch lacks newer dtypes (e.g. float8_e8m0fnu)
Fixed test_dbf_gemlite.py to skip when GemLite is unavailable instead of crashing (tests/vllm_plugins/dbf/test_dbf_gemlite.py)
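
The broadened exception handling can be sketched as below. This is a generic illustration of the optional-import pattern, not the code in `onecomp/quantizer/gemlite.py`: `load_optional_backend` and both importer functions are hypothetical, and the missing dtype is simulated with a plain namespace object rather than a real PyTorch installation.

```python
import types


def load_optional_backend(importer):
    """Return (backend, available).

    Falls back when the import itself fails (ImportError) or when the
    package reads an attribute that an older dependency lacks
    (AttributeError) while it is being imported.
    """
    try:
        return importer(), True
    except (ImportError, AttributeError):
        return None, False


def import_with_missing_dtype():
    # Simulates a package that reads a dtype an older torch does not define.
    old_torch = types.SimpleNamespace(float16="float16")
    return old_torch.float8_e8m0fnu  # raises AttributeError


def import_missing_package():
    # Simulates a package that is not installed at all.
    import a_package_that_is_not_installed_onecomp_demo  # raises ImportError

    return a_package_that_is_not_installed_onecomp_demo


dtype_result = load_optional_backend(import_with_missing_dtype)
missing_result = load_optional_backend(import_missing_package)
```

Catching only `ImportError` would let the first case propagate and crash at import time, which matches the failure described in the entry above.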

Dependency and documentation updates¶

Added vllm as an optional dependency (--extra vllm) in pyproject.toml
Prevents environment corruption caused by uv pip install vllm being overwritten by subsequent uv sync/uv run
Added torchvision to CUDA extras and [tool.uv.sources] in pyproject.toml to prevent CUDA version mismatch
Updated installation docs to reflect new extras (README.md, docs/getting-started/installation.md, docs/user-guide/vllm-inference.md)
Updated uv.lock

[v0.4.2] 2026-03-25¶

Unit tests for additional quantizers¶

Added unit tests for QBB, RTN, QUIP, ONEBIT, CQ, ARB, and JOINTQ
New test modules under tests/onecomp/quantizer/: test_qbb.py, test_rtn.py, test_quip.py, test_onebit.py, test_cq.py, test_arb.py, test_jointq.py
Shared test base and helpers updated in tests/onecomp/quantizer/test_module.py
Quantizer implementations adjusted for test compatibility: onecomp/quantizer/qbb/, onecomp/quantizer/rtn/, onecomp/quantizer/quip/, onecomp/quantizer/onebit/, onecomp/quantizer/arb/, onecomp/quantizer/jointq/ (and related *_impl.py); minor updates in onecomp/quantizer/dbf/_dbf.py, onecomp/quantizer/gptq/_gptq.py

vLLM plugin integration (DBF, Mixed-GPTQ)¶

Added vLLM plugin implementation for DBF and Mixed-GPTQ
New vllm_plugins package: vllm_plugins/__init__.py, DBF and GPTQ plugin entry points (vllm_plugins/dbf/, vllm_plugins/gptq/)
DBF: vllm_plugins/dbf/vllm_plugin.py and modules (vllm_plugins/dbf/modules/gemlite_linear.py, vllm_plugins/dbf/modules/naive.py); shared utilities in vllm_plugins/utils/module.py
GPTQ: vllm_plugins/gptq/vllm_plugin.py for Mixed-GPTQ inference
Tests: tests/vllm_plugins/dbf/test_dbf_gemlite.py, tests/vllm_plugins/dbf/test_dbf_naive.py
Package and dependency wiring in pyproject.toml

Fixes¶

Mixed-GPTQ: raise an error when quantization bit widths differ within the same shard (align with DBF behavior) (vllm_plugins/gptq/vllm_plugin.py)

[v0.4.1] 2026-03-19¶

Mixed GPTQ/DBF Save/Load¶

Extended Save/Load for mixed GPTQ and mixed DBF
QuantizedModelLoader now loads models with quant_method mixed_gptq or mixed_dbf (onecomp/quantized_model_loader.py)
effective_method treats mixed_* as the same tensor format as the base method (gptq/dbf) and resolves per-layer bit-width via quantization_bits
Load validates quant_method, quantization_bits, and modules_in_block_to_quantize from config.json's quantization_config
GPTQ
Added get_quant_config() to return save-time quantization_config with vLLM-compatible keys (onecomp/quantizer/gptq/_gptq.py)
Sets quant_method to mixed_gptq when module_wbits or mlp_wbits is present
New onecomp/quantizer/gptq/config.py: resolve_gptq_layer_wbits() resolves per-layer bit-width from quantization_config (priority: quantization_bits → module_wbits → mlp_wbits → bits/wbits)
GPTQLinear: extended to accept bit-width when restoring from saved state (onecomp/quantizer/gptq/gptq_layer.py)
DBF
Added get_quant_config() to return save-time quantization_config (onecomp/quantizer/dbf/_dbf.py)
New onecomp/quantizer/dbf/config.py: resolve_dbf_layer_bits() resolves per-layer bit-width from quantization_config (priority: quantization_bits → module_target_bits → mlp_target_bits → bits)
DoubleBinaryLinear: added argument for target bit-width (for mixed_dbf) (onecomp/quantizer/dbf/dbf_layer.py)
Shared
onecomp/utils/quant_config.py: added common helper get_quant_param() for quantization_config schema (fetch params by alias keys)
Quantizer.finalize_quant_config_for_save() hook added; subclasses (GPTQ/DBF) inject method-specific metadata (onecomp/quantizer/_quantizer.py)
runner: set quantization_config when saving (onecomp/runner.py)
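
The priority order used when resolving a per-layer bit-width (quantization_bits → module_wbits → mlp_wbits → bits/wbits) can be sketched roughly as follows. This is not the implementation of `resolve_gptq_layer_wbits()` or `get_quant_param()`: the helper names are hypothetical, and the schema is an assumption made for the example, namely that each per-layer source is a dict mapping a layer name to a bit-width.

```python
def first_present(config, aliases, default=None):
    """Return the value of the first alias key found in config."""
    for key in aliases:
        if key in config:
            return config[key]
    return default


def resolve_layer_bits(config, layer_name):
    """Resolve a layer's bit-width, highest-priority source first."""
    # ASSUMPTION: each per-layer source maps layer name -> bit-width.
    for source in ("quantization_bits", "module_wbits", "mlp_wbits"):
        table = config.get(source) or {}
        if layer_name in table:
            return table[layer_name]
    # Global fallback: accept either alias for the default bit-width.
    bits = first_present(config, ("bits", "wbits"))
    if bits is None:
        raise KeyError(f"no bit-width configured for {layer_name!r}")
    return bits


# Hypothetical configuration for illustration only.
config = {
    "quantization_bits": {"layer_a": 2},
    "module_wbits": {"layer_a": 3, "layer_b": 3},
    "wbits": 4,
}
resolved = {name: resolve_layer_bits(config, name)
            for name in ("layer_a", "layer_b", "layer_c")}
```

The first source that mentions a layer wins, so `layer_a` takes its value from `quantization_bits` even though `module_wbits` also lists it.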

Evaluation and benchmark (Runner and accuracy utils)¶

Runner: unified perplexity/accuracy evaluation via _calculate_evaluation() and added optional dequantized_model evaluation (onecomp/runner.py)
BREAKING: calculate_perplexity() / calculate_accuracy() now return a 3-tuple (original, dequantized, quantized) instead of 2-tuple (original, quantized). Existing code using orig, quant = runner.calculate_perplexity() must be updated to unpack three values. (onecomp/runner.py)
BREAKING: calculate_perplexity() / calculate_accuracy() default for original_model changed from True to False. To evaluate the original model, pass original_model=True explicitly. (onecomp/runner.py)
Benchmark: benchmark_perplexity() / benchmark_accuracy() now accept dequantized_model and quantized_model arguments. When dequantized_model=True, the result dict includes "{name}_dequantized" keys. (onecomp/runner.py)
lm_eval: added helper to create HFLM while temporarily disabling model.config.quantization_config for compatibility (onecomp/utils/accuracy.py)
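
The effect of the two breaking changes on call sites can be sketched with a stand-in function. The function below is not `Runner.calculate_perplexity()`: it only mimics the documented return shape. The numbers are dummy placeholders rather than measurements, and returning `None` for a model that was not evaluated, as well as the defaults of `dequantized_model` and `quantized_model`, are assumptions of this sketch.

```python
def calculate_perplexity(original_model=False, dequantized_model=False,
                         quantized_model=True):
    """Stand-in that returns (original, dequantized, quantized)."""
    # Dummy placeholder values, not real perplexities.
    dummy = {"original": 10.0, "dequantized": 11.0, "quantized": 11.5}
    return (
        dummy["original"] if original_model else None,        # assumed None when skipped
        dummy["dequantized"] if dequantized_model else None,  # assumed None when skipped
        dummy["quantized"] if quantized_model else None,
    )


# Old call sites unpacked two values and now fail with ValueError:
#     orig, quant = calculate_perplexity()

# Updated call site: unpack three values and request the original explicitly.
orig, dequant, quant = calculate_perplexity(original_model=True)
```

Call sites that do not need the dequantized result can bind it to `_` to keep the unpacking explicit.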

Dequantized-weight API and compatibility fixes¶

Implemented compute_dequantized_weight() for GPTQ and DBF quantizers (onecomp/quantizer/gptq/_gptq.py, onecomp/quantizer/dbf/_dbf.py)
Removed dequantized_weight from Result classes and switched call sites to compute it via compute_dequantized_weight() (onecomp/quantizer/_quantizer.py, onecomp/runner_methods/*)
Fixed compatibility for quantization methods other than DBF/GPTQ in runner and QEP paths (onecomp/runner.py, onecomp/qep/_quantize_with_qep*.py)
Updated unit tests accordingly (tests/onecomp/test_qep_general_consistency.py)

`auto_run` / CLI improvements¶

Runner.auto_run(): added eval_original_model parameter to optionally evaluate the original (unquantized) model's perplexity and accuracy (default: False) (onecomp/runner.py)
Runner.auto_run(): evaluation now only computes quantized model metrics by default; pass eval_original_model=True to include original model metrics
CLI: added --eval-original flag to onecomp command (onecomp/cli.py)

GPU memory optimization for model saving¶

save_quantized_model() / save_dequantized_model() now load the base model on CPU (device_map="cpu") instead of GPU when building the save artifact (onecomp/runner.py). Previously the full original model was loaded onto GPU, which was unnecessary for saving and could cause OOM on memory-constrained setups.

Bug fix: Architecture-aware QEP group alignment¶

Fixed non-deterministic crash in compute_hessian_and_crossterm caused by groups_q and groups_f being ordered differently (onecomp/qep/_quantize_with_qep_arch.py). make_grouped_module groups modules by tensor identity (id() + data_ptr()), but after copy.deepcopy the CUDA memory allocator can assign different addresses, causing group misalignment between the quantized and full-precision blocks. Now groups_f is derived from groups_q by module name lookup instead of independent grouping.
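
Deriving one grouping from the other by name can be sketched as below. This is a simplified illustration, not the code in `_quantize_with_qep_arch.py`: `Module`, `align_groups`, and the layer names are hypothetical, and blocks are modelled as plain dicts from module name to module object.

```python
class Module:
    """Tiny stand-in for a layer object."""

    def __init__(self, label):
        self.label = label


def align_groups(groups_q, block_q, block_f):
    """Build full-precision groups in the same order as the quantized groups.

    Each quantized module is mapped to its name, and the module with the
    same name is looked up in the full-precision block.
    """
    name_of = {id(module): name for name, module in block_q.items()}
    return [[block_f[name_of[id(module)]] for module in group]
            for group in groups_q]


# Hypothetical layer names, for illustration only.
names = ("q_proj", "k_proj", "up_proj")
block_q = {name: Module("q:" + name) for name in names}
block_f = {name: Module("f:" + name) for name in names}

groups_q = [[block_q["q_proj"], block_q["k_proj"]], [block_q["up_proj"]]]
groups_f = align_groups(groups_q, block_q, block_f)
```

Because the second grouping is a pure function of the first, the two can no longer disagree in order or membership, regardless of where the copied tensors end up in memory.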

Other fixes in this release¶

Refactored runner evaluation paths and fixed benchmark-based evaluation behavior (onecomp/runner.py, onecomp/utils/accuracy.py)
Examples: updated to pass original_model=True and quantized_model=True explicitly, and to unpack the new triple return value (example/example1.py, example/example2.py)

[v0.4.0] 2026-03-20¶

New Feature: `Runner.auto_run()` Classmethod¶

Added Runner.auto_run() classmethod for one-liner quantization (onecomp/runner.py)
Handles model loading, GPTQ quantization with QEP, evaluation (perplexity + accuracy), and model saving in a single call
Parameters: model_id, wbits (default: 4), groupsize (default: 128), device, qep (default: True), evaluate (default: True), save_dir (default: "auto")
Returns the configured Runner instance for further analysis
Made model_config parameter optional in Runner.__init__() (default: None) to allow Runner() without arguments

New Feature: `onecomp` CLI Command¶

Added onecomp CLI command for terminal-based quantization (onecomp/cli.py)
Usage: onecomp <model_id> [--wbits N] [--groupsize N] [--device DEV] [--no-qep] [--no-eval] [--save-dir DIR]
Thin wrapper around Runner.auto_run()
Added onecomp/__main__.py for python -m onecomp support
Registered console_scripts entry point in pyproject.toml

New Example¶

Added example/example_auto_run.py demonstrating one-liner quantization with Runner.auto_run()

Documentation¶

Updated docs/index.md: Quick Example now shows auto_run and CLI with tabbed view
Restructured docs/getting-started/quickstart.md: auto_run / CLI as the fastest path, step-by-step workflow below
Updated docs/getting-started/installation.md: Added onecomp command examples to Running Commands section
Updated docs/user-guide/basic-usage.md: Added "Quick Path: Runner.auto_run()" section
Updated docs/user-guide/examples.md: Added auto_run and CLI examples at the top
Added docs/user-guide/cli.md: Full CLI reference with all options and usage examples
Updated docs/api/runner.md: Added auto_run to mkdocstrings members
Updated docs/api/index.md: Added cli.py and __main__.py to Module Structure
Updated mkdocs.yml: Added CLI page to navigation
Added "Building Documentation Locally" section to README.md

Python Version Constraint¶

Restricted requires-python to ">=3.12, <3.14" in pyproject.toml
PyTorch does not yet provide wheels for Python 3.14, causing uv sync to fail when uv auto-selects CPython 3.14
Updated uv.lock to reflect the new Python version constraint

[v0.3.7] 2026-03-16¶

GPU Memory Optimization for Architecture-aware QEP¶

Added device field to QEPConfig (onecomp/qep/_qep_config.py)
Specifies the GPU device for block-wise QEP computation (default: "cuda:0")
Eliminates dependency on model_config.device and supports multi-GPU environments
Added device_map parameter to ModelConfig.load_model() (onecomp/model_config.py)
Allows overriding the device placement at load time without affecting existing callers
Optimized run_quantize_with_qep_arch to avoid loading the entire model onto GPU (onecomp/qep/_quantize_with_qep_arch.py)
Model is now loaded on CPU via load_model(device_map="cpu")
Calibration data is prepared on CPU
Only individual transformer blocks are moved to GPU during processing
Added StopForward exception and modified Catcher to halt the forward pass immediately after capturing first-block inputs, avoiding unnecessary computation through remaining layers (onecomp/qep/_quantize_with_qep_arch.py)
Added move_kwargs_to_device helper to recursively move keyword arguments to the target device (onecomp/qep/_quantize_with_qep_arch.py)
Fixed UnboundLocalError when a module in a group is not registered in quantizer.module_to_name (onecomp/qep/_quantize_with_qep_arch.py)
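
The early-exit mechanism can be sketched in plain Python. This shows the general pattern only and is not the library's implementation: the real `Catcher` wraps a transformer block, while `run_model`, `make_block`, and the simplified `Catcher` below are hypothetical stand-ins.

```python
class StopForward(Exception):
    """Raised to abort a forward pass once the needed inputs are captured."""


class Catcher:
    """Records the inputs of the first block, then halts the forward pass."""

    def __init__(self, block):
        self.block = block
        self.inputs = []

    def __call__(self, x, **kwargs):
        self.inputs.append((x, kwargs))
        raise StopForward  # skip this block and everything after it


def run_model(blocks, x):
    """Stand-in for a model forward pass over sequential blocks."""
    for block in blocks:
        x = block(x)
    return x


executed = []  # indices of blocks that actually ran


def make_block(index):
    def block(x, **kwargs):
        executed.append(index)
        return x + 1
    return block


blocks = [make_block(i) for i in range(3)]
blocks[0] = Catcher(blocks[0])

try:
    run_model(blocks, 0)
except StopForward:
    pass  # expected: the inputs are captured, the remaining blocks are skipped
```

After the call, `blocks[0].inputs` holds the first-block inputs and `executed` is empty, so no block computation was spent on the capture pass.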

[v0.3.6] 2026-03-12¶

Completion of Save/Load Pipeline¶

Added new QuantizedModelLoader class (quantized_model_loader.py)
Automatically detects quantization config (GPTQ/DBF) from config.json and loads the model
Reads state_dict from safetensors, replaces layers with quantized layers, and loads into an empty model
Supports automatic device placement via accelerate
Top-level API: exported as onecomp.load_quantized_model()
Added GPTQLinear.from_saved_state() (reconstructs layer from safetensors state_dict)
Added DoubleBinaryLinear.from_saved_state() (same as above)
Revised config.json output format to enable direct inference with vLLM
Added list of quantized layer names to modules_in_block_to_quantize

Forward Implementation for `DoubleBinaryLinear` and `GPTQLinear`¶

GPTQLinear.forward(): Unpacks bit-packed weights → dequantizes → computes the output via F.linear() (with a fast path when GemLite is used)
DoubleBinaryLinear.forward(): Implements 5-stage pipeline (scaling0 → binary_B → scaling2 → binary_A → scaling4) (GemLite compatible)
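
The order of the five stages can be illustrated with a plain-Python sketch on lists. It is not the `DoubleBinaryLinear` implementation: the shapes are an assumption (scaling vectors applied elementwise, binary matrices with entries of +1 or -1 applied as matrix-vector products), and every name other than the stage names is hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def scale(vector, factors):
    """Elementwise product of two equally sized vectors."""
    return [v * f for v, f in zip(vector, factors)]


def double_binary_forward(x, scaling0, binary_b, scaling2, binary_a, scaling4):
    """Apply scaling0 -> binary_B -> scaling2 -> binary_A -> scaling4."""
    h = scale(x, scaling0)      # stage 1: scale the input features
    h = matvec(binary_b, h)     # stage 2: first binary (+1/-1) matrix
    h = scale(h, scaling2)      # stage 3: scale the intermediate features
    h = matvec(binary_a, h)     # stage 4: second binary (+1/-1) matrix
    return scale(h, scaling4)   # stage 5: scale the output features


# Small hypothetical example (values chosen to be exact in floating point).
y = double_binary_forward(
    x=[1.0, 2.0],
    scaling0=[0.5, 0.25],
    binary_b=[[1, -1], [1, 1]],
    scaling2=[2.0, 4.0],
    binary_a=[[1, 1], [-1, 1]],
    scaling4=[0.5, 0.25],
)
```

Keeping the binary matrices separate from the scaling vectors means the dense weight matrix never has to be materialized, which is the property a binary-matmul kernel relies on.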

Expansion of Unit Tests¶

Added new common test base class BaseQuantizeSpec (test_module.py)
test_quantize_layer_returns: Validates type, shape, device, and dtype of quantization results (CPU/CUDA)
test_quantize_layer_reproducibility: Validates reproducibility with the same seed
test_parameters_boundary: Confirms correct behavior with boundary parameter values
test_parameters_abnormal_values_raise: Confirms exceptions are raised for abnormal parameters
test_cpu_gpu_output_match: Validates that CPU/GPU quantization results match
test_quantize_error: Validates quantization error is within tolerance on a 2-layer model
test_forward_error: Validates forward accuracy of inference layer (dequantized output vs inference layer output)
Added dedicated test classes for GPTQ and DBF (test_gptq.py, test_dbf.py)

Fixes to `DBF` and `GPTQ` Quantizers¶

Added parameter validation mechanism via validate_params() during setup() for DBF and GPTQ
Unified and revised dtype (FP16/INT32) and device (CPU) of quantization results

Build System Updates¶

Migrated package and project management to uv and pyproject.toml.
Formatted scripts with black.

QEP Module Refactoring¶

Added QEPConfig dataclass (onecomp/qep/_qep_config.py)
Extracted quantize_with_qep logic into standalone function (onecomp/qep/_quantize_with_qep.py)
Added general flag to QEPConfig for dispatching between generic and architecture-aware implementations
Added stub for architecture-aware QEP quantization (onecomp/qep/_quantize_with_qep_arch.py)
Implemented architecture-aware QEP quantization with block-wise sequential pipeline (onecomp/qep/_quantize_with_qep_arch.py)
Added helper functions: _get_blocks, get_blocks_and_inputs, make_grouped_module, compute_hessian_and_crossterm, forward_input
Added Catcher class for capturing input activations of transformer blocks
Groups layers sharing the same input activations for efficient Hessian/cross-term computation
Extended Quantizer.quantize_with_qep() and adjust_weight() to accept precomputed hessian and delta_hatX (onecomp/quantizer/_quantizer.py)
Fixed _record_quantization_error to handle quant_input_activation=None for architecture-aware QEP (onecomp/quantizer/_quantizer.py)
Fixed architecture-aware QEP to respect num_layers and layer selection by checking quantizer.module_to_name (onecomp/qep/_quantize_with_qep_arch.py)
Fixed architecture-aware QEP to support exclude_layer_keywords: excluded layers are quantized without weight correction (onecomp/qep/_quantize_with_qep_arch.py)
Added consistency test between generic and architecture-aware QEP implementations (tests/onecomp/test_qep_general_consistency.py)
BREAKING: Changed QEPConfig.general default from True to False (architecture-aware implementation is now the default)

GPTQ Refactoring (`onecomp/quantizer/gptq/_gptq.py`)¶

BREAKING: Changed default sym from False to True (symmetric quantization) for both GPTQ class and run_gptq() function. Code relying on the previous asymmetric default must now explicitly pass sym=False.
Expanded GPTQ class docstring with full attribute descriptions and usage examples
Renamed H parameter to hessian in run_gptq() for clarity
Renamed local variable W to matrix_W in run_gptq() for clarity
Changed imports to from style (from torch import nn, from transformers import Conv1D)
Refactored GPTQExcecutor.__init__: replaced register_buffer with explicit None initialization for all attributes
Added docstrings to GPTQExcecutor.quantize(), enabled(), and ready() methods
Updated test_gptq.py boundary/abnormal parameters to reflect new sym=True default

[v0.3.5] 2026-03-05¶

Based on v0.3.4 codebase
Difference from v0.3.4: Changed comments to English
