Changelog
[v1.1.1] 2026-05-21
New Feature: Quantization progress logging
- Added QuantizationProgressTracker (onecomp/utils/quantization_progress.py) that emits a single [progress] INFO line per completed step with done/total, percentage, elapsed time, and a linear ETA estimate; supports an optional thread_safe=True mode for multi-GPU quantization
- Added report_progress: bool = True flag to Runner.__init__ (onecomp/runner.py) and to the underlying entry points run_chunked_quantization (onecomp/runner_methods/chunked_quantization.py), run_multi_gpu_quantization / run_quantization_phase (onecomp/runner_methods/multi_gpu_quantization.py), run_quantize_with_qep (onecomp/qep/_quantize_with_qep.py), and run_quantize_with_qep_arch (onecomp/qep/_quantize_with_qep_arch.py) so long quantization runs (calibration, chunked, multi-GPU, QEP) report progress by default; pass report_progress=False for quiet runs
- Demoted some INFO-level per-layer / per-chunk logs to DEBUG to avoid duplication with the new [progress] line (still available via logging.basicConfig(level=logging.DEBUG) for deep debugging)
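
The progress line described above can be illustrated with a small stand-alone sketch. This is not the QuantizationProgressTracker implementation: the class name ProgressTrackerSketch, the injectable clock argument, and the exact line layout are assumptions made for illustration. It only shows how a linear ETA follows from the average time per completed step.

```python
import logging
import threading
import time

logger = logging.getLogger("progress_sketch")


class ProgressTrackerSketch:
    """Hypothetical stand-in for a step-based progress reporter."""

    def __init__(self, total, thread_safe=False, clock=time.monotonic):
        if total <= 0:
            raise ValueError("total must be positive")
        self.total = total
        self.done = 0
        self._clock = clock  # injectable so the sketch can be tested deterministically
        self._start = clock()
        # A lock is only created when several workers may report steps concurrently.
        self._lock = threading.Lock() if thread_safe else None

    def _format(self):
        elapsed = self._clock() - self._start
        # Linear ETA: remaining steps are assumed to take the average time per finished step.
        eta = elapsed / self.done * (self.total - self.done)
        pct = 100.0 * self.done / self.total
        return (
            f"[progress] {self.done}/{self.total} ({pct:.1f}%) "
            f"elapsed={elapsed:.1f}s eta={eta:.1f}s"
        )

    def step(self):
        """Record one completed step and emit a single INFO line."""
        if self._lock is not None:
            with self._lock:
                self.done += 1
                line = self._format()
        else:
            self.done += 1
            line = self._format()
        logger.info(line)
        return line
```

With logging configured at INFO level, calling step() once per finished layer or chunk would print one line per step; the real tracker's field names and formatting may differ.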
Bug fixes: QEP + JointQ validation
- Raise a clear error when Runner is configured with qep=True and a quantizer that does not support QEP (currently JointQ). Previously the run failed deep inside quantize_with_qep / adjust_weight with a confusing low-level error. Runner.check() now reports e.g. "Quantizer 'JointQ' (or one of its candidate quantizers) does not support QEP (Quantization Error Propagation). Set qep=False, or use a QEP-compatible quantizer (e.g., GPTQ, DBF, AutoBitQuantizer with QEP-compatible candidates)." Implementation: added flag_qep_supported (default True) on Quantizer, set to False on JointQ, and propagated via AutoBitQuantizer._sync_flags (only True when all candidate quantizers support QEP) (quantizer/_quantizer.py, quantizer/jointq/_jointq.py, quantizer/autobit/_autobit.py, runner.py).
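
The flag propagation described above can be sketched as follows. All class names ending in Sketch and the check_qep function are hypothetical stand-ins, not the OneComp classes; only the flag name flag_qep_supported, the "all candidates must support QEP" rule, and the ValueError type come from this changelog.

```python
class QuantizerSketch:
    """Hypothetical base class: QEP is assumed supported unless a subclass opts out."""

    name = "Quantizer"
    flag_qep_supported = True


class GPTQSketch(QuantizerSketch):
    name = "GPTQ"


class JointQSketch(QuantizerSketch):
    name = "JointQ"
    flag_qep_supported = False  # JointQ opts out of QEP


class AutoBitSketch(QuantizerSketch):
    name = "AutoBitQuantizer"

    def __init__(self, candidates):
        self.candidates = list(candidates)
        # The composite supports QEP only when every candidate does.
        self.flag_qep_supported = all(c.flag_qep_supported for c in self.candidates)


def check_qep(quantizer, qep):
    """Fail early with a readable message instead of deep inside the QEP code path."""
    if qep and not quantizer.flag_qep_supported:
        raise ValueError(
            f"Quantizer '{quantizer.name}' (or one of its candidate quantizers) "
            "does not support QEP (Quantization Error Propagation). Set qep=False, "
            "or use a QEP-compatible quantizer."
        )
```

Validating up front keeps the error close to the configuration mistake, which is the point of moving the check into Runner.check().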
Bug fixes: VLM save / load
- Runner.save_quantized_model() now copies all auxiliary *.json and *.jinja files (e.g. preprocessor_config.json, processor_config.json, special_tokens_map.json, chat_template.jinja) from the original model directory to the save directory, so the quantized model is fully self-contained for VLM / multimodal inference. Weight tensors (*.safetensors, *.bin, *.pt, *.pth), weight index files, config.json and generation_config.json are skipped, and any file already written by model.save_pretrained / tokenizer.save_pretrained is preserved (runner.py).
- Source-model directory resolution (incl. huggingface_hub.snapshot_download fallback for Hub IDs) was extracted into a private helper Runner._resolve_source_model_dir() (runner.py).
- load_quantized_model() now re-establishes the lm_head <-> embed_tokens weight tie for models with tie_word_embeddings=True. load_state_dict(..., assign=True) would otherwise leave lm_head.weight as the freshly initialised tensor (typically float16) while embed_tokens.weight got replaced with the checkpoint tensor (typically bfloat16), causing RuntimeError: expected mat1 and mat2 to have the same dtype at the final lm_head matmul during generation. The re-tie is gated on lm_head still being an nn.Linear so it does not interfere when lm_head itself was quantized (quantized_model_loader.py).
- load_quantized_model() now reads torch_dtype from config.json when no explicit torch_dtype is passed by the caller, so the empty model is built in the same dtype as the saved checkpoint. Previously it always defaulted to torch.float16, which left non-quantized VLM submodules (e.g. multi_modal_projector in Cohere2Vision) at fp16 whenever load_state_dict(..., assign=True) could not find their key in the state_dict (quantized_model_loader.py).
- load_quantized_model() now casts any leftover float16 parameters and buffers of non-quantized modules to model.config.torch_dtype after the lm_head re-tie step. Quantized layers (GPTQLinear, DoubleBinaryLinear) and float32 params (e.g. fp32 LayerNorm in mixed-precision models) are deliberately untouched. This generalises the existing lm_head re-tie to any non-quantized module and fixes the dtype mismatch reported in issue 64-3 (RuntimeError: ... c10::Half != c10::BFloat16 on VLM image features) (quantized_model_loader.py).
- Added regression tests tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip) and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
- Loosened test_save_load_pipeline_tinyllama.py and test_save_load_pipeline_qwen3.py save/load round-trip threshold from absolute 1e-3 to relative 1% of the per-tensor logits magnitude (tests/onecomp/pre_process/test_save_load_pipeline_*.py). The original absolute bound was below fp16's representable precision once accumulated through the 22-28 decoder layers of TinyLlama / Qwen3, causing the gptq + save_dequantized cases to fail on aarch64 + Blackwell (GB200) where cuBLAS picks slightly different reduction kernels than reference x86_64 / Hopper hosts. The save/load equivalence intent is preserved via the relative comparison, which is robust to platform-specific fp16 rounding noise.
- Set gpu_memory_utilization=0.78 explicitly when constructing LLM(...) in example/vllm_inference/example_autobit_vllm_inference.py and example/vllm_inference/example_gptq_vllm_inference.py. The vLLM default 0.92 cgroup-OOMs on UMA hosts (e.g. DGX Spark / GB200, 121.7 GiB UMA) because vLLM's startup memory check fails: the residual quantizer process leaves only ~106 GiB free, which is below 0.92 * 121.7 = 111.96 GiB. 0.78 matches the value already used in tests/vllm_plugins/gptq/test_mixed_gptq_e2e.py and is documented in the workspace slurm-submit.mdc rule.
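
The memory arithmetic in the last bullet can be checked with a few lines of Python. This is a deliberately simplified model of the startup check (requested budget = utilization × total memory, compared against free memory); the function name fits_startup_budget is hypothetical and the real vLLM check may account for more than this.

```python
def fits_startup_budget(total_gib, free_gib, gpu_memory_utilization):
    """Return (ok, requested_gib) under the simplified model described above."""
    requested_gib = gpu_memory_utilization * total_gib
    return requested_gib <= free_gib, requested_gib


# Figures quoted in the changelog entry: 121.7 GiB UMA, roughly 106 GiB free.
ok_default, requested_default = fits_startup_budget(121.7, 106.0, 0.92)
ok_lowered, requested_lowered = fits_startup_budget(121.7, 106.0, 0.78)
print(f"0.92 -> {requested_default:.2f} GiB requested, fits: {ok_default}")
print(f"0.78 -> {requested_lowered:.2f} GiB requested, fits: {ok_lowered}")
```

Under this model the 0.92 setting asks for more than the free memory, while 0.78 leaves headroom.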
Logging / observability tweaks
- Runner._copy_auxiliary_files() now emits a matter-of-fact INFO-level log when an auxiliary file from the original model directory is not copied because the destination already contains a file of the same name (typically because tokenizer.save_pretrained wrote it just before, or a previous save_quantized_model call did). The new line is symmetrical to the existing Copied %s to save directory entry so the auxiliary-copy step can be audited end-to-end (runner.py).
- QuantizedModelLoader._cast_fp16_to_target_dtype() now returns the list of fully-qualified parameter / buffer names whose dtype was actually converted instead of a plain count. The post-load INFO log in load_quantized_model() includes those names so it is obvious which non-quantized submodules were normalised by the safety-net cast (e.g. multi_modal_projector.linear_* in Cohere2Vision). Existing tests are updated accordingly and a new test pins the buffer-name reporting (quantized_model_loader.py, tests/onecomp/runner/test_load_excluded_module_dtype.py, tests/onecomp/runner/test_save_quantized_aux_files.py).
- QuantizedModelLoader.load_quantized_model() now detects tie_word_embeddings=True even when the flag is nested in a sub-config (e.g. model.config.text_config.tie_word_embeddings in Llama 3.2-Vision and other torchtune-derived VLMs) by walking one level of sub-configs. Previously the flag was only read from the top-level model.config, so VLMs that placed it in text_config skipped the post-load re-tie; with HF deduplicating lm_head.weight for tied checkpoints, that left lm_head.weight at the empty-model random initial values rather than re-pointing to embed_tokens.weight (quantized_model_loader.py).
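
The one-level sub-config walk from the last bullet might look roughly like the sketch below. The function name should_retie_sketch is hypothetical, and SimpleNamespace objects stand in for real model configs, so attribute discovery via vars() is an assumption about how configs expose their fields.

```python
from types import SimpleNamespace


def should_retie_sketch(config):
    """Return True if tie_word_embeddings is True at the top level or one level down."""
    if getattr(config, "tie_word_embeddings", False) is True:
        return True
    # Walk exactly one level of attributes (e.g. text_config), no deeper.
    for value in vars(config).values():
        if getattr(value, "tie_word_embeddings", False) is True:
            return True
    return False


top_level = SimpleNamespace(tie_word_embeddings=True)
nested = SimpleNamespace(
    tie_word_embeddings=False,
    text_config=SimpleNamespace(tie_word_embeddings=True),
)
all_false = SimpleNamespace(
    tie_word_embeddings=False,
    text_config=SimpleNamespace(tie_word_embeddings=False),
)
unrelated = SimpleNamespace(model_type="example", hidden_size=64)
```

Limiting the walk to one level keeps the check cheap and avoids picking up an unrelated flag buried deep in an arbitrary config tree.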
Tests
- Added regression tests for the save/load fixes above: tests/onecomp/runner/test_save_quantized_aux_files.py (auxiliary-file copy whitelist), tests/onecomp/runner/test_load_tied_embeddings.py (tied-embedding dtype round-trip), and tests/onecomp/runner/test_load_excluded_module_dtype.py (non-quantized module dtype handling, including config-based empty-model dtype default, fp16 safety-net cast, fp32 preservation, and quantized-layer skip).
- Added tests/onecomp/test_runner_check.py for the new qep=True validation path: JointQ + qep=True raises a clear ValueError, while JointQ + qep=False and GPTQ + qep=True both pass Runner.check().
- Added tests/onecomp/runner/test_load_tied_embeddings.py::test_should_retie_word_embeddings_* unit tests covering top-level, nested-text-config, all-False and unrelated-sub-attribute shapes.
New Contributors
- @sotanengel made their first contribution in #13
[v1.1.0] 2026-04-16
Gemma 3 / Gemma 4 & VLM Support
- Auto-detect language_model / text_model sub-modules in setup() so only the language model is quantized; vision_tower, audio_tower, etc. are automatically excluded (quantizer/_quantizer.py)
- Added unfuse_moe.py: MoE models (e.g. Gemma 4) store all expert weights as fused 3D nn.Parameter tensors (gate_up_proj [E, 2*inter, hidden], down_proj [E, hidden, inter]), but GPTQ and other layer-wise PTQ methods require 2D nn.Linear layers. unfuse_moe_experts() splits the fused tensors into per-expert modules, producing paths like experts.0.gate_proj, experts.0.up_proj, experts.0.down_proj (utils/unfuse_moe.py)
- Set quant_method to mixed_gptq for MoE models during save, enabling vLLM to handle a mix of quantized and unquantized expert layers via UnquantizedFusedMoEMethod (runner.py)
- Introduced prepare_block_kwargs to reproduce Gemma 4-specific additional inputs during block-wise forward (runner_methods/chunked_quantization.py, qep/_quantize_with_qep_arch.py)
  - _per_layer_inputs: pre-compute per-layer embeddings for all calibration samples
  - _position_embeddings_map: hook into rotary_emb to capture position embeddings per layer type
  - _attention_mask_map: pre-compute masks per layer type via create_causal_mask / create_sliding_window_causal_mask
- Updated Catcher.forward to accept *args (Gemma 4 passes per_layer_input as a positional argument)
- Added a guard to safely skip KV-shared layers where k_proj / v_proj are never called during forward and X^TX is not accumulated (runner_methods/chunked_quantization.py)
- Added token_type_ids (mm_token_type_ids) required by Gemma 4 to calibration data and PPL computation (utils/calibration.py, utils/perplexity.py)
- Added model argument to prepare_calibration_dataset; model-specific inputs are appended via add_model_specific_inputs()
- Changed model.device to next(model.parameters()).device to support VLM device_map="auto"
- Fixed MoE block partitioning (down_proj and router.proj were incorrectly placed in the same block) and relaxed Hessian input shape assertion for 2D tensors after router dispatch
- Added layer-suffix fallback lookup for Gemma 3's shared sub-modules where named_modules() paths differ from state_dict() keys (quantized_model_loader.py)
- save_quantized_model() now copies processor_config.json from the source model so the quantized model directory is self-contained for multi-modal inference (runner.py)
- Added skip logic in vLLM plugin to prevent vision / audio encoder layers from being incorrectly matched to language model quantization configs (vllm_plugins/utils/module.py)
- Override ModelConfig dtype to bfloat16 for Gemma 3/4 models whose values exceed the float16 range, preventing performance degradation (model_config.py)
- Fixed an issue where non-language-model layers in multi-modal models were included in AutoBit bit allocation
- Bumped transformers requirement from >= 5.3.0 to >= 5.5.0 (pyproject.toml)
  - Gemma 4's model_type: gemma4 is registered in CONFIG_MAPPING starting from 5.5.0 (released 2026-04-02); 5.3.0 fails to load it
- Added cu130 extra for the validation environment (NVIDIA B200, CUDA 13.0); under cu128, torch (cu130) and torchvision (cu128) had a CUDA version mismatch
New Feature: LPCD (Layer-Projected Coordinate Descent)
- Added onecomp/lpcd/ sub-package implementing the LPCD unified framework (arXiv:2512.01546) that extends layer-wise PTQ by jointly optimising sub-module groups (QK / VO / MLP / residual) with closed-form and gradient-based solvers
- Added benchmark/llama3-8b-lpcd-gptq/: Llama-3-8B LPCD+GPTQ SLURM array benchmark (Hydra config conf/benchmark_llama3-8b.yaml, quant_benchmark.py, README.md with WikiText-2 PPL / lm-eval-harness accuracy / quantization time for 4-bit and 3-bit × {q_proj·k_proj, v_proj·o_proj, up_proj·down_proj, all, residual} on NVIDIA B200)
- Added example/example_lpcd_gptq.py: TinyLlama GPTQ 3-bit (groupsize=128) + QEP + LPCD end-to-end example with residual-only closed-form refinement (enable_residual=True, use_closed_form=True) and original / dequantized / quantized perplexity reporting
- Updated README.md: added LPCD to Features, Examples, and Citation sections
New Feature: BlockWisePTQ
- Implemented BlockWisePTQ.run() pipeline (onecomp/post_process/blockwise_ptq.py)
  - Phase 1: per-block distillation with teacher model (GPTQ / DBF / OneBit / Generic)
  - Phase 2: Cross-Block Quantisation (CBQ) sliding-window optimisation (K=2)
  - Teacher model loaded via model_config.load_model(device_map="cpu")
  - Calibration inputs collected via Catcher hook on first transformer block
- Added onecomp/post_process/_blockwise/ sub-package (9 modules)
  - helpers.py: collect_layer_inputs, auto_detect_quantization_strategy, get_transformer_layers, layer_kwargs_to_device, etc.
  - Phase 1 optimisers: gptq_block_optimizer.py, dbf_block_optimizer.py, onebit_block_optimizer.py, generic_block_optimizer.py
  - Phase 2 CBQ optimisers: gptq_cbq_optimizer.py, dbf_cbq_optimizer.py, onebit_cbq_optimizer.py
  - All optimisers use float32 promotion, best-state tracking with rollback, and hard MSE evaluation
- Set use_gemlite=False in Runner.run_post_processes() (onecomp/runner.py) to avoid GemLite fp16-only Triton kernel incompatibility with float32 block optimisation
- Added VLM support for BlockWisePTQ (Qwen3-VL, Qwen2.5-VL, etc.)
  - helpers.py: get_transformer_layers / _get_language_model_backbone handle model.model.language_model.* path
  - model_config.py: load_model() falls back to AutoModelForImageTextToText for VLM configs
- Fixed Quantizer.calculate_hessian / calculate_delta_hatX (onecomp/quantizer/_quantizer.py): handle 2D activations from OPT-style architectures
Quantizer Unification
- Unified scale/zero/integer logic across WeightQuantizer, RTN, and GPTQExcecutor for both symmetric and asymmetric quantisation
  - WeightQuantizer.configure / find_params / quantize (quant_models.py), STEQuantize.forward (quant_models.py), pseudo_quantize_tensor / quantize (rtn/quantizer.py), GPTQExcecutor.configure / find_params (gptq/_gptq.py)
- Added optional MSE grid search (mse, norm, grid) to WeightQuantizer, RTN, and prepare_rotated_model
  - WeightQuantizer.configure / find_params (quant_models.py), pseudo_quantize_tensor (rtn/quantizer.py), run_rtn (rtn/rtn_impl.py), RTN dataclass / validate_params (rtn/_rtn.py), prepare_rotated_model (prepare_rotated_model.py), apply_preprocess_train / _insert_weight_quantizer (train_rotation.py)
- Removed perchannel and maxshrink from public APIs; perchannel=True is now always used internally
  - Removed from RTN dataclass (rtn/_rtn.py) and prepare_rotated_model (prepare_rotated_model.py). Internally, run_rtn (rtn/rtn_impl.py) and _insert_weight_quantizer (train_rotation.py) pass perchannel=True unconditionally. Low-level APIs pseudo_quantize_tensor (rtn/quantizer.py) and WeightQuantizer.configure (quant_models.py) still accept the parameters
Rotation Preprocessing Improvements
- Added "random_hadamard" and "hadamard" rotation modes (existing: "random", "identity")
  - PreprocessManager._ortho (train_rotation.py), _VALID_ROTATION_MODES (prepare_rotated_model.py)
- Changed prepare_rotated_model defaults: rotation_mode → "random_hadamard", num_calibration_samples → 512
  - prepare_rotated_model (prepare_rotated_model.py), PreprocessManager.__init__ (train_rotation.py)
- Added input validation (_validate_prepare_rotated_model_params) for all prepare_rotated_model parameters
  - _validate_prepare_rotated_model_params (prepare_rotated_model.py)
- Added per-step and total execution time logging to prepare_rotated_model
  - prepare_rotated_model (prepare_rotated_model.py): timed sections for model load, calibration prep, training, reload, apply_preprocess_eval, and save
- Added explicit gradient_accumulation_steps=1 to TrainingArguments defaults
  - TrainingArguments.gradient_accumulation_steps (preprocess_args.py)
AutoBit: per-quantizer groupsize support
- AutoBitQuantizer supports each candidate quantizer's groupsize individually, enabling mixed group-size configurations (onecomp/quantizer/autobit/_autobit.py)
- RTN error evaluation uses per-quantizer grouped quantisation (onecomp/quantizer/autobit/ilp.py)
- Added test for mixed group-size autobit (tests/onecomp/quantizer/autobit/test_autobit.py)
- Removed the default quantizer from AutoBit; a quantizer must be explicitly provided (onecomp/quantizer/autobit/_autobit.py)
CalibrationConfig: unified calibration configuration
- Breaking change: Introduced CalibrationConfig dataclass (onecomp/calibration/calibration_config.py) to consolidate all calibration-related parameters
  - Runner.__init__ now accepts calibration_config: CalibrationConfig instead of individual parameters (calibration_dataset, max_length, num_calibration_samples, calibration_strategy, calibration_seed, calibration_batch_size, num_layers_per_group)
  - AutoBitQuantizer now accepts calibration_config: CalibrationConfig instead of num_calib_samples, calib_seqlen, calibration_dataset
  - prepare_rotated_model() now accepts calibration_config: CalibrationConfig instead of calibration_dataset, max_length, num_calibration_samples, calibration_strategy
  - BlockWisePTQ now accepts calibration_config: CalibrationConfig instead of num_calibration_samples, max_length, calibration_strategy, calibration_seed
- When calibration_config=None, default values are created automatically (calibration_dataset="c4", max_length=2048, num_calibration_samples=512)
- New user-configurable parameters exposed via CalibrationConfig: text_key, use_quality_filter, max_documents (previously hard-coded in calibration_data_loader.py)
- Added cross-validation in Runner.check(): if both Runner and AutoBitQuantizer specify calibration_dataset, they must match
- Removed backward-compatibility re-exports from onecomp/utils/__init__.py (prepare_calibration_dataset, load_c4_for_aligned_chunks, load_c4_for_n_samples_min_length); import from onecomp.calibration instead
- Added unit tests for calibration module (tests/onecomp/calibration/)
- Internal functions now accept CalibrationConfig directly instead of individual parameters:
  - prepare_calibration_dataset() (calibration_data_loader.py): replaced 8 individual parameters with calibration_config: CalibrationConfig (required argument)
  - run_chunked_quantization() (runner_methods/chunked_quantization.py): calibration_dataset, max_length, num_calibration_samples, calibration_strategy, calibration_seed, calibration_batch_size, num_layers_per_group replaced by calibration_config
  - run_multi_gpu_quantization(), run_capture_phase(), get_calibration_config_dict() (runner_methods/multi_gpu_quantization.py): same consolidation
  - run_quantize_with_qep() (qep/_quantize_with_qep.py), run_quantize_with_qep_arch() (qep/_quantize_with_qep_arch.py): same consolidation
  - collect_activation_stats_blockwise() (quantizer/autobit/activation_stats.py): num_samples, seqlen, calibration_dataset replaced by calibration_config
- Code quality improvements:
  - CalibrationConfig.calibration_dataset defaults to "c4" instead of None (no more implicit fallback)
  - Removed implicit dataset inheritance from quantizer to Runner; use explicit CalibrationConfig instead
  - Cross-validation uses isinstance(quantizer, AutoBitQuantizer) instead of duck typing
  - Consolidated from .calibration import CalibrationConfig, prepare_calibration_dataset into single import
  - Added missing "concat_rand" strategy to prepare_calibration_dataset() docstring
  - Documented batch_size and num_layers_per_group in CalibrationConfig as chunked-quantization-only parameters
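
The "consolidate parameters into one dataclass, fall back to defaults when None" pattern from this section can be sketched as below. CalibrationConfigSketch is a stand-in defined here for illustration, not the real onecomp CalibrationConfig; it carries only the three fields whose defaults are stated in this changelog, and resolve_calibration_config is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class CalibrationConfigSketch:
    """Stand-in with only the fields whose defaults the changelog states."""

    calibration_dataset: str = "c4"
    max_length: int = 2048
    num_calibration_samples: int = 512


def resolve_calibration_config(calibration_config=None):
    """Mirror the 'None means defaults' behaviour described above."""
    if calibration_config is None:
        return CalibrationConfigSketch()
    return calibration_config


defaults = resolve_calibration_config()
# Compact settings of the kind the example scripts label as demo-only.
demo = resolve_calibration_config(
    CalibrationConfigSketch(max_length=512, num_calibration_samples=128)
)
```

Passing one object instead of seven keyword arguments means every entry point receives the same calibration settings, which is what makes the Runner / AutoBitQuantizer cross-validation straightforward.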
Calibration data: support WikiText-2, custom datasets, and C4 quality filtering
- Refactored onecomp/utils/calibration.py into onecomp/calibration/ folder with submodules
  - calibration_data_loader.py: unified entry point prepare_calibration_dataset() that dispatches by dataset name or file path
  - c4.py: C4 dataset loader with optional quality filtering (check_text_quality())
  - wikitext.py: WikiText-2 dataset loader (new; loads from Salesforce/wikitext)
  - custom.py: custom dataset loader supporting .txt, .json, .jsonl, .csv, .tsv, .parquet, .arrow, and HuggingFace Dataset directories
  - chunking.py: shared chunking strategies (concat_chunk, concat_chunk_align, concat_rand, drop_head, drop_rand) extracted as reusable helpers
- Added calibration_dataset parameter to AutoBitQuantizer to specify the calibration data source (onecomp/quantizer/autobit/_autobit.py)
JointQ
- Added incremental_lambda regularization mode (lambda_mode="incremental_lambda"): for each layer, tries increasing lambda values from lambda_list with warm start, accepting candidates that improve weight error without substantially degrading output error. Stops at the first rejection. Controlled by lambda_list, incremental_eps_y, incremental_eps_w parameters
- Added incremental_initial_skip_ew_threshold to skip an unstable initial lambda=0.0 candidate when its relative weight error is excessively large
- Added accepted_lambda field to JointQResult to record the per-layer lambda chosen in incremental mode
- Added execute_post_processing override to log accepted lambda statistics (mean, median, min, max, per-value counts) after all layers are quantized
- Added regularization_mode parameter: "identity" (standard Tikhonov λI) or "diagonal" (default, importance-aware λ·diag(a) where a_i scales with activation magnitude). Diagonal mode reduces over-regularization of less important columns. Only supported with fixed_lambda mode
- Added regularization_gamma parameter (default 0.5): exponent for diagonal weights in "diagonal" mode; smaller values reduce the spread between weak and strong columns
- Added initialization strategy control: enable_clip_optimize, enable_clip_optimize_ep, enable_gptq parameters to JointQ class
- Added gptq attribute (GPTQ instance) to JointQ class for customizing GPTQ parameters (blocksize, percdamp, mse, q_grid, q_norm). Default GPTQ is auto-created from bits / group_size / symmetric
- Replaced JointQ internal GPTQ module (jointq/core/gptq.py) with OneComp GPTQ (onecomp.quantizer.gptq.GPTQ); GPTQ initial solution is now generated via the shared GPTQ implementation
- Improved numerical stability of scale optimization for ill-conditioned matrices
- Fixed potential division-by-zero in scale computation
Breaking Changes
- AutoBitQuantizer.enable_fused_groups now defaults to True (onecomp/quantizer/autobit/_autobit.py)
  - Ensures that vLLM fused layers (qkv_proj, gate_up_proj) are assigned the same quantizer (same bits and groupsize), which is required for vLLM inference.
  - Previously defaulted to False, which could cause vLLM to reject the model at load time when fused-layer constituents had mismatched configurations.
  - Runner.auto_run() already set enable_fused_groups=True, so this change has no effect on auto_run users.
  - Migration: If you use AutoBitQuantizer with candidate bit-widths not supported by vLLM (e.g. wbits=5), pass enable_fused_groups=False explicitly.
- Quantisation levels unified to unsigned [0, 2^b − 1] (symmetric uses centred zero point); rounding order changed from round(x/s + z) to round(x/s) + z. Outputs are not bit-exact with prior RTN versions
- Changed prepare_rotated_model defaults: rotation_mode "random" → "random_hadamard", num_calibration_samples 128 → 512
- Introduced CalibrationConfig dataclass; see CalibrationConfig section above for migration details
- JointQ class: removed batch_size parameter (use onecomp.quantizer.jointq.core.quantize() directly if batch processing is needed)
- JointQ: GPTQ initial solution is now generated via OneComp GPTQ instead of the internal implementation. Quantization results are not bit-exact with prior versions (quality is equivalent or improved)
- JointQ regularization defaults changed (onecomp/quantizer/jointq/_jointq.py)
  - regularization_lambda: 0.2 → 0.1
  - regularization_mode: "identity" → "diagonal"
  - Quantization results are not bit-exact with prior versions.
  - Migration: to reproduce the previous behavior, pass JointQ(..., regularization_lambda=0.2, regularization_mode="identity") explicitly.
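
The rounding-order change listed above is easy to see on scalars. The sketch below uses Python's built-in round (round-half-to-even) as a stand-in for the tensor rounding in the real code, and takes the centred zero point for symmetric quantisation to be 2^(b-1); both are assumptions for illustration, and the function names are hypothetical.

```python
def quantize_unsigned(x, scale, zero, bits):
    """New order: round(x/s) + z, clamped to the unsigned range [0, 2^b - 1]."""
    maxq = (1 << bits) - 1
    q = round(x / scale) + zero
    return min(max(q, 0), maxq)


def quantize_unsigned_old_order(x, scale, zero, bits):
    """Old order: round(x/s + z), shown only for comparison."""
    maxq = (1 << bits) - 1
    q = round(x / scale + zero)
    return min(max(q, 0), maxq)


def dequantize(q, scale, zero):
    return scale * (q - zero)


bits = 4
centred_zero = 1 << (bits - 1)  # 8 for 4-bit, assumed centred zero point

# Ordinary value: both orders agree and the round trip is exact here.
q = quantize_unsigned(-1.0, 0.25, centred_zero, bits)
restored = dequantize(q, 0.25, centred_zero)

# Tie case: x/s = 0.5 with z = 1 lands on a rounding tie, so the two orders differ.
new_tie = quantize_unsigned(0.5, 1.0, 1, bits)
old_tie = quantize_unsigned_old_order(0.5, 1.0, 1, bits)
```

Differences of this kind appear only on rounding ties (or with non-integer zero points), which is consistent with the note that outputs are close to, but not bit-exact with, earlier versions.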
Bug Fix
- Fixed model_config.py: load_model() VLM fallback did not trigger for models raising "Unrecognized configuration class" (e.g. Cohere2VisionForConditionalGeneration). Added the error pattern to _vlm_hints
- Fixed gptq/_gptq.py: Cholesky decomposition in run_gptq could fail with LinAlgError on ill-conditioned Hessians (observed on large VLMs at deeper layers). Extracted _compute_inverse_hessian() with progressive damping fallback (up to 5 retries, 10x damping increase per retry). No impact on normal operation
- Fixed TypeError in QuantLinear.forward when S_qk scaling was applied to MLP layers (onecomp/pre_process/quant_models.py)
- Fixed JointQ group_size=None (per-channel quantization) raising TypeError
- Fixed wrong module grouping in make_grouped_module where GC-driven id() reuse caused attention projections (q/k/v) and MLP projections (gate/up) to be merged into the same group (qep/_quantize_with_qep_arch.py)
- Fixed silent weight corruption in GPTQLinear when qzero=0 was stored through the GPTQ v1 zero-point path (onecomp/quantizer/gptq/gptq_layer.py)
  - Root cause: AutoGPTQ v1 stores raw_zero - 1, so qzero=0 becomes -1; without masking, its sign-extended bits corrupted neighboring packed slots
  - Pack-side fix (_pack_rows): mask each value with (1 << wbits) - 1 before shift/OR (2/4/8-bit and 3-bit paths)
  - Forward-side fix (GPTQLinear.forward): apply & wbits_mask after the v1 +1 restoration so stored 2^wbits - 1 wraps back to 0; the gptq_v2 path remains unchanged
  - Added regression tests for per-slot pack corruption, packed/unpacked forward paths, the gptq_v2 branch, and the from_saved_state path for GPTQ v1 tensors
  - NOTE: If you have GPTQ models quantized with previous versions, please re-quantize them with this release, as they may contain corrupted internal data.
Examples
- Added example/example_custom_calibration.py: Demonstrates CalibrationConfig with a custom calibration dataset (Python code snippets in example/data/python_calibration.txt). Quantizes TinyLlama with GPTQ 3-bit using both default C4 and custom Python-code calibration, then compares inference outputs across multiple prompts to show how calibration data choice affects quantization quality.
- Added example/post_process/example_blockwise_ptq.py: GPTQ 4-bit quantization + BlockWisePTQ (Phase 1 greedy + Phase 2 CBQ) with PPL comparison
- Added example/example_lpcd_gptq.py: TinyLlama GPTQ 3-bit (groupsize=128) + QEP + LPCD end-to-end example (residual-only closed-form refinement, original / dequantized / quantized perplexity reporting)
- Updated example/vllm_inference/example_gptq_vllm_inference.py: changed model to TinyLlama-1.1B-Chat-v1.0 (chat model), disabled QEP, added CalibrationConfig(num_calibration_samples=128, max_length=512)
- Added NOTE comments across all example/ scripts (example_gptq.py, example_qep_gptq.py, example_jointq.py, example_autobit.py, example_save_load.py, example_lpcd_gptq.py, example_custom_calibration.py, post_process/example_blockwise_ptq.py, post_process/example_lora_sft.py, post_process/example_lora_sft_knowledge.py, pre_process/example_preprocess_save_load.py, vllm_inference/example_gptq_vllm_inference.py) clarifying that the compact CalibrationConfig(max_length=512, num_calibration_samples=128) settings are demo-only and recommending the CalibrationConfig() defaults (max_length=2048, num_calibration_samples=512) for real quantisation; qep=False examples additionally recommend setting batch_size so that Runner.quantize_with_calibration_chunked is used with large calibration data
Documentation
- Updated docs/algorithms/jointq.md: added incremental lambda mode description, acceptance criteria, diagonal regularization mode, new parameters (lambda_mode, lambda_list, incremental_eps_y, incremental_eps_w, incremental_initial_skip_ew_threshold, regularization_mode, regularization_gamma), and usage examples
- Added docs/algorithms/lpcd.md: LPCD overview, motivation, supported submodule groups, usage examples, QEP relationship, parameters, and current support
- Added docs/api/lpcd_config.md and updated mkdocs.yml navigation to include LPCD in the Algorithms and API Reference sections
- Updated LPCD references across docs: docs/index.md, docs/algorithms/overview.md, docs/user-guide/basic-usage.md, docs/user-guide/configuration.md, docs/user-guide/examples.md, and docs/getting-started/quickstart.md
- Updated docs/algorithms/rtn.md: corrected defaults, added MSE parameters, updated algorithm description
- Updated docs/user-guide/pre-process.md: expanded key parameters table, added validation note
- Added "Chat with Open WebUI" section to docs/user-guide/vllm-inference.md: step-by-step guide for connecting Open WebUI to a vLLM server (Docker / pip install with dedicated venv, connection settings, chat usage)
- Added Open WebUI mention to README.md Features and vLLM Inference sections, and docs/index.md Key Features
- Fixed broken example/example1.py references in README.md and docs/getting-started/installation.md (replaced with example/example_gptq.py)
- Added example/example_custom_calibration.py to README.md Examples table under new "Calibration" category
- Removed scratch files from example/: buf.py, buf2.py, run_example.sh
- Fixed outdated install command in docs/user-guide/cli.md (git+URL → pip install onecomp) and added PyTorch prerequisite with link to installation guide
- Removed duplicate "Chunked Calibration" section (was a copy of JointQ section) in docs/user-guide/examples.md
- Added missing CalibrationConfig import to Block-wise PTQ code snippet in docs/user-guide/examples.md
- Fixed eval support note in docs/getting-started/quickstart.md to include AutoBitQuantizer (was "GPTQ and DBF only")
- Added algorithm pages: docs/algorithms/autobit.md (ILP-based mixed-precision) and docs/algorithms/jointq.md (joint optimization)
- Added AutoBit and JointQ with links to docs/algorithms/overview.md Available Algorithms table
- Added docs/api/quantizers/onebit.md (OneBit API reference)
- Updated mkdocs.yml nav: added AutoBit/JointQ algorithm pages, OneBit API page; renamed Post-Process nav title to include Block-wise PTQ
- Added example script links to docs/user-guide/pre-process.md
- Updated docs/algorithms/jointq.md: added initialization strategy parameters, GPTQ customization examples, removed batch_size from parameter table, added group_size=None per-channel documentation
- Added a Troubleshooting section to docs/user-guide/vllm-inference.md describing how to bypass the unconditional DeepGEMM (FP8) kernel warmup for non-FP8 quantization (GPTQ / DBF / Mixed-GPTQ) by setting VLLM_USE_DEEP_GEMM=0 and VLLM_DEEP_GEMM_WARMUP=skip
Tests
- Updated test_jointq.py: added incremental_lambda boundary parameters (lambda_mode, lambda_list, incremental_eps_y, incremental_eps_w, incremental_initial_skip_ew_threshold), abnormal parameter cases (invalid mode, empty list, negative values), GPU integration tests (test_incremental_lambda_basic, test_incremental_lambda_single_step), _accept_candidate unit tests covering all acceptance rules and edge cases; removed batch_size from boundary/abnormal parameter tests
- Added test_prepare_rotated_model.py: validation, E2E pipeline, output threshold (80 combinations), save/load round-trip
- Added test_weight_quantizer.py: RTN/GPTQ consistency, symmetric/asymmetric, group-wise, MSE, STE
- Expanded test_rtn.py: MSE boundary/abnormal parameters
- Added vLLM mixed group-size tests (tests/vllm_plugins/gptq/test_mixed_gptq.py, tests/vllm_plugins/gptq/test_mixed_gptq_e2e.py)
- Updated regression_quantize_helper.py: updated EXPECTED_MSE baseline for OneComp GPTQ integration
Benchmarking
- Updated all benchmark directories for v1.1.0
- Results will be added after benchmark runs complete.
Dependencies
- Updated pyproject.toml and uv.lock: added hydra-core to dev optional dependencies
- Added LPCD tests (tests/onecomp/lpcd/, 25 cases)
  - test_lpcd_config.py: LPCDConfig default / custom values, dataclass field set, top-level from onecomp import LPCDConfig (CPU only)
  - test_lpcd_metrics.py: make_lpcd_metrics() dispatch on synthetic Llama / Qwen3 blocks for every enable_* flag combination, NotImplementedError for unsupported architectures, LpcdMetricGroup.mark_as_ready / is_refineable state transitions (CPU only, no weight download)
  - test_lpcd_runner.py: end-to-end GPTQ + QEP + LPCD on the first TinyLlama decoder block: smoke (Runner.run() completes, all linear layers quantized, dequantized weights finite), QEP + LPCD combination with explicit QEPConfig, behavioural checks (residual-only LPCD modifies o_proj / down_proj beyond the QEP-only baseline while pre-attention q/k/v_proj match the baseline bit-for-bit); auto-skipped on non-CUDA hosts via pytest.mark.skipif
Model Validation¶
- Added model_validation/README.md: parent overview of the operational validation suite that exercises OneComp's end-to-end quantize → save → load → inference workflow across multiple architectures and sizes. Provides cross-recipe At-a-Glance status tables (per-model × per-recipe quantization status, and per-model × per-recipe (save, transformers inference, vllm inference) status), per-recipe result tables, and a Summary section that explicitly cautions against cross-recipe PPL comparison (PPL is reported only as a per-recipe sanity check; the compact calibration in use sits below typical research settings, partly because the calibration size has to fit the DGX Spark 128 GB UMA budget for 7–8B models with QEP on).
- Added model_validation/gptq/: Hydra-driven GPTQ (wbits=4, groupsize=128, qep=False) end-to-end validation across three phases:
  - Phase 1 (quantize + save; validate_gptq.py, conf/validate.yaml): CalibrationConfig(max_length=512, num_calibration_samples=128), saved via runner.save_quantized_model(...), reports original / quantized PPL on wikitext-2-raw-v1.
  - Phase 2 (load + greedy generation; validate_load.py, conf/validate_load.yaml): reloads each saved directory via load_quantized_model and runs "Fujitsu is" with max_new_tokens=32. torch_dtype is overridable (float16 / bfloat16 / float32 / null); gemma-4-E2B requires bfloat16 because the loader's default float16 triggers a Half / BFloat16 mismatch at lm_head.
  - Phase 3 (vLLM offline inference; validate_vllm.py, conf/validate_vllm.yaml): reloads each saved directory via vLLM's offline LLM interface (OneComp's vLLM plugin is auto-registered, so no explicit quantization= argument is required) and runs "Fujitsu is" with temperature=0.0, max_tokens=32, enforce_eager=True, max_model_len=512. LLM(...) and llm.generate(...) are kept inside main() behind if __name__ == "__main__": so vLLM worker subprocess re-imports do not recursively spawn new engines. Currently pending for all five models.
- Added model_validation/qep_gptq/: Hydra-driven GPTQ (wbits=4, groupsize=128, qep=True) end-to-end validation script (validate_gptq.py, conf/validate.yaml, README.md). Calibration: max_length=1024, num_calibration_samples=128 (reduced from defaults to keep 7–8B models within the DGX Spark 128 GB UMA budget with QEP on). Quantize + save only; load / inference is not exercised in this subdirectory yet.
- Added model_validation/autobit/: Hydra-driven AutoBit (target_bit=4, qep=False) end-to-end validation script (validate_autobit.py, conf/validate.yaml, README.md). Candidates: GPTQ(wbits=b, groupsize=128) for b in (2, 3, 4, 8), assignment_strategy="activation_aware", CalibrationConfig(max_length=512, num_calibration_samples=128). Quantize + save only.
- Updated model_validation/autobit_qep/: AutoBit (target_bit=4, qep=True) validation. Reduced calibration to max_length=1024, num_calibration_samples=128 to keep 7–8B models within the DGX Spark 128 GB UMA budget. README expanded with per-model bit-assignment counts (GPTQ_<b>_gs128: <count> layers) for TinyLlama-1.1B, Llama-2-7B, Llama-3-8B, Qwen3-8B, and gemma-4-E2B; documents the bimodal 8-bit / 2-bit ILP collapse on gemma-4-E2B (every module in the first 15 transformer blocks → 8-bit, remaining 20 blocks → 2-bit; quantized PPL diverges to ~10^14, reproduced after reducing calibration from max_length=2048, num_calibration_samples=512) and lists candidate follow-ups (restrict candidate set, disable QEP, switch assignment_strategy).
- Added model_validation/jointq/: Hydra-driven JointQ (bits=4, group_size=128, symmetric=True, qep=False) end-to-end validation script (validate_jointq.py, conf/validate.yaml, README.md). Calibration: CalibrationConfig(max_length=512, num_calibration_samples=128). JointQ does not currently expose a quantized-inference layer (no save_quantized_model / create_quantized_model path), so quality is sanity-checked on the dequantized model (weights reconstructed from JointQ's quantization parameters); save / inference are reported as n/a in the parent At-a-Glance table.
- All five recipes share the same model selection contract: a single model selected via either model_id (Hugging Face Hub) or model_path (local directory), with any field in conf/validate*.yaml overridable on the command line. Default validation set across all recipes is TinyLlama-1.1B, gemma-4-E2B (base), Llama-2-7B, Llama-3-8B, and Qwen3-8B.
Packaging¶
- Bumped minimum transformers requirement from >=5.3.0 to >=5.5.0 (pyproject.toml)
- Added cu130 optional-dependency extra and the pytorch-cu130 wheel index (https://download.pytorch.org/whl/cu130) for CUDA 13 hosts (e.g. NVIDIA B200) (pyproject.toml)
- Pinned the vllm extra to vllm>=0.10 to prevent uv from falling back to legacy versions whose source build requires CUDA_HOME (pyproject.toml)
- Added uv conflicts declarations between the vllm extra and the cpu / cu118 / cu121 / cu124 / cu126 / cu128 extras: vLLM >=0.20 requires torch>=2.10, which is only published for cu130. This forces vllm to be installed only with --extra cu130 and prevents silent fallback to a vllm version incompatible with transformers>=5 at runtime (pyproject.toml)
- Restricted tool.uv.environments to sys_platform == 'linux' and python_full_version >= '3.12', < '3.14' to skip lock splits for unused Windows and out-of-range Python versions (pyproject.toml)
- Added hydra extra to pyproject.toml so hydra-core (used by example/example_autobit.py and the model_validation/{gptq,qep_gptq,autobit,autobit_qep,jointq}/validate_*.py scripts) installs in one step via uv sync --extra <cuXXX> --extra hydra or pip install "onecomp[hydra]", instead of a separate pip install hydra-core after sync. Documented the new extra in README.md and the model_validation/*/README.md files. The model_validation/gptq/ Phase 3 (vLLM inference) additionally requires the vllm extra (uv sync --extra <cuXXX> --extra hydra --extra vllm or pip install "onecomp[hydra]" vllm).
[v1.0.2] 2026-03-31¶
Bug Fix¶
- Fixed ImportError when running onecomp CLI without matplotlib installed; AutoBitQuantizer._visualize() now catches the import error and logs a warning instead of crashing
[v1.0.1] 2026-03-31¶
Packaging¶
- Moved matplotlib from dev extra to new visualize extra in pyproject.toml
- Made visualize_bit_assignment import lazy in onecomp/quantizer/autobit/__init__.py to avoid requiring matplotlib at import time
- Updated installation instructions in README.md and docs/getting-started/installation.md to reflect the new visualize extra
- Updated uv.lock
[v1.0.0] 2026-03-31¶
PyPI Publishing Setup¶
- Added PyPI metadata to pyproject.toml: keywords, classifiers, and project.urls (Homepage, Documentation, Repository, Bug Tracker, Changelog)
- Removed gemlite optional-dependency extra that used direct git URLs (PEP 440 violation); equivalent packages are already in main dependencies
- Added .github/workflows/publish.yml: automated PyPI publishing via Trusted Publishers (OIDC) on GitHub Release
- Updated README.md: installation command changed from pip install git+<URL> to pip install onecomp
- Added dist/ and build/ to .gitignore
- Updated uv.lock
Default Parameter Changes¶
- Changed Runner.__init__ default values for calibration parameters:
  - max_length: 512 → 2048
  - num_calibration_samples: 128 → 512
- Pinned old default values explicitly in all example/ and tests/ files that previously relied on the defaults
Documentation¶
- Updated docs/user-guide/configuration.md to reflect the new default values for max_length and num_calibration_samples
- Added quantizer feature support table to docs/user-guide/basic-usage.md and docs/api/quantizers/base.md
  - Documents which quantizers support save_quantized_model() / create_quantized_model() and quantized-model PPL/ACC evaluation
  - Currently supported: GPTQ, DBF, AutoBitQuantizer (requires get_quant_config() and create_inference_layer())
  - Unsupported quantizers (RTN, JointQ, QUIP, CQ, ARB, QBB, Onebit): PPL/ACC evaluation automatically falls back to the dequantized (FP16) model
- Updated the perplexity/accuracy evaluation note in basic-usage.md to reflect AutoBitQuantizer support and fallback behavior
[v0.5.0] 2026-03-30¶
New Feature: Post-quantization Workflow¶
- Added PostQuantizationProcess abstract base class (onecomp/post_process/_base.py)
  - Defines the interface for post-quantization operations (e.g. block-wise PTQ, fine-tuning)
- Added post_processes parameter to Runner.__init__
  - Accepts a list of PostQuantizationProcess instances
  - After quantization, builds a quantized model on CPU and executes each process in order
  - The processed model is stored as self.quantized_model
- Updated Runner.calculate_perplexity and Runner.calculate_accuracy to use self.quantized_model if available (GPU transfer is handled automatically; device="auto" is resolved to "cuda")
- Added LoRA SFT post-process implementation (onecomp/post_process/post_process_lora_sft.py)
  - Provides learning-based post-quantization fine-tuning for GPTQ-quantized models
  - Public API is exposed as PostProcessLoraSFT
New Feature: Rotation Preprocessing Pipeline (onecomp/pre_process/)¶
SpinQuant/OstQuant-based rotation preprocessing that reduces quantization error by learning optimal rotation matrices before quantization. Supports Llama and Qwen3 architectures.
- Added prepare_rotated_model() (onecomp/pre_process/prepare_rotated_model.py): end-to-end pipeline covering model loading → rotation/scaling training → rotation application → saving
  - Memory-optimized: moves model between CPU/GPU to reduce peak memory (e.g. Qwen3-32B: ~128GB → ~64GB)
- Added RotatedModelConfig (onecomp/rotated_model_config.py): ModelConfig subclass that automatically registers Hadamard hooks on down_proj layers during load_model()
- Added onecomp/pre_process/ package:
  - train_rotation.py: Training pipeline with PreprocessManager (R1/R2/S_* tensor management), HF Trainer subclass, apply_preprocess_train / apply_preprocess_eval
  - optimizer.py: SGDG, i.e. SGD on the Stiefel manifold with Cayley-retraction orthogonal updates (ported from SpinQuant)
  - quant_models.py: WeightQuantizer (RTN proxy) with per-channel / per-tensor / group-wise quantization; quantized decoder layers for Llama and Qwen3
  - rotation_utils.py: fuse_layer_norms, rotate_model, register_online_hadamard_hooks
  - hadamard_utils.py: Hadamard transform utilities and pre-computed matrices (ported from QuIP#)
  - modeling_llama.py / modeling_qwen3.py: Custom ForCausalLM classes that propagate R1 through the forward pass during training
  - preprocess_args.py: TrainingArguments subclass with SGDG-specific LR/momentum fields
- Fixed _PreprocessTrainer to override create_optimizer() instead of create_optimizer_and_scheduler() for transformers >= 5.x compatibility (previously the SGDG optimizer was silently replaced by AdamW)
- Updated Runner.save_dequantized_model() and Runner.save_quantized_model() to warn when saving models loaded with additional preprocessing (e.g., Hadamard hooks)
Added JointQ Quantizer¶
- Added new JointQ quantizer (onecomp/quantizer/jointq/)
  - Local-search-based post-training quantization method that minimizes $\|Y - \hat{W} X^\top\|_F^2$
  - Supports both symmetric and asymmetric quantization (1–4 bits)
  - Group-wise quantization with configurable group size
  - Tikhonov regularization to mitigate over-fitting ($X^\top X + n\lambda I$)
  - Three initialization strategies: Clip-Optimize, Clip-Optimize with Error Propagation, and GPTQ
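One way to read the regularization term, offered only as a sketch: assume Y = W X^T for the original weight W and that X holds n calibration rows (both are assumptions, not stated in this changelog). Replacing X^T X with X^T X + nλI in the quadratic form then amounts to adding a penalty on the distance between the quantized and original weights:

```latex
\begin{aligned}
\|Y - \hat{W} X^\top\|_F^2
  &= \operatorname{tr}\!\big((W - \hat{W})\, X^\top X\, (W - \hat{W})^\top\big), \\
\operatorname{tr}\!\big((W - \hat{W})\,(X^\top X + n\lambda I)\,(W - \hat{W})^\top\big)
  &= \|Y - \hat{W} X^\top\|_F^2 + n\lambda\,\|W - \hat{W}\|_F^2 .
\end{aligned}
```

Whether JointQ applies the term in exactly this form is not confirmed here.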
AutoBitQuantizer vLLM-compatible quantization_config¶
- AutoBitQuantizer now emits mixed_gptq-compatible quantization_config (onecomp/quantizer/autobit/_autobit.py)
- ILP solver now enforces fused-layer equality constraints (onecomp/quantizer/autobit/ilp.py)
  - vLLM fuses q/k/v → qkv_proj and gate/up → gate_up_proj
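The equality constraint can be illustrated with a small checker, shown here as a sketch rather than the ILP code itself. The layer names (q_proj, gate_proj, and so on), the FUSED_GROUPS table, and the function name are assumptions chosen for illustration; the idea is only that every member of a fused group inside one block must receive the same bit-width.

```python
# Hypothetical fused groups: projections that vLLM merges into one layer
# (q/k/v -> qkv_proj, gate/up -> gate_up_proj).
FUSED_GROUPS = (
    ("q_proj", "k_proj", "v_proj"),
    ("gate_proj", "up_proj"),
)


def fused_group_violations(bits_by_layer):
    """Return (parent_prefix, group) pairs whose members have differing bit-widths.

    bits_by_layer maps full layer names such as
    "model.layers.0.self_attn.q_proj" to an integer bit-width.
    """
    seen = {}
    for name, bits in bits_by_layer.items():
        prefix, _, leaf = name.rpartition(".")
        for group in FUSED_GROUPS:
            if leaf in group:
                seen.setdefault((prefix, group), set()).add(bits)
    return sorted(key for key, widths in seen.items() if len(widths) > 1)


assignment = {
    "model.layers.0.self_attn.q_proj": 4,
    "model.layers.0.self_attn.k_proj": 4,
    "model.layers.0.self_attn.v_proj": 8,  # breaks the q/k/v equality
    "model.layers.0.mlp.gate_proj": 2,
    "model.layers.0.mlp.up_proj": 2,
}
violations = fused_group_violations(assignment)
```

An assignment that satisfies the constraint yields an empty list, which is the condition a mixed-precision checkpoint needs before its projections can be fused.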
API changes¶
- Made Runner.create_quantized_model() a public method (renamed from _create_quantized_model)
  - Builds a quantized model with quantized inference layers from quantizer.results
  - Returns (model, tokenizer) for use in evaluation, saving, and post-process workflows
- Added Runner.save_quantized_model_pt() for saving post-processed models (e.g. LoRA-applied) as PyTorch .pt files
  - Uses torch.save to preserve custom module types such as LoRAGPTQLinear
  - Saves tokenizer files alongside the model
- Added QuantizedModelLoader.load_quantized_model_pt() for loading .pt-format models
  - Counterpart to save_quantized_model_pt; uses torch.load to restore models with custom modules
  - Also available as onecomp.load_quantized_model_pt() convenience alias
Bug Fix: Onebit Quantizer¶
- Fixed Onebit to declare flag_calibration=True and flag_hessian=True (onecomp/quantizer/onebit/_onebit.py)
  - Previously, Onebit computed the Hessian internally from input despite declaring all flags as False, causing a crash when used through quantize_without_calibration or chunked quantization paths
  - Now uses the Hessian provided by the Runner, consistent with other calibration-based quantizers (GPTQ, DBF, QUIP)
Quantizer Signature Consistency¶
- Added input=None default to quantize_layer in RTN, CQ, QBB (onecomp/quantizer/{rtn,cq,qbb}/)
  - Aligns with the base Quantizer.quantize_layer(self, module, input=None, hessian=None) signature
  - Enables these quantizers to be used in Runner(quantizers=[...]) via the chunked quantization path
- Added input=None, hessian=None defaults to Onebit.quantize_layer for the same reason
Examples¶
- Added example/post_process/example_lora_sft.py: End-to-end demo covering GPTQ 4-bit quantization + LoRA SFT (WikiText-2) + PPL evaluation + save/load with save_quantized_model_pt / load_quantized_model_pt
- Added example/post_process/example_lora_sft_knowledge.py: Knowledge injection demo that teaches the quantized model about "OneCompression" via LoRA SFT and compares generation before/after
- Added example/post_process/onecomp_knowledge.jsonl: Training data describing OneCompression for the knowledge injection example
- Added example/example_jointq.py: JointQ 4-bit (groupsize=128) quantization example with dequantized model PPL evaluation
- Added example/pre_process/example_llama_preprocess_rtn.py: Rotation preprocessing + RTN quantization (TinyLlama-1.1B)
- Added example/pre_process/example_preprocess_save_load.py: Rotation preprocessing + GPTQ quantization → save → load → PPL verification
- Added example/vllm_inference/example_gptq_vllm_inference.py: GPTQ + QEP quantization and vLLM inference end-to-end example
- Added example/vllm_inference/example_autobit_vllm_inference.py: AutoBit mixed-precision quantization and vLLM inference example
Documentation¶
- Added docs/user-guide/post-process.md: LoRA SFT user guide covering accuracy recovery, knowledge injection, save/load, key parameters, teacher distillation, intermediate block alignment, and vLLM limitations
- Added docs/api/post_process.md: API reference for PostQuantizationProcess, PostProcessLoraSFT, and convenience variants
- Updated docs/user-guide/examples.md with LoRA SFT code examples (accuracy recovery, knowledge injection, save/load)
- Updated docs/api/runner.md to include create_quantized_model and save_quantized_model_pt
- Updated docs/api/quantized_model_loader.md to include load_quantized_model_pt
- Updated mkdocs.yml navigation with new post-process pages
- Added docs/user-guide/pre-process.md: Rotation preprocessing user guide covering workflow, key parameters, save/load, and limitations
- Added docs/api/pre_process.md: API reference for prepare_rotated_model and RotatedModelConfig
- Updated docs/user-guide/examples.md with rotation preprocessing code examples (RTN, GPTQ with save/load)
- Updated docs/api/index.md with RotatedModelConfig, prepare_rotated_model, and pre_process/ module structure
- Updated docs/index.md Key Features with rotation preprocessing
Tests¶
- Added smoke test for PostProcessLoraSFT (tests/onecomp/post_process/test_post_process_lora_sft.py)
  - Verifies that PostProcessLoraSFT.run() completes without error on TinyLlama with minimal settings
  - Checks LoRA layer injection, CPU placement, and eval mode after run
  - Includes Runner end-to-end integration test with post_processes parameter
- Expanded and updated unit tests for DBF quantizer (tests/onecomp/quantizer/dbf/test_dbf.py)
  - Extended boundary and abnormal parameter cases; aligned with BaseQuantizeSpec and current DBF API
- Expanded and updated unit tests for GPTQ quantizer (tests/onecomp/quantizer/gptq/test_gptq.py)
  - Extended boundary and abnormal parameter cases; aligned with BaseQuantizeSpec and current GPTQ API
- Adjusted DBF and GPTQ quantizer implementations for test compatibility and consistency (onecomp/quantizer/dbf/_dbf.py, onecomp/quantizer/gptq/_gptq.py)
- Fixed and improved JointQ unit tests (tests/onecomp/quantizer/jointq/test_jointq.py)
  - Use compute_dequantized_weight() instead of direct dequantized_weight access
  - Override boundary test to use CUDA with 128×128 layers for group_size compatibility
  - Skip CPU-only tests (JointQ is GPU-based)
  - Fix batch_size validation: >= 0 → >= 1 (onecomp/quantizer/jointq/_jointq.py)
- Improved JointQ regression test (tests/onecomp/quantizer/jointq/test_quantize_regression.py)
  - Replaced exact tensor match with MSE-based quality check for environment portability
  - Hardcoded expected MSE in helper; removed .pth baseline file
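The move from exact tensor matching to an MSE-based check can be sketched in plain Python. This is a minimal illustration, not the test helper itself: the function names, the 5% relative tolerance, and the sample values are all assumptions.

```python
def mean_squared_error(actual, expected):
    """Mean of squared element-wise differences between two equal-length sequences."""
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    return sum((a - e) ** 2 for a, e in zip(actual, expected)) / len(actual)


def within_baseline(mse, expected_mse, rel_tol=0.05):
    """True when mse lies within rel_tol (relative) of the stored baseline.

    rel_tol is an illustrative choice; the real tolerance is not documented here.
    """
    return abs(mse - expected_mse) <= rel_tol * expected_mse


# Illustrative values only: original weights and their dequantized counterparts.
original = [0.10, -0.20, 0.30, -0.40]
dequantized = [0.125, -0.25, 0.25, -0.375]
mse = mean_squared_error(dequantized, original)
ok = within_baseline(mse, expected_mse=0.0015625)
```

Comparing a scalar error against a stored baseline tolerates small numerical differences between GPUs and library versions, which an element-wise equality check does not.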
[v0.4.3] 2026-03-26¶
Implement AutoBit to automatically determine bit-allocation¶
- Added AutoBitQuantizer (onecomp/quantizer/autobit/_autobit.py), which automatically assigns the optimal bit-width per module by solving an ILP over activation-aware error (onecomp/quantizer/autobit/ilp.py), with a DBF fallback (onecomp/quantizer/autobit/dbf_fallback.py) for ultra-low-bit targets (target bit-width <= 2)
- Uses the SCIP solver to solve the ILP (onecomp/quantizer/autobit/ilp.py)
- Sequentially loads and forwards each layer to collect activation and curvature statistics (onecomp/quantizer/autobit/activation_stats.py, onecomp/utils/blockwise.py)
- Usage example: example/example3.py
- Added VRAM auto-estimation utility to derive target bit-width from available GPU memory (onecomp/utils/vram_estimator.py)
VLM and Multi-Architecture Support for Architecture-aware QEP¶
- Extended _get_blocks to detect language_model sub-module and restrict block search to the text decoder (onecomp/qep/_quantize_with_qep_arch.py)
  - VLMs (Qwen3-VL, Gemma3, etc.) no longer return vision-encoder blocks
  - CausalLM behaviour is unchanged (falls back to full-model search)
- Added __getattr__ proxy to Catcher to forward attribute access to the wrapped module (onecomp/qep/_quantize_with_qep_arch.py)
  - Prevents AttributeError when model code reads decoder-layer attributes (e.g. attention_type) before forward()
- Changed get_blocks_and_inputs to capture block-level kwargs with batch=1 (onecomp/qep/_quantize_with_qep_arch.py)
  - Internally generated kwargs (position_embeddings, attention_mask, etc.) are now batch-size-independent
  - Avoids shape mismatches when reused with varying batch sizes in downstream functions
- Added expand_kwargs_batch helper to expand batch=1 kwargs via Tensor.expand (zero-copy view) (onecomp/qep/_quantize_with_qep_arch.py)
  - Used in compute_hessian_and_crossterm and forward_input before each block forward call
  - Resolves failures on models requiring exact batch-dimension matching (e.g. Gemma3 sliding-window attention)
- Added early termination and group skipping to run_quantize_with_qep_arch (onecomp/qep/_quantize_with_qep_arch.py)
  - Groups with no quantization targets are skipped (avoids unnecessary Hessian/cross-term computation)
  - Block loop exits once all target layers are quantized
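The attribute-forwarding idea behind the Catcher proxy can be shown with plain Python classes. This is a sketch under assumptions: DecoderLayer, its attention_type value, and the captured list are stand-ins, and the real Catcher wraps a PyTorch module, whose own attribute handling is not reproduced here.

```python
class DecoderLayer:
    """Stand-in for a decoder block exposing an attribute that model code reads."""

    attention_type = "sliding_attention"  # hypothetical value

    def forward(self, hidden_states):
        return hidden_states


class Catcher:
    """Wraps a block, records the inputs it receives, and proxies attributes."""

    def __init__(self, module):
        self.module = module
        self.captured = []

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal lookup fails, so
        # `module` and `captured` never reach this method once set.
        # The guard avoids infinite recursion if `module` is not set yet.
        if name == "module":
            raise AttributeError(name)
        return getattr(self.module, name)

    def forward(self, hidden_states):
        self.captured.append(hidden_states)
        return self.module.forward(hidden_states)


wrapped = Catcher(DecoderLayer())
layer_kind = wrapped.attention_type  # forwarded to the wrapped layer
output = wrapped.forward([1.0, 2.0])
```

Without the proxy, reading wrapped.attention_type would raise AttributeError because the wrapper itself defines no such attribute.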
End-to-end CLI tests¶
- Added tests/onecomp/test_cli.py: end-to-end tests that verify onecomp TinyLlama/... CLI runs without errors
  - test_default_full_run: full default pipeline (AutoBit + QEP + eval + save) on GPU
  - Variant tests for individual options (--wbits, --no-qep, --total-vram-gb, --groupsize, --save-dir, etc.) on CPU
  - Variant tests are skipped by default; enable with RUN_CLI_VARIANT_TESTS=1
  - Uses python -m onecomp to avoid implicit uv sync that could modify the environment
Fixes¶
- Fixed crash when DBF quantization fails with NaN/Inf (onecomp/quantizer/dbf/_dbf.py, onecomp/qep/_quantize_with_qep_arch.py)
  - _quantize_with_qep_arch.py: Catches ValueError / NotImplementedError from compute_dequantized_weight(), logs the error, and keeps QEP-adjusted weights for the failed layer
- Fixed GemLite import crash when PyTorch version is incompatible (onecomp/quantizer/gemlite.py)
  - Broadened except ImportError to except (ImportError, AttributeError) so that GemLite gracefully falls back when torch lacks newer dtypes (e.g. float8_e8m0fnu)
- Fixed test_dbf_gemlite.py to skip when GemLite is unavailable instead of crashing (tests/vllm_plugins/dbf/test_dbf_gemlite.py)
Dependency and documentation updates¶
- Added vllm as an optional dependency (--extra vllm) in pyproject.toml
  - Prevents environment corruption caused by uv pip install vllm being overwritten by subsequent uv sync / uv run
- Added torchvision to CUDA extras and [tool.uv.sources] in pyproject.toml to prevent CUDA version mismatch
- Updated installation docs to reflect new extras (README.md, docs/getting-started/installation.md, docs/user-guide/vllm-inference.md)
- Updated uv.lock
[v0.4.2] 2026-03-25¶
Unit tests for additional quantizers¶
- Added unit tests for QBB, RTN, QUIP, ONEBIT, CQ, ARB, and JOINTQ
  - New test modules under tests/onecomp/quantizer/: test_qbb.py, test_rtn.py, test_quip.py, test_onebit.py, test_cq.py, test_arb.py, test_jointq.py
  - Shared test base and helpers updated in tests/onecomp/quantizer/test_module.py
  - Quantizer implementations adjusted for test compatibility: onecomp/quantizer/qbb/, onecomp/quantizer/rtn/, onecomp/quantizer/quip/, onecomp/quantizer/onebit/, onecomp/quantizer/arb/, onecomp/quantizer/jointq/ (and related *_impl.py); minor updates in onecomp/quantizer/dbf/_dbf.py, onecomp/quantizer/gptq/_gptq.py
vLLM plugin integration (DBF, Mixed-GPTQ)¶
- Added vLLM plugin implementation for DBF and Mixed-GPTQ
  - New vllm_plugins package: vllm_plugins/__init__.py, DBF and GPTQ plugin entry points (vllm_plugins/dbf/, vllm_plugins/gptq/)
  - DBF: vllm_plugins/dbf/vllm_plugin.py and modules (vllm_plugins/dbf/modules/gemlite_linear.py, vllm_plugins/dbf/modules/naive.py); shared utilities in vllm_plugins/utils/module.py
  - GPTQ: vllm_plugins/gptq/vllm_plugin.py for Mixed-GPTQ inference
  - Tests: tests/vllm_plugins/dbf/test_dbf_gemlite.py, tests/vllm_plugins/dbf/test_dbf_naive.py
  - Package and dependency wiring in pyproject.toml
Fixes¶
- Mixed-GPTQ: raise an error when quantization bit widths differ within the same shard (align with DBF behavior) (vllm_plugins/gptq/vllm_plugin.py)
[v0.4.1] 2026-03-19¶
Mixed GPTQ/DBF Save/Load¶
- Extended Save/Load for mixed GPTQ and mixed DBF
  - QuantizedModelLoader now loads models with quant_method mixed_gptq or mixed_dbf (onecomp/quantized_model_loader.py)
  - effective_method treats mixed_* as the same tensor format as the base method (gptq / dbf) and resolves per-layer bit-width via quantization_bits
  - Load validates quant_method, quantization_bits, and modules_in_block_to_quantize from config.json's quantization_config
- GPTQ
  - Added get_quant_config() to return save-time quantization_config with vLLM-compatible keys (onecomp/quantizer/gptq/_gptq.py)
  - Sets quant_method to mixed_gptq when module_wbits or mlp_wbits is present
  - New onecomp/quantizer/gptq/config.py: resolve_gptq_layer_wbits() resolves per-layer bit-width from quantization_config (priority: quantization_bits → module_wbits → mlp_wbits → bits/wbits)
  - GPTQLinear: extended to accept bit-width when restoring from saved state (onecomp/quantizer/gptq/gptq_layer.py)
- DBF
  - Added get_quant_config() to return save-time quantization_config (onecomp/quantizer/dbf/_dbf.py)
  - New onecomp/quantizer/dbf/config.py: resolve_dbf_layer_bits() resolves per-layer bit-width from quantization_config (priority: quantization_bits → module_target_bits → mlp_target_bits → bits)
  - DoubleBinaryLinear: added argument for target bit-width (for mixed_dbf) (onecomp/quantizer/dbf/dbf_layer.py)
- Shared
  - onecomp/utils/quant_config.py: added common helper get_quant_param() for quantization_config schema (fetch params by alias keys)
  - Quantizer.finalize_quant_config_for_save() hook added; subclasses (GPTQ/DBF) inject method-specific metadata (onecomp/quantizer/_quantizer.py)
  - runner: set quantization_config when saving (onecomp/runner.py)
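The GPTQ priority order (quantization_bits → module_wbits → mlp_wbits → bits/wbits) can be sketched as a fallback chain. Everything about the schema below is an assumption made for illustration: that quantization_bits and module_wbits are dicts keyed by layer name, that mlp_wbits is a single integer applied to names containing ".mlp.", and the function name itself. Only the priority order comes from the entry above.

```python
def resolve_layer_wbits(quant_config, layer_name, default=None):
    """Hypothetical resolver that walks the documented priority order."""
    # 1. Explicit per-layer entry (assumed: dict of layer name -> bits).
    per_layer = quant_config.get("quantization_bits") or {}
    if layer_name in per_layer:
        return per_layer[layer_name]
    # 2. Per-module override (assumed: dict of layer name -> bits).
    module_wbits = quant_config.get("module_wbits") or {}
    if layer_name in module_wbits:
        return module_wbits[layer_name]
    # 3. MLP-wide override (assumed: one integer for all MLP layers).
    mlp_wbits = quant_config.get("mlp_wbits")
    if mlp_wbits is not None and ".mlp." in layer_name:
        return mlp_wbits
    # 4. Global bit-width under either alias key.
    for key in ("bits", "wbits"):
        if key in quant_config:
            return quant_config[key]
    return default


config = {
    "bits": 4,
    "mlp_wbits": 3,
    "module_wbits": {"layers.0.self_attn.q_proj": 8},
    "quantization_bits": {"layers.0.mlp.down_proj": 2},
}
resolved = {
    name: resolve_layer_wbits(config, name)
    for name in (
        "layers.0.mlp.down_proj",
        "layers.0.self_attn.q_proj",
        "layers.0.mlp.up_proj",
        "layers.0.self_attn.k_proj",
    )
}
```

Placing the most specific source first means a saved mixed-precision assignment always wins over the coarser defaults.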
Evaluation and benchmark (Runner and accuracy utils)¶
- Runner: unified perplexity/accuracy evaluation via _calculate_evaluation() and added optional dequantized_model evaluation (onecomp/runner.py)
- BREAKING: calculate_perplexity() / calculate_accuracy() now return a 3-tuple (original, dequantized, quantized) instead of 2-tuple (original, quantized). Existing code using orig, quant = runner.calculate_perplexity() must be updated to unpack three values. (onecomp/runner.py)
- BREAKING: calculate_perplexity() / calculate_accuracy() default for original_model changed from True to False. To evaluate the original model, pass original_model=True explicitly. (onecomp/runner.py)
- Benchmark: benchmark_perplexity() / benchmark_accuracy() now accept dequantized_model and quantized_model arguments. When dequantized_model=True, the result dict includes "{name}_dequantized" keys. (onecomp/runner.py)
- lm_eval: added helper to create HFLM while temporarily disabling model.config.quantization_config for compatibility (onecomp/utils/accuracy.py)
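For calling code that has to run against releases on both sides of the tuple change, a small normalizer is one possible approach. This is a sketch: the helper name is hypothetical, the sample numbers are made up, and using None for the missing dequantized slot is an assumption of the sketch.

```python
def unpack_evaluation(result):
    """Normalize an old 2-tuple or new 3-tuple to (original, dequantized, quantized)."""
    if len(result) == 3:
        return tuple(result)
    if len(result) == 2:
        original, quantized = result
        return (original, None, quantized)
    raise ValueError(f"expected 2 or 3 values, got {len(result)}")


# Illustrative values standing in for calculate_perplexity() results.
old_style = (5.9, 6.4)
new_style = (5.9, 6.3, 6.4)
orig_old, deq_old, quant_old = unpack_evaluation(old_style)
orig_new, deq_new, quant_new = unpack_evaluation(new_style)
```

Code that targets only the newer releases can simply unpack three values directly.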
Dequantized-weight API and compatibility fixes¶
- Implemented compute_dequantized_weight() for GPTQ and DBF quantizers (onecomp/quantizer/gptq/_gptq.py, onecomp/quantizer/dbf/_dbf.py)
- Removed dequantized_weight from Result classes and switched call sites to compute it via compute_dequantized_weight() (onecomp/quantizer/_quantizer.py, onecomp/runner_methods/*)
- Fixed compatibility for quantization methods other than DBF/GPTQ in runner and QEP paths (onecomp/runner.py, onecomp/qep/_quantize_with_qep*.py)
- Updated unit tests accordingly (tests/onecomp/test_qep_general_consistency.py)
auto_run / CLI improvements¶
- Runner.auto_run(): added eval_original_model parameter to optionally evaluate the original (unquantized) model's perplexity and accuracy (default: False) (onecomp/runner.py)
- Runner.auto_run(): evaluation now only computes quantized model metrics by default; pass eval_original_model=True to include original model metrics
- CLI: added --eval-original flag to onecomp command (onecomp/cli.py)
GPU memory optimization for model saving¶
- save_quantized_model() / save_dequantized_model() now load the base model on CPU (device_map="cpu") instead of GPU when building the save artifact (onecomp/runner.py). Previously the full original model was loaded onto GPU, which was unnecessary for saving and could cause OOM on memory-constrained setups.
Bug fix: Architecture-aware QEP group alignment¶
- Fixed non-deterministic crash in compute_hessian_and_crossterm caused by groups_q and groups_f being ordered differently (onecomp/qep/_quantize_with_qep_arch.py). make_grouped_module groups modules by tensor identity (id() + data_ptr()), but after copy.deepcopy the CUDA memory allocator can assign different addresses, causing group misalignment between the quantized and full-precision blocks. Now groups_f is derived from groups_q by module name lookup instead of independent grouping.
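The name-based alignment can be illustrated without PyTorch. In this sketch the Linear class, the dict-shaped blocks, and the align_groups_by_name helper are all stand-ins invented for illustration; the point is only that the second grouping is derived from the first through names, so its order cannot drift.

```python
import copy


class Linear:
    """Stand-in for a layer; real code would use torch.nn.Linear."""

    def __init__(self, tag):
        self.tag = tag


def align_groups_by_name(groups_q, module_to_name, modules_f_by_name):
    """Build groups_f with the same shape and order as groups_q via name lookup."""
    return [
        [modules_f_by_name[module_to_name[module]] for module in group]
        for group in groups_q
    ]


block_q = {"q_proj": Linear("q"), "k_proj": Linear("k"), "o_proj": Linear("o")}
block_f = copy.deepcopy(block_q)  # independent objects, same names

# Grouping computed once on the quantized block (the order is arbitrary here).
groups_q = [[block_q["o_proj"]], [block_q["q_proj"], block_q["k_proj"]]]
module_to_name = {module: name for name, module in block_q.items()}

groups_f = align_groups_by_name(groups_q, module_to_name, block_f)
```

Because no identity or address of the copied modules is consulted, the result is deterministic whatever addresses the allocator hands out.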
Other fixes in this release¶
- Refactored runner evaluation paths and fixed benchmark-based evaluation behavior (onecomp/runner.py, onecomp/utils/accuracy.py)
- Examples: updated to pass original_model=True and quantized_model=True explicitly, and to unpack the new triple return value (example/example1.py, example/example2.py)
[v0.4.0] 2026-03-20¶
New Feature: Runner.auto_run() Classmethod¶
- Added Runner.auto_run() classmethod for one-liner quantization (onecomp/runner.py)
  - Handles model loading, GPTQ quantization with QEP, evaluation (perplexity + accuracy), and model saving in a single call
  - Parameters: model_id, wbits (default: 4), groupsize (default: 128), device, qep (default: True), evaluate (default: True), save_dir (default: "auto")
  - Returns the configured Runner instance for further analysis
- Made model_config parameter optional in Runner.__init__() (default: None) to allow Runner() without arguments
New Feature: onecomp CLI Command¶
- Added
onecompCLI command for terminal-based quantization (onecomp/cli.py) - Usage:
onecomp <model_id> [--wbits N] [--groupsize N] [--device DEV] [--no-qep] [--no-eval] [--save-dir DIR] - Thin wrapper around
Runner.auto_run() - Added
onecomp/__main__.pyforpython -m onecompsupport - Registered
console_scriptsentry point inpyproject.toml
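How a thin wrapper could map the documented flags onto auto_run() keyword arguments is sketched below with argparse. This is not the contents of onecomp/cli.py: the function names are hypothetical, the --device default of None is an assumption, and the model id in the example is a placeholder.

```python
import argparse


def build_parser():
    """Parser mirroring the documented usage line; defaults follow auto_run()."""
    parser = argparse.ArgumentParser(prog="onecomp")
    parser.add_argument("model_id")
    parser.add_argument("--wbits", type=int, default=4)
    parser.add_argument("--groupsize", type=int, default=128)
    parser.add_argument("--device", default=None)  # assumed default
    # --no-qep / --no-eval flip booleans that default to True.
    parser.add_argument("--no-qep", dest="qep", action="store_false")
    parser.add_argument("--no-eval", dest="evaluate", action="store_false")
    parser.add_argument("--save-dir", dest="save_dir", default="auto")
    return parser


def to_auto_run_kwargs(argv):
    """Turn a list of CLI arguments into keyword arguments for auto_run()."""
    return vars(build_parser().parse_args(argv))


# "org/model" is a placeholder model id.
kwargs = to_auto_run_kwargs(["org/model", "--wbits", "3", "--no-qep"])
```

The resulting dict could be passed as Runner.auto_run(**kwargs), since its keys match the parameter names listed for auto_run() above.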
New Example¶
- Added example/example_auto_run.py demonstrating one-liner quantization with Runner.auto_run()
Documentation¶
- Updated docs/index.md: Quick Example now shows auto_run and CLI with tabbed view
- Restructured docs/getting-started/quickstart.md: auto_run / CLI as the fastest path, step-by-step workflow below
- Updated docs/getting-started/installation.md: Added onecomp command examples to Running Commands section
- Updated docs/user-guide/basic-usage.md: Added "Quick Path: Runner.auto_run()" section
- Updated docs/user-guide/examples.md: Added auto_run and CLI examples at the top
- Added docs/user-guide/cli.md: Full CLI reference with all options and usage examples
- Updated docs/api/runner.md: Added auto_run to mkdocstrings members
- Updated docs/api/index.md: Added cli.py and __main__.py to Module Structure
- Updated mkdocs.yml: Added CLI page to navigation
- Added "Building Documentation Locally" section to README.md
Python Version Constraint¶
- Restricted requires-python to ">=3.12, <3.14" in pyproject.toml
  - PyTorch does not yet provide wheels for Python 3.14, causing uv sync to fail when uv auto-selects CPython 3.14
- Updated uv.lock to reflect the new Python version constraint
[v0.3.7] 2026-03-16¶
GPU Memory Optimization for Architecture-aware QEP¶
- Added device field to QEPConfig (onecomp/qep/_qep_config.py)
  - Specifies the GPU device for block-wise QEP computation (default: "cuda:0")
  - Eliminates dependency on model_config.device and supports multi-GPU environments
- Added device_map parameter to ModelConfig.load_model() (onecomp/model_config.py)
  - Allows overriding the device placement at load time without affecting existing callers
- Optimized run_quantize_with_qep_arch to avoid loading the entire model onto GPU (onecomp/qep/_quantize_with_qep_arch.py)
  - Model is now loaded on CPU via load_model(device_map="cpu")
  - Calibration data is prepared on CPU
  - Only individual transformer blocks are moved to GPU during processing
- Added StopForward exception and modified Catcher to halt the forward pass immediately after capturing first-block inputs, avoiding unnecessary computation through remaining layers (onecomp/qep/_quantize_with_qep_arch.py)
- Added move_kwargs_to_device helper to recursively move keyword arguments to the target device (onecomp/qep/_quantize_with_qep_arch.py)
- Fixed UnboundLocalError when a module in a group is not registered in quantizer.module_to_name (onecomp/qep/_quantize_with_qep_arch.py)
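The early-exit pattern behind StopForward can be shown with plain callables standing in for transformer blocks. This is a sketch under assumptions: run_model, capture_first_block_inputs, and the list-of-callables model are invented for illustration and do not reproduce the PyTorch code.

```python
class StopForward(Exception):
    """Raised to abort the forward pass once the needed inputs are captured."""


class Catcher:
    """Replaces the first block, records its inputs, then stops the forward pass."""

    def __init__(self, block):
        self.block = block
        self.inputs = []

    def __call__(self, hidden_states):
        self.inputs.append(hidden_states)
        raise StopForward


def run_model(blocks, hidden_states):
    """Toy forward pass: feed the output of each block into the next."""
    for block in blocks:
        hidden_states = block(hidden_states)
    return hidden_states


def capture_first_block_inputs(blocks, hidden_states):
    catcher = Catcher(blocks[0])
    blocks[0] = catcher
    try:
        run_model(blocks, hidden_states)
    except StopForward:
        pass  # expected: later blocks were never executed
    finally:
        blocks[0] = catcher.block  # always restore the original block
    return catcher.inputs


executed = []


def make_block(index):
    def block(x):
        executed.append(index)
        return x + 1
    return block


blocks = [make_block(i) for i in range(3)]
captured = capture_first_block_inputs(blocks, 10)
executed_during_capture = list(executed)
```

Raising inside the wrapper means the remaining blocks cost nothing during capture, and the finally clause leaves the model unchanged afterwards.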
[v0.3.6] 2026-03-12¶
Completion of Save/Load Pipeline¶
- Added new QuantizedModelLoader class (quantized_model_loader.py)
  - Automatically detects quantization config (GPTQ/DBF) from config.json and loads the model
  - Reads state_dict from safetensors, replaces layers with quantized layers, and loads into an empty model
  - Supports automatic device placement via accelerate
  - Top-level API: exported as onecomp.load_quantized_model()
- Added GPTQLinear.from_saved_state() (reconstructs layer from safetensors state_dict)
- Added DoubleBinaryLinear.from_saved_state() (same as above)
- Revised config.json output format to enable direct inference with vLLM
  - Added list of quantized layer names to modules_in_block_to_quantize
Forward Implementation for DoubleBinaryLinear and GPTQLinear¶
- GPTQLinear.forward(): Unpacks bit-packed weights → dequantizes → infers via F.linear() (fast path when using GemLite)
- DoubleBinaryLinear.forward(): Implements 5-stage pipeline (scaling0 → binary_B → scaling2 → binary_A → scaling4) (GemLite compatible)
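The unpack and dequantize steps can be sketched in plain Python for the 4-bit case. The packing layout (eight codes per 32-bit word, lowest nibble first) and the affine form scale * (code - zero_point) are assumptions chosen for illustration; the layout GPTQLinear actually uses is not specified in this changelog.

```python
def pack_int4(codes):
    """Pack 4-bit codes into 32-bit words, eight per word, lowest nibble first."""
    if len(codes) % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for start in range(0, len(codes), 8):
        word = 0
        for position, code in enumerate(codes[start:start + 8]):
            word |= (code & 0xF) << (4 * position)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: split each word back into eight 4-bit codes."""
    codes = []
    for word in words:
        for shift in range(0, 32, 4):
            codes.append((word >> shift) & 0xF)
    return codes


def dequantize(codes, scale, zero_point):
    """Affine dequantization: weight = scale * (code - zero_point)."""
    return [scale * (code - zero_point) for code in codes]


codes = [0, 1, 2, 3, 4, 5, 6, 7]
words = pack_int4(codes)
weights = dequantize(unpack_int4(words), scale=0.5, zero_point=4)
```

In the layer itself the resulting weight matrix would then be passed to F.linear() together with the input activations.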
Expansion of Unit Tests¶
- Added new common test base class BaseQuantizeSpec (test_module.py)
  - test_quantize_layer_returns: Validates type, shape, device, and dtype of quantization results (CPU/CUDA)
  - test_quantize_layer_reproducibility: Validates reproducibility with the same seed
  - test_parameters_boundary: Confirms correct behavior with boundary parameter values
  - test_parameters_abnormal_values_raise: Confirms exceptions are raised for abnormal parameters
  - test_cpu_gpu_output_match: Validates that CPU/GPU quantization results match
  - test_quantize_error: Validates quantization error is within tolerance on a 2-layer model
  - test_forward_error: Validates forward accuracy of inference layer (dequantized output vs inference layer output)
- Added dedicated test classes for GPTQ and DBF (test_gptq.py, test_dbf.py)
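The shared-base-class pattern described above might look roughly like the following. This is a sketch using the standard unittest module and a toy quantizer; the class names, the make_quantizer() hook, and the quantize_layer(weights, seed) signature are all hypothetical and are not taken from test_module.py.

```python
import random
import unittest


class ToyRoundingQuantizer:
    """Hypothetical stand-in quantizer: stochastic rounding to integers."""

    def quantize_layer(self, weights, seed):
        rng = random.Random(seed)
        # Round down, then round up with probability equal to the fraction.
        return [int(w // 1) + (rng.random() < (w % 1)) for w in weights]


class BaseQuantizeSpecSketch:
    """Mixin with shared checks; subclasses only supply make_quantizer()."""

    weights = [0.25, -1.5, 2.75, 3.0]

    def make_quantizer(self):
        raise NotImplementedError

    def test_quantize_layer_returns(self):
        result = self.make_quantizer().quantize_layer(self.weights, seed=0)
        self.assertEqual(len(result), len(self.weights))
        self.assertTrue(all(isinstance(q, int) for q in result))

    def test_quantize_layer_reproducibility(self):
        quantizer = self.make_quantizer()
        first = quantizer.quantize_layer(self.weights, seed=123)
        second = quantizer.quantize_layer(self.weights, seed=123)
        self.assertEqual(first, second)


class TestToyQuantizer(BaseQuantizeSpecSketch, unittest.TestCase):
    """A dedicated test class inherits every shared check from the mixin."""

    def make_quantizer(self):
        return ToyRoundingQuantizer()
```

Keeping the shared checks in a mixin that does not itself derive from unittest.TestCase means the test runner collects them only through concrete subclasses.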
Fixes to DBF and GPTQ Quantizers¶
- Added parameter validation mechanism via validate_params() during setup() for DBF and GPTQ
- Unified and revised dtype (FP16/INT32) and device (CPU) of quantization results
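A minimal sketch of validating parameters during setup() is shown below. The parameter names (wbits, groupsize), their accepted ranges, and the class itself are assumptions made for illustration; the real validate_params() checks for DBF and GPTQ are not listed in this changelog.

```python
from dataclasses import dataclass


@dataclass
class ToyQuantizerParams:
    """Hypothetical parameter set; names and ranges are illustrative only."""

    wbits: int = 4
    groupsize: int = 128  # -1 is assumed to mean "one group per row"

    def validate_params(self):
        """Collect every problem first, then raise a single ValueError."""
        errors = []
        is_plain_int = isinstance(self.wbits, int) and not isinstance(self.wbits, bool)
        if not is_plain_int or not 1 <= self.wbits <= 8:
            errors.append(f"wbits must be an integer in [1, 8], got {self.wbits!r}")
        if self.groupsize != -1 and self.groupsize < 1:
            errors.append(f"groupsize must be -1 or positive, got {self.groupsize!r}")
        if errors:
            raise ValueError("; ".join(errors))

    def setup(self):
        """Validate before any expensive work starts."""
        self.validate_params()
        return self


if __name__ == "__main__":
    print(ToyQuantizerParams().setup())
```

Running the checks in setup() surfaces a bad configuration before calibration data is loaded or any layer is quantized.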
Build System Updates¶
- Migrated package and project management to uv and pyproject.toml.
- Applied the black formatter to scripts.
QEP Module Refactoring¶
- Added QEPConfig dataclass (onecomp/qep/_qep_config.py)
- Extracted quantize_with_qep logic into standalone function (onecomp/qep/_quantize_with_qep.py)
- Added general flag to QEPConfig for dispatching between generic and architecture-aware implementations
- Added stub for architecture-aware QEP quantization (onecomp/qep/_quantize_with_qep_arch.py)
- Implemented architecture-aware QEP quantization with block-wise sequential pipeline (onecomp/qep/_quantize_with_qep_arch.py)
  - Added helper functions: _get_blocks, get_blocks_and_inputs, make_grouped_module, compute_hessian_and_crossterm, forward_input
  - Added Catcher class for capturing input activations of transformer blocks
  - Groups layers sharing the same input activations for efficient Hessian/cross-term computation
- Extended Quantizer.quantize_with_qep() and adjust_weight() to accept precomputed hessian and delta_hatX (onecomp/quantizer/_quantizer.py)
- Fixed _record_quantization_error to handle quant_input_activation=None for architecture-aware QEP (onecomp/quantizer/_quantizer.py)
- Fixed architecture-aware QEP to respect num_layers and layer selection by checking quantizer.module_to_name (onecomp/qep/_quantize_with_qep_arch.py)
- Fixed architecture-aware QEP to support exclude_layer_keywords: excluded layers are quantized without weight correction (onecomp/qep/_quantize_with_qep_arch.py)
- Added consistency test between generic and architecture-aware QEP implementations (tests/onecomp/test_qep_general_consistency.py)
- BREAKING: Changed QEPConfig.general default from True to False (architecture-aware implementation is now the default)
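The dispatch on the general flag and the grouping of layers by shared input can be sketched as follows. Only the flag name and its new default of False come from the entries above; the class name QEPConfigSketch, the planner functions, and the layer and activation names are hypothetical, and the sketch only plans the work rather than quantizing anything.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QEPConfigSketch:
    """Illustrative config; `general=False` selects the architecture-aware path."""

    general: bool = False


def plan_generic(layer_inputs):
    """Generic path: every layer is handled on its own."""
    return [[name] for name in layer_inputs]


def plan_architecture_aware(layer_inputs):
    """Architecture-aware path: layers fed by the same activation share a group."""
    groups = defaultdict(list)
    for name, input_key in layer_inputs.items():
        groups[input_key].append(name)
    return list(groups.values())


def plan_qep(layer_inputs, config=None):
    """Dispatch on config.general, defaulting to the architecture-aware path."""
    config = config if config is not None else QEPConfigSketch()
    planner = plan_generic if config.general else plan_architecture_aware
    return planner(layer_inputs)


# Hypothetical layer -> input-activation mapping for one attention block.
BLOCK_LAYERS = {
    "q_proj": "attn_in",
    "k_proj": "attn_in",
    "v_proj": "attn_in",
    "o_proj": "attn_out",
}

if __name__ == "__main__":
    print(plan_qep(BLOCK_LAYERS))
    print(plan_qep(BLOCK_LAYERS, QEPConfigSketch(general=True)))
```

Grouping this way lets statistics that depend only on the shared input be computed once per group instead of once per layer.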
GPTQ Refactoring (onecomp/quantizer/gptq/_gptq.py)¶
- BREAKING: Changed default sym from False to True (symmetric quantization) for both GPTQ class and run_gptq() function. Code relying on the previous asymmetric default must now explicitly pass sym=False.
- Expanded GPTQ class docstring with full attribute descriptions and usage examples
- Renamed H parameter to hessian in run_gptq() for clarity
- Renamed local variable W to matrix_W in run_gptq() for clarity
- Changed imports to from style (from torch import nn, from transformers import Conv1D)
- Refactored GPTQExcecutor.__init__: replaced register_buffer with explicit None initialization for all attributes
- Added docstrings to GPTQExcecutor.quantize(), enabled(), and ready() methods
- Updated test_gptq.py boundary/abnormal parameters to reflect new sym=True default
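To make the effect of the sym default concrete, the sketch below builds a uniform quantization grid in both modes. It is a generic illustration of symmetric versus asymmetric grids and does not reproduce run_gptq(); the function names and the exact scale and zero-point formulas are assumptions.

```python
def quant_grid(values, bits=4, sym=True):
    """Return (scale, zero) for a uniform grid with codes 0 .. 2**bits - 1."""
    maxq = 2 ** bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if sym:
        # Symmetric: centre the range on zero using the largest magnitude.
        bound = max(abs(lo), abs(hi))
        lo, hi = -bound, bound
    if hi == lo:
        lo, hi = -1.0, 1.0  # avoid a zero-width range for all-zero input
    scale = (hi - lo) / maxq
    zero = (maxq + 1) / 2 if sym else round(-lo / scale)
    return scale, zero


def fake_quantize(values, bits=4, sym=True):
    """Quantize to integer codes and map straight back to floats."""
    maxq = 2 ** bits - 1
    scale, zero = quant_grid(values, bits, sym)
    codes = [min(max(round(v / scale + zero), 0), maxq) for v in values]
    return [scale * (q - zero) for q in codes]


if __name__ == "__main__":
    one_sided = [0.0, 1.0, 2.0, 3.0]
    print(fake_quantize(one_sided, bits=2, sym=False))
    print(fake_quantize(one_sided, bits=2, sym=True))
```

On one-sided data such as the example input, the symmetric grid spends half of its codes on negative values that never occur, which is one reason results can change for code that relied on the old asymmetric default.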
[v0.3.5] 2026-03-05¶
- Based on v0.3.4 codebase
- Difference from v0.3.4: Changed comments to English