CLI Reference

OneComp provides the onecomp command for quantizing models directly from the terminal.

Installation

First, install PyTorch for your environment (see Installation for details). Then install OneComp:

pip install onecomp

Verify the installation:

onecomp --version

Alternatively, invoke it as a module:

python -m onecomp --version

Usage

onecomp [-h] [--wbits WBITS] [--total-vram-gb GB] [--groupsize GROUPSIZE]
        [--device DEVICE] [--no-qep] [--no-eval] [--eval-original]
        [--save-dir SAVE_DIR] [--version]
        model_id

Positional Arguments

| Argument   | Description                         |
|------------|-------------------------------------|
| `model_id` | Hugging Face model ID or local path |

Options

| Option                  | Default     | Description |
|-------------------------|-------------|-------------|
| `--wbits WBITS`         | None (auto) | Target bitwidth. When omitted, estimated from VRAM |
| `--total-vram-gb GB`    | None (auto) | VRAM budget in GB for bitwidth estimation. When omitted, detected from GPU |
| `--groupsize GROUPSIZE` | 128         | GPTQ group size (`-1` to disable grouping) |
| `--device DEVICE`       | `cuda:0`    | Device to place the model on |
| `--no-qep`              |             | Disable QEP (enabled by default) |
| `--no-eval`             |             | Skip perplexity and accuracy evaluation |
| `--eval-original`       |             | Also evaluate the original (unquantized) model |
| `--save-dir SAVE_DIR`   | auto        | Save directory (`auto` = derived from model name, `none` to skip saving) |
| `--version`             |             | Show version and exit |

Examples

Basic usage (AutoBit with VRAM auto-estimation)

Quantize with defaults (AutoBit mixed-precision + QEP, evaluate, auto-save):

onecomp TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

Specify VRAM budget

onecomp meta-llama/Llama-2-7b-hf --total-vram-gb 8

Fixed bitwidth (skip VRAM estimation)

onecomp meta-llama/Llama-2-7b-hf --wbits 4

3-bit quantization

onecomp meta-llama/Llama-2-7b-hf --wbits 3

Custom group size

onecomp meta-llama/Llama-2-7b-hf --wbits 4 --groupsize 64

Without QEP

onecomp meta-llama/Llama-2-7b-hf --no-qep

Skip evaluation (quantize and save only)

onecomp meta-llama/Llama-2-7b-hf --no-eval

Custom save directory

onecomp meta-llama/Llama-2-7b-hf --save-dir ./my_quantized_model

Skip saving

onecomp meta-llama/Llama-2-7b-hf --save-dir none

Evaluate original model too

onecomp meta-llama/Llama-2-7b-hf --eval-original

Use a specific GPU

onecomp meta-llama/Llama-2-7b-hf --device cuda:1

Default Behavior

When run with no options, the onecomp command:

  1. Loads the model and tokenizer from Hugging Face Hub
  2. Estimates the target bitwidth from available VRAM
  3. Quantizes with AutoBit (ILP-based mixed-precision) + QEP
  4. Evaluates perplexity (wikitext-2) and zero-shot accuracy
  5. Saves the quantized model to <model_name>-autobit-<X>bit/
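The VRAM-based estimate in step 2 can be pictured roughly as follows. This is a simplified illustration only: the `estimate_wbits` helper, the candidate bitwidths, and the runtime overhead factor are assumptions for the sketch, not OneComp's actual implementation.

```python
def estimate_wbits(num_params: float, vram_gb: float, overhead: float = 1.2) -> int:
    """Pick the largest candidate bitwidth whose quantized weights
    (plus an assumed runtime overhead factor) fit in the VRAM budget.

    Hypothetical sketch -- not OneComp's real estimator.
    """
    budget_bytes = vram_gb * 1024**3
    for bits in (8, 4, 3, 2):  # candidate bitwidths, highest first
        weight_bytes = num_params * bits / 8
        if weight_bytes * overhead <= budget_bytes:
            return bits
    raise ValueError("model does not fit in the VRAM budget at any candidate bitwidth")

# e.g. a 7B-parameter model with a 4 GB budget lands on 4-bit
print(estimate_wbits(7e9, 4.0))  # -> 4
```

The real estimator (AutoBit) additionally solves an ILP to assign mixed precisions per layer rather than one global bitwidth, as noted in step 3.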

Equivalent Python API

The CLI is a thin wrapper around Runner.auto_run. Every CLI invocation maps directly to the Python API:

onecomp meta-llama/Llama-2-7b-hf --wbits 3 --no-qep --save-dir ./output

is equivalent to:

from onecomp import Runner

Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    wbits=3,
    qep=False,
    save_dir="./output",
)