CLI Reference

OneComp provides the onecomp command for quantizing models directly from the terminal.

Installation

First, install PyTorch for your environment (see Installation for details). Then install OneComp:

pip install onecomp

Verify the installation:

onecomp --version

Alternatively, invoke it as a module:

python -m onecomp --version

Usage

onecomp [-h] [--wbits WBITS] [--total-vram-gb GB] [--groupsize GROUPSIZE]
        [--device DEVICE] [--no-qep] [--no-eval] [--eval-original]
        [--save-dir SAVE_DIR] [--version]
        model_id

Positional Arguments

| Argument   | Description                         |
|------------|-------------------------------------|
| `model_id` | Hugging Face model ID or local path |

Options

| Option                  | Default     | Description |
|-------------------------|-------------|-------------|
| `--wbits WBITS`         | None (auto) | Target bitwidth. When omitted, estimated from VRAM |
| `--total-vram-gb GB`    | None (auto) | VRAM budget in GB for bitwidth estimation. When omitted, detected from GPU |
| `--groupsize GROUPSIZE` | 128         | GPTQ group size (`-1` to disable grouping) |
| `--device DEVICE`       | `cuda:0`    | Device to place the model on |
| `--no-qep`              |             | Disable QEP (enabled by default) |
| `--no-eval`             |             | Skip perplexity and accuracy evaluation |
| `--eval-original`       |             | Also evaluate the original (unquantized) model |
| `--save-dir SAVE_DIR`   | auto        | Save directory (`auto` = derived from model name, `none` to skip saving) |
| `--version`             |             | Show version and exit |

Examples

Basic usage (AutoBit with VRAM auto-estimation)

Quantize with defaults (AutoBit mixed-precision + QEP, evaluate, auto-save):

onecomp TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

Specify VRAM budget

onecomp meta-llama/Llama-2-7b-hf --total-vram-gb 8

Fixed bitwidth (skip VRAM estimation)

onecomp meta-llama/Llama-2-7b-hf --wbits 4

3-bit quantization

onecomp meta-llama/Llama-2-7b-hf --wbits 3

Custom group size

onecomp meta-llama/Llama-2-7b-hf --wbits 4 --groupsize 64

Without QEP

onecomp meta-llama/Llama-2-7b-hf --no-qep

Skip evaluation (quantize and save only)

onecomp meta-llama/Llama-2-7b-hf --no-eval

Custom save directory

onecomp meta-llama/Llama-2-7b-hf --save-dir ./my_quantized_model

Skip saving

onecomp meta-llama/Llama-2-7b-hf --save-dir none

Evaluate original model too

onecomp meta-llama/Llama-2-7b-hf --eval-original

Use a specific GPU

onecomp meta-llama/Llama-2-7b-hf --device cuda:1

Default Behavior

When run with no options, the onecomp command:

  1. Loads the model and tokenizer from Hugging Face Hub
  2. Estimates the target bitwidth from available VRAM
  3. Quantizes with AutoBit (ILP-based mixed-precision) + QEP
  4. Evaluates perplexity (wikitext-2) and zero-shot accuracy
  5. Saves the quantized model to <model_name>-autobit-<X>bit/
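The VRAM-based estimate in step 2 can be pictured roughly as follows. This is a simplified illustration only: the `estimate_wbits` helper, the candidate bitwidths, and the runtime overhead factor are assumptions for the sketch, not OneComp's actual implementation.

```python
def estimate_wbits(num_params: float, vram_gb: float, overhead: float = 1.2) -> int:
    """Pick the largest candidate bitwidth whose quantized weights
    (plus an assumed runtime overhead factor) fit in the VRAM budget.

    Hypothetical sketch -- not OneComp's real estimator.
    """
    budget_bytes = vram_gb * 1024**3
    for bits in (8, 4, 3, 2):  # candidate bitwidths, highest first
        weight_bytes = num_params * bits / 8
        if weight_bytes * overhead <= budget_bytes:
            return bits
    raise ValueError("model does not fit in the VRAM budget at any candidate bitwidth")

# e.g. a 7B-parameter model with a 4 GB budget lands on 4-bit
print(estimate_wbits(7e9, 4.0))  # -> 4
```

The real estimator (AutoBit) additionally solves an ILP to assign mixed precisions per layer rather than one global bitwidth, as noted in step 3.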

Equivalent Python API

The CLI is a thin wrapper around Runner.auto_run. Every CLI invocation maps directly to the Python API:

onecomp meta-llama/Llama-2-7b-hf --wbits 3 --no-qep --save-dir ./output

is equivalent to:

from onecomp import Runner

Runner.auto_run(
    model_id="meta-llama/Llama-2-7b-hf",
    wbits=3,
    qep=False,
    save_dir="./output",
)