vLLM Inference¶
OneComp provides vLLM plugins for serving quantized models.
The plugins are automatically registered via Python entry points when onecomp is installed — no extra configuration is needed.
Supported Quantization Methods¶
| Plugin | quant_method |
Description |
|---|---|---|
| DBF | dbf |
1-bit Double Binary Factorization. Uses GemLite kernels by default; set ONECOMP_DBF_NAIVE_LINEAR=1 to use the naive fallback. |
| Mixed-GPTQ | mixed_gptq |
Per-layer mixed-bitwidth GPTQ. Automatically dispatches to Marlin or Exllama kernels based on bit-width and symmetry. |
Rotation-preprocessed models are not supported
Models quantized after rotation preprocessing (prepare_rotated_model) cannot be served with vLLM. vLLM kernels do not apply the online Hadamard transform on down_proj inputs that rotation-preprocessed models require for correct inference.
Installation¶
vLLM is available as an optional dependency:
Use cu130; older CUDA extras are rejected
Recent vLLM releases depend on torch>=2.10, whose wheels are only published for the cu130 PyTorch index. pyproject.toml therefore declares --extra vllm as conflicting with cpu, cu118, cu121, cu124, cu126, and cu128; combining any of those with --extra vllm will fail at lock time. Use --extra cu130 for vLLM workflows.
Note
vLLM requires CUDA and a compatible GPU. See the vLLM documentation for detailed installation instructions and system requirements.
Warning
uv users: Do not install vLLM with uv pip install vllm. Packages installed via uv pip are not tracked by the lockfile and will be removed by subsequent uv sync or uv run commands. Always use --extra vllm instead.
AutoBit + vLLM¶
When using AutoBitQuantizer with mixed-precision candidates (different wbits or groupsize),
the enable_fused_groups parameter must be True (the default since v0.5.1) to ensure vLLM compatibility.
vLLM fuses certain layers into a single linear module during inference:
- qkv_proj:
q_proj+k_proj+v_proj - gate_up_proj:
gate_proj+up_proj
A fused module can only have one quantization configuration (one bit-width, one group size).
When enable_fused_groups=True, the ILP solver constrains fused-layer constituents to share the same quantizer.
enable_fused_groups=False causes vLLM load failures
Setting enable_fused_groups=False allows the ILP to assign different quantizers
(different bits or group sizes) to layers within a fused group. The resulting model
will fail to load in vLLM with an error like:
"Detected some but not all shards of ... are quantized. All shards of fused layers to have the same precision."
Only set enable_fused_groups=False if you do not intend to serve the model with vLLM.
Runner.auto_run() always sets enable_fused_groups=True, so models quantized via auto_run or the CLI are always vLLM-compatible.
Usage¶
1. Quantize and save a model with OneComp¶
from onecomp import Runner, ModelConfig
from onecomp.quantizer.gptq import GPTQ
model_config = ModelConfig(model_id="meta-llama/Llama-3.1-8B-Instruct")
quantizer = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=model_config, quantizer=quantizer, qep=True)
runner.run()
runner.save_quantized_model("./Llama-3.1-8B-Instruct-gptq-4bit")
2. Serve with vLLM¶
There are two ways to use the quantized model with vLLM.
Option A: API Server (vllm serve)¶
Launch an OpenAI-compatible HTTP server:
The server starts at http://localhost:8000 by default. You can send requests using curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./Llama-3.1-8B-Instruct-gptq-4bit",
"messages": [{"role": "user", "content": "What is post-training quantization?"}],
"max_tokens": 128
}'
Or use the OpenAI Python client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
model="./Llama-3.1-8B-Instruct-gptq-4bit",
messages=[{"role": "user", "content": "What is post-training quantization?"}],
max_tokens=128,
)
print(response.choices[0].message.content)
Option B: Offline Inference (vllm.LLM)¶
For batch inference without launching a server:
from vllm import LLM, SamplingParams
model_path = "./Llama-3.1-8B-Instruct-gptq-4bit"
llm = LLM(
model=model_path,
max_model_len=2048,
dtype="float16",
enforce_eager=True,
)
outputs = llm.generate(
["What is post-training quantization?"],
SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
Tip
You do not need to pass quantization= explicitly. vLLM reads the quant_method from the model's config.json and automatically selects the correct OneComp plugin.
Warning
When combining quantization and vLLM inference in a single script, you must wrap your code in if __name__ == "__main__":. vLLM spawns worker processes that re-import the script, so without this guard the quantization step will run again in each child process.
A complete working example (quantization + vLLM inference) is available at
example/vllm_inference/example_gptq_vllm_inference.py.
3. Chat with Open WebUI (optional)¶
Open WebUI provides a ChatGPT-like browser interface. Because vLLM exposes an OpenAI-compatible API, Open WebUI can connect to it directly.
3-1. Start the vLLM server¶
Keep this terminal open. The server listens on http://localhost:8000 by default.
3-2. Launch Open WebUI¶
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
--name open-webui \
ghcr.io/open-webui/open-webui:latest
--add-host allows the container to reach the vLLM server running on the host (required on Linux; macOS/Windows Docker Desktop resolves it automatically).
Open WebUI requires Python 3.11 or 3.12 (3.13+ is not supported). To avoid dependency conflicts with OneComp/vLLM, create a separate virtual environment:
# Create a dedicated venv (uv auto-downloads Python 3.12 if needed)
uv venv ~/open-webui-env --python 3.12
source ~/open-webui-env/bin/activate
# Install and launch
uv pip install open-webui
open-webui serve --port 3000
When done, run deactivate to leave the venv.
To uninstall completely, remove the directory: rm -rf ~/open-webui-env.
Note
The first launch takes several minutes while Open WebUI runs database migrations and downloads an embedding model (~80 MB). Subsequent launches start in seconds.
3-3. Connect to vLLM¶
- Open
http://localhost:3000in your browser. - Create an admin account on first launch.
- Go to Admin Panel → Settings → Connections.
-
Under OpenAI API, set the URL:
Setting Value URL http://host.docker.internal:8000/v1(Docker) orhttp://localhost:8000/v1(pip)API Key dummy(any non-empty string) -
Click Save. The quantized model appears in the model selector.
3-4. Start chatting¶
Select the model from the dropdown at the top of the chat screen and start a conversation.
Tip
Open WebUI persists chat history, supports multiple conversations, and provides features like system prompt customization and temperature control out of the box.
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
ONECOMP_DBF_NAIVE_LINEAR |
0 |
Set to 1 to force the naive (non-GemLite) kernel for DBF inference. Useful for debugging or when GemLite is unavailable. |
Troubleshooting¶
RuntimeError: DeepGEMM backend is not available or outdated¶
vLLM unconditionally runs a DeepGEMM (FP8) kernel warmup at engine startup, even for non-FP8 quantization such as GPTQ, DBF, or Mixed-GPTQ. When the optional deep_gemm package is not installed, the warmup fails with:
RuntimeError: DeepGEMM backend is not available or outdated. Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.
OneComp-quantized models do not require DeepGEMM. Disable the FP8 kernel path before launching vLLM:
export VLLM_USE_DEEP_GEMM=0
export VLLM_DEEP_GEMM_WARMUP=skip
# Then launch vllm as usual
vllm serve ./your-quantized-model
# or
python your_vllm_script.py
Both variables are read directly by vLLM; OneComp does not interpret them.