vLLM Inference¶
OneComp provides vLLM plugins for serving quantized models.
The plugins are automatically registered via Python entry points when onecomp is installed — no extra configuration is needed.
Supported Quantization Methods¶
| Plugin | `quant_method` | Description |
|---|---|---|
| DBF | `dbf` | 1-bit Double Binary Factorization. Uses GemLite kernels by default; set `ONECOMP_DBF_NAIVE_LINEAR=1` to use the naive fallback. |
| Mixed-GPTQ | `mixed_gptq` | Per-layer mixed-bitwidth GPTQ. Automatically dispatches to Marlin or Exllama kernels based on bit-width and symmetry. |
Rotation-preprocessed models are not supported
Models quantized after rotation preprocessing (prepare_rotated_model) cannot be served with vLLM. vLLM kernels do not apply the online Hadamard transform on down_proj inputs that rotation-preprocessed models require for correct inference.
Installation¶
vLLM is available as an optional dependency:
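A typical installation looks like the following sketch. It assumes the package is published as `onecomp` and exposes a `vllm` extra (the extra name follows the warning below); adjust the package source to your setup.

```bash
# pip users: install OneComp together with the vLLM extra
pip install "onecomp[vllm]"

# uv users: enable the extra through the lockfile (see the warning below)
uv sync --extra vllm
```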
Note
vLLM requires CUDA and a compatible GPU. See the vLLM documentation for detailed installation instructions and system requirements.
Warning
uv users: Do not install vLLM with uv pip install vllm. Packages installed via uv pip are not tracked by the lockfile and will be removed by subsequent uv sync or uv run commands. Always use --extra vllm instead.
Usage¶
1. Quantize and save a model with OneComp¶
```python
from onecomp import Runner, ModelConfig
from onecomp.quantizer.gptq import GPTQ

model_config = ModelConfig(model_id="meta-llama/Llama-3.1-8B-Instruct")
quantizer = GPTQ(wbits=4, groupsize=128)  # 4-bit GPTQ, group size 128

runner = Runner(model_config=model_config, quantizer=quantizer, qep=True)
runner.run()

# Writes the quantized weights plus a config.json that vLLM can load directly
runner.save_quantized_model("./Llama-3.1-8B-Instruct-gptq-4bit")
```
2. Serve with vLLM¶
There are two ways to use the quantized model with vLLM.
Option A: API Server (vllm serve)¶
Launch an OpenAI-compatible HTTP server:
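A minimal invocation might look like the sketch below, assuming the checkpoint from step 1 was saved to `./Llama-3.1-8B-Instruct-gptq-4bit`; the extra flags are optional vLLM settings, not OneComp requirements.

```bash
# Point vllm serve at the directory produced by save_quantized_model()
vllm serve ./Llama-3.1-8B-Instruct-gptq-4bit \
    --max-model-len 2048 \
    --dtype float16
```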
The server starts at http://localhost:8000 by default. You can send requests using curl:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./Llama-3.1-8B-Instruct-gptq-4bit",
    "messages": [{"role": "user", "content": "What is post-training quantization?"}],
    "max_tokens": 128
  }'
```
Or use the OpenAI Python client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="./Llama-3.1-8B-Instruct-gptq-4bit",
    messages=[{"role": "user", "content": "What is post-training quantization?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
Option B: Offline Inference (vllm.LLM)¶
For batch inference without launching a server:
```python
from vllm import LLM, SamplingParams

model_path = "./Llama-3.1-8B-Instruct-gptq-4bit"

llm = LLM(
    model=model_path,
    max_model_len=2048,
    dtype="float16",
    enforce_eager=True,  # skip CUDA graph capture; simpler startup for debugging
)

outputs = llm.generate(
    ["What is post-training quantization?"],
    SamplingParams(max_tokens=128, temperature=0.0),  # greedy decoding
)
print(outputs[0].outputs[0].text)
```
Tip
You do not need to pass quantization= explicitly. vLLM reads the quant_method from the model's config.json and automatically selects the correct OneComp plugin.
Warning
When combining quantization and vLLM inference in a single script, you must wrap your code in if __name__ == "__main__":. vLLM spawns worker processes that re-import the script, so without this guard the quantization step will run again in each child process.
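A minimal sketch of that structure, reusing the path and settings from the examples above (the quantization step is elided):

```python
from vllm import LLM, SamplingParams


def main():
    # Step 1: quantize and save with OneComp (omitted here; see step 1 above)
    model_path = "./Llama-3.1-8B-Instruct-gptq-4bit"

    # Step 2: offline inference with vLLM (as in Option B)
    llm = LLM(model=model_path, max_model_len=2048, dtype="float16", enforce_eager=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    # Without this guard, vLLM's spawned workers would re-import the module
    # and re-run the quantization step in every child process.
    main()
```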
A complete working example (quantization + vLLM inference) is available at
example/vllm_inference/example_gptq_vllm_inference.py.
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `ONECOMP_DBF_NAIVE_LINEAR` | `0` | Set to `1` to force the naive (non-GemLite) kernel for DBF inference. Useful for debugging or when GemLite is unavailable. |