macOS / Apple Silicon (MPS)¶
OneComp supports GPTQ quantization and Hugging Face generate() inference on macOS
with Apple Silicon GPUs via PyTorch MPS (device="mps").
vLLM serving and GemLite kernels require Linux with an NVIDIA GPU. On Mac, use Transformers inference after saving a quantized model.
Supported Features¶
With device="mps", calibration and model forward passes can use MPS, but several
quantization steps intentionally run on CPU for performance. See
MPS Device Placement (GPTQ vs QEP) for details.
| Feature | Supported | Where it runs on MPS |
|---|---|---|
| GPTQ quantization | Yes | Forwards on MPS; column-wise GPTQ loop (incl. inverse-Hessian Cholesky) on CPU |
| AutoBit (GPTQ-only candidates) | Yes | Same as GPTQ for per-layer quantization; bitwidth assignment (ILP) on CPU |
| QEP (Quantization Error Propagation) | Yes | Per-layer correction (e.g. weight @ delta_hatX) on MPS; Cholesky solve on CPU (once per layer) |
| Calibration forward passes | Yes | MPS |
| load_quantized_model + generate() | Yes | MPS (when available) |
| vLLM / GemLite serving | No | Linux + CUDA only |
| DBF, RTN, JointQ, and other quantizers | No | — |
| AutoBit DBF fallback | No | — |
| Multi-GPU quantization | No | — |
Installation¶
See Installation for
pip and uv setup. On macOS:
- pip users: install PyTorch from PyPI (default wheels include MPS).
- uv users: run uv sync --extra mps --extra dev (do not use CUDA extras or --extra cpu).
Verify MPS:
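A minimal sketch of one way to run this check, using PyTorch's public torch.backends.mps API. The helper name mps_status is illustrative and not part of OneComp; it reports a missing PyTorch install instead of raising.

```python
import importlib.util


def mps_status() -> str:
    """Return a short label describing MPS support in this environment."""
    # Illustrative helper (an assumption for this sketch, not a OneComp API).
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    if not torch.backends.mps.is_built():
        # This PyTorch build was compiled without MPS support.
        return "not-built"
    if not torch.backends.mps.is_available():
        # Built with MPS, but no usable MPS device was found on this machine.
        return "unavailable"
    return "available"


print(mps_status())
```

Quantizing with device="mps" presumes the result is "available"; any other label points to the PyTorch install rather than to OneComp.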
Quick Start¶
Runner.auto_run and the onecomp CLI default to device="cuda:0". On Mac you must
set device="mps" explicitly.
VRAM auto-detection (wbits=None without total_vram_gb) queries CUDA only. On MPS,
pass total_vram_gb with your unified memory budget (e.g. 16 or 32):
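One possible way to pick that budget, sketched under assumptions: unified memory is shared with the OS and other applications, so the helper below reserves only a fraction of physical RAM. The function name and the 0.75 default are hypothetical choices for illustration, not OneComp defaults.

```python
import os


def unified_memory_budget_gb(fraction: float = 0.75) -> int:
    """Return a whole-GB budget as a fraction of physical RAM (illustrative heuristic)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # SC_PHYS_PAGES and SC_PAGE_SIZE are available on macOS and Linux.
    total_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    total_gb = total_bytes / 1024**3
    # Never return less than 1 so the value stays usable as a budget.
    return max(1, int(total_gb * fraction))


budget = unified_memory_budget_gb()
print(f"total_vram_gb={budget}")
```

The resulting number is what would be supplied as total_vram_gb (or --total-vram-gb on the CLI) together with device="mps".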
For manual configuration:
from onecomp import ModelConfig, Runner, GPTQ, setup_logger

setup_logger()

# Select the Apple Silicon GPU explicitly; the default device is "cuda:0".
model_config = ModelConfig(
    model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    device="mps",
)

# 4-bit GPTQ with group size 128, with QEP enabled.
gptq = GPTQ(wbits=4, groupsize=128)
runner = Runner(model_config=model_config, quantizer=gptq, qep=True)
runner.run()
MPS Device Placement (GPTQ vs QEP)¶
With device="mps", calibration and model forward passes can run on the GPU.
The implementation splits work as follows:
- GPTQ (run_gptq): The Hessian and weights are moved to CPU for the full column-wise loop (including the inverse-Hessian Cholesky). If that loop stayed on MPS, quantize() would call maxq.item() once per column. Each call forces a host sync: the CPU waits for pending MPS ops and then reads a single scalar (it does not copy the full Hessian or weights on every column). That per-column sync is often several times slower than running the loop on CPU on Apple Silicon, so GPTQ stays on CPU. With mse=True, find_params also calls quantize() in a grid loop and benefits from the same CPU placement.
- QEP weight correction (adjust_weight, when QEP runs, typically with qep=True): Per-layer work stays on MPS (e.g. weight @ delta_hatX, diagonal damping). Only the Cholesky solve uses the CPU, via _safe_cholesky_and_solve (one solve per layer, not per column). The subsequent GPTQ step still uses the CPU path above.
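To make the "diagonal damping, then one Cholesky solve per layer" step concrete, here is a pure-Python sketch of a damped Cholesky solve on a tiny matrix. It is not OneComp's _safe_cholesky_and_solve, whose implementation is not shown here; the function name, the damp parameter, and scaling the damping by the mean diagonal are assumptions made for illustration.

```python
import math


def damped_cholesky_solve(h, b, damp=0.01):
    """Solve (H + damp * mean(diag(H)) * I) x = b with a Cholesky factorization."""
    n = len(h)
    # Diagonal damping: shift the diagonal to keep the matrix positive definite.
    shift = damp * sum(h[i][i] for i in range(n)) / n
    a = [[h[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]

    # Factor A = L L^T (L is lower triangular).
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite; increase damp")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]

    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]

    # Back substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


# One solve for a toy 2x2 "Hessian", standing in for the single per-layer solve.
print(damped_cholesky_solve([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0]))
```

The factorization is sequential and touches small slices at a time, which is the kind of work that gains little from a GPU; doing it once per layer on CPU keeps the transfer cost bounded.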
Inference with Transformers¶
After quantization, load the saved model and run text generation with Transformers.
load_quantized_model() places the model on MPS automatically when available
(via get_default_device()).
from onecomp import load_quantized_model
model, tokenizer = load_quantized_model("./output/quantized_model")
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
For high-throughput serving or Open WebUI integration, use vLLM on a Linux machine with an NVIDIA GPU. See the vLLM Inference guide.
Limitations and Validation¶
Runner.check() enforces MPS constraints before quantization:
- Only GPTQ quantizers are allowed (or AutoBitQuantizer whose candidates are all GPTQ).
- AutoBit DBF fallback is rejected when the target bitwidth would require DBF-only assignment.
- multi_gpu=True is not supported.
To avoid DBF fallback on MPS, either set an explicit wbits within the GPTQ candidate range
or ensure VRAM estimation yields a bitwidth that does not trigger DBF-only paths.
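The rules above can be modeled with a small standalone function. This is a hypothetical sketch and not Runner.check() itself: it represents quantizers by name strings (for AutoBit, pass the candidate names) and does not model the bitwidth-dependent DBF fallback case.

```python
def check_mps_constraints(device, quantizer_names, multi_gpu=False):
    """Return a list of violation messages; an empty list means the setup passes."""
    # Illustrative stand-in for the documented rules (not the OneComp implementation).
    if device != "mps":
        return []  # The MPS-specific constraints apply only when device is "mps".
    problems = []
    unsupported = sorted({name for name in quantizer_names if name != "GPTQ"})
    if unsupported:
        problems.append("unsupported on MPS: " + ", ".join(unsupported))
    if multi_gpu:
        problems.append("multi_gpu=True is not supported on MPS")
    return problems


print(check_mps_constraints("mps", ["GPTQ", "DBF"], multi_gpu=True))
```

Returning every violation at once, rather than stopping at the first, lets a configuration be fixed in a single pass.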
See Also¶
- Installation: macOS pip / uv setup
- CLI Reference: --device mps and --total-vram-gb
- Configuration: ModelConfig.device
- Examples: save / load workflow