Configuration¶
This page describes all configurable components in Fujitsu One Compression (OneComp).
ModelConfig¶
ModelConfig wraps model loading and tokenizer initialization.
```python
from onecomp import ModelConfig

model_config = ModelConfig(
    model_id="meta-llama/Llama-2-7b-hf",
    dtype="float16",
    device="cuda:0",
)
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Hugging Face Hub model ID | `None` |
| `path` | `str` | Local path to model directory | `None` |
| `dtype` | `str` | Model precision (`"float16"`, `"float32"`) | `"float16"` |
| `device` | `str` | Device placement (`"cpu"`, `"cuda"`, `"auto"`) | `"auto"` |
Note

Provide exactly one of `model_id` or `path`. A `ValueError` is raised if neither is specified.
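The mutual-exclusion rule can be sketched as a standalone check (a hypothetical helper for illustration, not part of OneComp's public API; following "exactly one", it also rejects the case where both are given):

```python
def validate_model_source(model_id=None, path=None):
    """Enforce that exactly one of model_id / path is provided,
    mirroring the ModelConfig rule described above."""
    if (model_id is None) == (path is None):
        raise ValueError("Provide exactly one of model_id or path.")
    # Return whichever source was given.
    return model_id if model_id is not None else path
```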
Runner¶
Runner is the main entry point for quantization. It manages the full pipeline: loading the model, preparing calibration data, executing quantization, and providing evaluation utilities.
```python
from onecomp import Runner

runner = Runner(
    model_config=model_config,
    quantizer=quantizer,
    max_length=2048,
    num_calibration_samples=512,
    qep=False,
)
```
Core Parameters¶
| Parameter | Type | Description | Default |
|---|---|---|---|
| `model_config` | `ModelConfig` | Model and tokenizer configuration | — |
| `quantizer` | `Quantizer` | Quantization method | `None` |
| `quantizers` | `list[Quantizer]` | Multiple quantizers (for benchmarking) | `None` |
| `qep` | `bool` | Enable QEP | `False` |
| `qep_config` | `QEPConfig` | QEP configuration | `None` |
Calibration Parameters¶
| Parameter | Type | Description | Default |
|---|---|---|---|
| `calibration_dataset` | `Dataset` | Custom calibration dataset | `None` |
| `max_length` | `int` | Maximum input sequence length | `2048` |
| `num_calibration_samples` | `int` | Number of calibration samples | `512` |
| `calibration_strategy` | `str` | Strategy for preparing calibration inputs | `"drop_rand"` |
| `calibration_seed` | `int` | Random seed for calibration | `0` |
| `calibration_batch_size` | `int` | Batch size for chunked calibration | `None` |
| `num_layers_per_group` | `int` | Layers processed simultaneously in chunked mode | `7` |
Advanced Parameters¶
| Parameter | Type | Description | Default |
|---|---|---|---|
| `multi_gpu` | `bool` | Enable multi-GPU layer-wise parallel quantization | `False` |
| `gpu_ids` | `list[int]` | Specific GPU IDs to use | `None` |
Calibration Strategies¶
| Strategy | Description |
|---|---|
| `"drop_rand"` | Tokenize each document independently; take a random window of `max_length` tokens. |
| `"drop_head"` | Tokenize each document independently; always take the first `max_length` tokens. |
| `"concat_chunk"` | Concatenate all texts, tokenize, and split into fixed-length chunks. |
| `"concat_chunk_align"` | Same as `concat_chunk`, but adjusts samples so the chunk count equals `num_calibration_samples`. |
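To illustrate the difference between the window-based and concatenation-based strategies, here is a minimal sketch operating on pre-tokenized documents (plain Python for illustration; not OneComp's internal implementation):

```python
import random

def drop_head(docs, max_length):
    """One sample per document: its first max_length tokens."""
    return [doc[:max_length] for doc in docs]

def drop_rand(docs, max_length, seed=0):
    """One sample per document: a random max_length-token window."""
    rng = random.Random(seed)
    samples = []
    for doc in docs:
        start = rng.randint(0, max(0, len(doc) - max_length))
        samples.append(doc[start:start + max_length])
    return samples

def concat_chunk(docs, max_length):
    """Concatenate all documents into one stream, then split it
    into full fixed-length chunks (any tail remainder is dropped)."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + max_length]
            for i in range(0, len(stream) - max_length + 1, max_length)]
```

Note that the `drop_*` strategies yield at most one sample per document, while `concat_chunk` can yield more samples than documents when texts are long.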
Valid Parameter Combinations¶
| `quantizers` | `qep` | `multi_gpu` | `calibration_batch_size` |
|---|---|---|---|
| Specified | `False` | `False` | Specified |
| `None` | `True` | `False` | `None` |
| `None` | `False` | `True` | `None` |
| `None` | `False` | `False` | Specified |
| `None` | `False` | `False` | `None` |
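The table can be read as a set of mutual-exclusion rules; a hypothetical validation sketch (not OneComp's actual code) that accepts exactly the rows above might look like:

```python
def check_runner_args(quantizers=None, qep=False, multi_gpu=False,
                      calibration_batch_size=None):
    """Reject argument combinations outside the valid rows:
    qep and multi_gpu are mutually exclusive, and neither may be
    combined with multiple quantizers or chunked calibration."""
    if qep and multi_gpu:
        raise ValueError("qep and multi_gpu cannot both be enabled")
    if (qep or multi_gpu) and quantizers is not None:
        raise ValueError("quantizers requires qep=False and multi_gpu=False")
    if (qep or multi_gpu) and calibration_batch_size is not None:
        raise ValueError(
            "calibration_batch_size requires qep=False and multi_gpu=False")
```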
QEPConfig¶
QEPConfig controls Quantization Error Propagation behavior.
```python
from onecomp import QEPConfig

qep_config = QEPConfig(
    general=False,
    percdamp=0.01,
    perccorr=0.5,
    device="cuda:0",
    exclude_layer_keywords=["mlp.down_proj"],
)
```
| Parameter | Type | Description | Default |
|---|---|---|---|
| `general` | `bool` | Use generic (architecture-independent) QEP | `False` |
| `percdamp` | `float` | Damping percentage for Hessian regularization | `0.01` |
| `perccorr` | `float` | Correction percentage for error propagation | `0.5` |
| `device` | `str` | GPU device for QEP computations | `"cuda:0"` |
| `exclude_layer_keywords` | `list[str]` | Layer keywords excluded from error propagation | `["mlp.down_proj"]` |
Tip
The default `general=False` uses the architecture-aware implementation, which is faster because it exploits shared activations (e.g., QKV layers sharing the same input in Llama-like models).
Quantizer Common Parameters¶
All quantizers inherit from the Quantizer base class and share these parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Quantizer name (defaults to class name) | `None` |
| `num_layers` | `int` | Maximum layers to quantize | `None` |
| `calc_quant_error` | `bool` | Calculate quantization error per layer | `False` |
| `include_layer_names` | `list[str]` | Layers to quantize (exact match) | `None` |
| `exclude_layer_names` | `list[str]` | Layers to skip (exact match) | `["lm_head"]` |
| `include_layer_keywords` | `list[str]` | Quantize layers containing any keyword | `None` |
| `exclude_layer_keywords` | `list[str]` | Skip layers containing any keyword | `None` |
Layer Selection Priority¶
1. Filter by layer type (`target_layer_types`)
2. If `include_layer_names` is set, only include exact matches
3. If `include_layer_keywords` is set, only include layers containing any keyword
4. Exclude `exclude_layer_names` (exact match)
5. Exclude `exclude_layer_keywords` (keyword match)
6. Limit by `num_layers`
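The priority order above can be sketched as a plain filter chain over `(name, type)` pairs. This is an illustrative re-implementation, not OneComp's internal code:

```python
def select_layers(layers, target_layer_types=None,
                  include_layer_names=None, include_layer_keywords=None,
                  exclude_layer_names=("lm_head",),
                  exclude_layer_keywords=None, num_layers=None):
    """Apply the documented selection rules in priority order.
    `layers` is a list of (name, type_name) pairs."""
    # 1. Filter by layer type.
    sel = [(n, t) for n, t in layers
           if target_layer_types is None or t in target_layer_types]
    # 2. Include by exact name.
    if include_layer_names is not None:
        sel = [(n, t) for n, t in sel if n in include_layer_names]
    # 3. Include by keyword.
    if include_layer_keywords is not None:
        sel = [(n, t) for n, t in sel
               if any(k in n for k in include_layer_keywords)]
    # 4. Exclude by exact name.
    if exclude_layer_names:
        sel = [(n, t) for n, t in sel if n not in exclude_layer_names]
    # 5. Exclude by keyword.
    if exclude_layer_keywords:
        sel = [(n, t) for n, t in sel
               if not any(k in n for k in exclude_layer_keywords)]
    # 6. Limit the total count.
    return sel[:num_layers] if num_layers is not None else sel
```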