Utilities¶
Logging¶
setup_logger ¶
Set up the logger.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `level` | `str` | Logging level for the 'onecomp' logger. | `'INFO'` |
| `basic` | `str` | Logging level for the root logger. | `'WARNING'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |
| `logfile` | | Name of the log file to write logs to. | *required* |
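The parameters above map onto the standard library's logging configuration. A minimal sketch of what a setup like this presumably does, using only the stdlib `logging` module (the real `setup_logger` lives in `onecomp` and may differ):

```python
import logging

def setup_logger(level="INFO", basic="WARNING", logfile=None, **kwargs):
    # Root logger gets the coarser 'basic' level via basicConfig.
    logging.basicConfig(level=basic)
    # The package logger gets its own, typically more verbose, level.
    logger = logging.getLogger("onecomp")
    logger.setLevel(level)
    if logfile is not None:
        # Optionally mirror log records to a file.
        logger.addHandler(logging.FileHandler(logfile))
    return logger
```

With this split, third-party libraries stay quiet at `WARNING` while `onecomp` itself can log at `DEBUG`.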
Perplexity Calculation¶
calculate_perplexity ¶
`calculate_perplexity(model=None, tokenizer=None, model_config=None, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048)`
Calculate perplexity of a Hugging Face Transformers model.
Based on https://huggingface.co/docs/transformers/perplexity
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | Dataset name (e.g. `"wikitext"`, `"allenai/c4"`). | `'wikitext'` |
| `dataset_config` | `str` | Dataset configuration. For WikiText: `"wikitext-2-raw-v1"`; for C4: `"en/c4-train.00001-of-01024.json.gz"` (treated as `data_files`). | `'wikitext-2-raw-v1'` |
| `split` | `str` | Dataset split (e.g. `"test"`, `"train"`, `"validation"`). | `'test'` |
| `max_samples` | `int` | Maximum number of samples to use. If None, all samples are used. 128 or 512 is recommended for C4. | `None` |
| `max_length` | `int` | Maximum length of the sliding window, aligned with the standard setting in quantization papers. If None, `model.config.max_position_embeddings` is used (following the Hugging Face official guide). | `2048` |
| `stride` | `int` | Stride of the sliding window. `stride < max_length` enables overlapping evaluation (yields lower PPL). If None, set to the same value as `max_length` (no overlap). The Hugging Face official guide uses `stride=512`. | `2048` |
Note
The Hugging Face official guide parameters (which yield a lower PPL) are `max_length=model.config.max_position_embeddings` and `stride=512`: https://huggingface.co/docs/transformers/perplexity
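The interaction of `max_length` and `stride` can be illustrated with the window/target bookkeeping from the Hugging Face guide (a schematic sketch of the loop bounds only; `trg_len` is the number of tokens newly scored in each window):

```python
def sliding_windows(seq_len, max_length, stride):
    # Yield (begin, end, trg_len) for each evaluation window.
    # Only the last trg_len tokens of a window contribute to the
    # loss; the preceding tokens serve as context.
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break
```

With `stride == max_length` (the default here) the windows tile the sequence with no overlap; with `stride < max_length` windows overlap, so every scored token gets more context and the measured PPL drops.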
Example

Evaluate on WikiText-2 (default: standard quantization-paper setting)¶

```python
ppl = calculate_perplexity(model=model, tokenizer=tokenizer)
```

Evaluate following the Hugging Face official guide¶

```python
ppl = calculate_perplexity(
    model=model,
    tokenizer=tokenizer,
    max_length=model.config.max_position_embeddings,
    stride=512,
)
```

Evaluate on C4 (using 128 samples)¶

```python
ppl = calculate_perplexity(
    model=model,
    tokenizer=tokenizer,
    dataset_name="allenai/c4",
    dataset_config="en/c4-train.00001-of-01024.json.gz",
    split="train",
    max_samples=128,
)
```
Accuracy Calculation¶
calculate_accuracy ¶
`calculate_accuracy(model=None, tokenizer=None, model_config=None, tasks=None, batch_size=8, num_fewshot=0, display_results=True)`
Calculate the accuracy of the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | | The model to evaluate. If None, `model_config` must be provided. | `None` |
| `tokenizer` | | The tokenizer to use. If None, `model_config` must be provided. | `None` |
| `model_config` | | The model configuration. Used if `model` or `tokenizer` is None. | `None` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `True` |
Example

```python
from onecomp import ModelConfig, calculate_accuracy

model_config = ModelConfig(model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
calculate_accuracy(model_config=model_config)
```

Or with model and tokenizer directly¶

```python
model = model_config.load_model()
tokenizer = model_config.load_tokenizer()
calculate_accuracy(model=model, tokenizer=tokenizer)
```
Calibration Dataset¶
prepare_calibration_dataset ¶
`prepare_calibration_dataset(tokenizer, device, calibration_dataset=None, max_length=512, num_calibration_samples=128, strategy='drop_rand', seed=0, logger=None)`
Prepare calibration data for quantization methods such as GPTQ.
Processing flow
- Obtain data source: use C4 if calibration_dataset is None.
- Chunk the data according to the chosen strategy.
strategy
- "concat_chunk": concatenate all texts -> tokenize at once -> split into equal-length chunks. Creates as many chunks as possible.
- "concat_chunk_align": same as "concat_chunk", but fixes the number of chunks to num_calibration_samples.
- "drop_head": no cross-document chunking; extract the first max_length tokens of each document. Documents shorter than max_length tokens are discarded.
- "drop_rand": no cross-document chunking; extract a random window of max_length tokens from each document. Documents shorter than max_length tokens are discarded.
About concat_chunk / concat_chunk_align:
- No padding needed: all chunks have the same length (max_length).
- Data efficiency: even short texts are fully utilized.
- Compute efficiency: batch processing is possible.
- Caveat: document boundaries become ambiguous (different documents may coexist in a single chunk).
- For GPTQ calibration the goal is to collect input-activation statistics (the Hessian matrix), so semantic coherence across sentences is not important; this method therefore works well in practice.
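The strategies can be sketched on plain lists of token ids (an illustration of the chunking logic only; the real implementation works on tokenizer output and also builds attention masks):

```python
import random

def concat_chunk(token_lists, max_length):
    # Concatenate all tokenized documents, then split into
    # equal-length chunks; the trailing remainder is dropped.
    flat = [t for doc in token_lists for t in doc]
    n = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n)]

def drop_rand(token_lists, max_length, seed=0):
    # Per-document random window; documents shorter than
    # max_length are discarded (no cross-document chunks).
    rng = random.Random(seed)
    chunks = []
    for doc in token_lists:
        if len(doc) < max_length:
            continue
        start = rng.randint(0, len(doc) - max_length)
        chunks.append(doc[start:start + max_length])
    return chunks
```

Note how `concat_chunk` keeps every token of a short document while `drop_rand` throws short documents away entirely, which is the data-efficiency trade-off described above.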
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | | Tokenizer. | *required* |
| `device` | `device` | Device to place tensors on (CPU or GPU). | *required* |
| `calibration_dataset` | `list` | List of texts for calibration. If omitted, the AllenAI C4 dataset is used. | `None` |
| `max_length` | `int` | Maximum length of each chunk. | `512` |
| `num_calibration_samples` | `int` | Number of samples when using the default dataset. | `128` |
| `strategy` | `str` | Calibration data preparation strategy (see above). | `'drop_rand'` |
| `seed` | `int` | Random seed for `strategy="drop_rand"`. | `0` |
| `logger` | | Logger (optional). | `None` |
Returns:

| Name | Type | Description |
|---|---|---|
| | `dict` | Model input dictionary. `"input_ids"`: tensor of shape (num_chunks, max_length); `"attention_mask"`: tensor of shape (num_chunks, max_length). |
VRAM Estimation¶
estimate_wbits_from_vram ¶
`estimate_wbits_from_vram(model_id: str, vram_ratio: float = 0.8, *, total_vram_gb: float = None, group_size: int = 128, wbits: int = 4, logger=None) -> VRAMBitwidthEstimation`
Lightweight VRAM-based bitwidth estimation from a model identifier.

Instantiates the model architecture on a meta device (no weight data, no GPU/CPU memory) to obtain accurate parameter counts, then delegates to `estimate_target_bitwidth`.

This is designed to be called before the full model is loaded, e.g. in `Runner.auto_run`, so that the estimated bitwidth can be used for output directory naming and passed directly to `AutoBitQuantizer(target_bit=...)`.
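The fields of `VRAMBitwidthEstimation` suggest the underlying arithmetic, which can be sketched as follows. This is an illustration of the presumed budget calculation, not the library's exact formula; `sketch_target_bitwidth` is a hypothetical helper:

```python
def sketch_target_bitwidth(total_vram_gb, vram_ratio,
                           non_quant_weight_gb, quantizable_params,
                           meta_bits_per_param):
    # Budget: the fraction of total VRAM the weights may occupy.
    budget_gb = total_vram_gb * vram_ratio
    # Subtract weights that stay unquantized (embeddings, norms, ...).
    available_for_quant_gb = budget_gb - non_quant_weight_gb
    # Convert the remaining budget to bits per quantizable parameter,
    # then subtract per-parameter metadata overhead (scales and
    # zero-points, amortized over group_size).
    bits = available_for_quant_gb * (1024 ** 3) * 8 / quantizable_params
    return bits - meta_bits_per_param
```

For example, an 8 GiB effective budget spread over 8 Gi quantizable parameters gives 8 bits per weight before metadata overhead is subtracted.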
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Hugging Face model ID or local path. | *required* |
| `vram_ratio` | `float` | Fraction of total VRAM to use (0.0–1.0). | `0.8` |
| `total_vram_gb` | `float` | Override GPU VRAM in GB (reads from CUDA if None). | `None` |
| `group_size` | `int` | Quantization group size for metadata calculation. | `128` |
| `wbits` | `int` | Representative bit-width for zero-point metadata estimation. | `4` |
| `logger` | | Optional logger for diagnostics. | `None` |
Returns:

| Type | Description |
|---|---|
| `VRAMBitwidthEstimation` | Estimation result; `target_bitwidth` holds the raw bpw value (suitable for display and for passing as `AutoBitQuantizer(target_bit=...)`). |
VRAMBitwidthEstimation
dataclass
¶
`VRAMBitwidthEstimation(target_bitwidth: float, total_vram_gb: float, budget_gb: float, non_quant_weight_gb: float, available_for_quant_gb: float, total_params: int, quantizable_params: int, meta_bits_per_param: float)`
Result of VRAM-based bitwidth estimation.