
Utilities

Logging

setup_logger

setup_logger(level: str = 'INFO', basic: str = 'WARNING', filemode: str = 'w', **kwargs)

Set up the logger.

Parameters:

  • level (str, default 'INFO'): Logging level for the 'onecomp' logger.
  • basic (str, default 'WARNING'): Logging level for the root logger used in basicConfig.
  • **kwargs (default {}): Additional keyword arguments.
  • logfile (required, via **kwargs): Name of the log file to write logs to.
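The two-tier split above (a verbose package logger on top of a quieter root logger) can be sketched with the standard logging module. The helper below is an assumption for illustration, not onecomp's actual implementation:

```python
import logging

def setup_logger_sketch(level="INFO", basic="WARNING", filemode="w", logfile=None):
    """Sketch of a two-tier logging setup (hypothetical helper, not
    onecomp's actual code): the root logger, which carries third-party
    noise, runs at `basic`, while the package logger runs at `level`."""
    kwargs = {"level": basic}
    if logfile is not None:
        # basicConfig attaches a FileHandler when filename is given
        kwargs.update(filename=logfile, filemode=filemode)
    logging.basicConfig(**kwargs)
    logger = logging.getLogger("onecomp")
    logger.setLevel(level)
    return logger

logger = setup_logger_sketch(level="DEBUG")
```

With this arrangement, onecomp's own DEBUG messages are emitted while third-party libraries stay at WARNING.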

Perplexity Calculation

calculate_perplexity

calculate_perplexity(model=None, tokenizer=None, model_config=None, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048)

Calculate perplexity of a Hugging Face Transformers model.

Based on https://huggingface.co/docs/transformers/perplexity

Parameters:

  • dataset_name (str, default 'wikitext'): Dataset name (e.g. "wikitext", "allenai/c4").
  • dataset_config (str, default 'wikitext-2-raw-v1'): Dataset configuration. For WikiText: "wikitext-2-raw-v1"; for C4: "en/c4-train.00001-of-01024.json.gz" (treated as data_files).
  • split (str, default 'test'): Dataset split (e.g. "test", "train", "validation").
  • max_samples (int, default None): Maximum number of samples to use. If None, all samples are used; 128 or 512 is recommended for C4.
  • max_length (int, default 2048): Maximum length of the sliding window, aligned with the standard setting in quantization papers. If None, model.config.max_position_embeddings is used (following the Hugging Face official guide).
  • stride (int, default 2048): Stride of the sliding window. stride < max_length enables overlapping evaluation (yields lower PPL). If None, set to the same value as max_length (no overlap). The Hugging Face official guide uses stride=512.

Note

Hugging Face official guide parameters (yields lower PPL): max_length=model.config.max_position_embeddings, stride=512. See https://huggingface.co/docs/transformers/perplexity

Example

Evaluate on WikiText-2 (default: standard quantization-paper setting):

    ppl = calculate_perplexity(model=model, tokenizer=tokenizer)

Evaluate following the Hugging Face official guide:

    ppl = calculate_perplexity(
        model=model,
        tokenizer=tokenizer,
        max_length=model.config.max_position_embeddings,
        stride=512,
    )

Evaluate on C4 (using 128 samples):

    ppl = calculate_perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_name="allenai/c4",
        dataset_config="en/c4-train.00001-of-01024.json.gz",
        split="train",
        max_samples=128,
    )
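The interaction between max_length and stride can be illustrated with the window arithmetic alone. A minimal sketch in plain Python, mirroring the index logic of the Hugging Face perplexity guide (not the function's actual implementation):

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, target_len) for each evaluation window.
    target_len is the number of tokens actually scored in that window;
    with stride < max_length, the overlapping context tokens are masked
    out so every token is still scored exactly once."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # only the non-overlapping tail is scored
        windows.append((begin, end, target_len))
        prev_end = end
        if end == seq_len:
            break
    return windows

# stride == max_length: disjoint windows, no extra context for any token
print(sliding_windows(10, 4, 4))  # [(0, 4, 4), (4, 8, 4), (8, 10, 2)]
# stride < max_length: overlapping windows give each scored token more context,
# which is why smaller strides yield lower perplexity
print(sliding_windows(10, 4, 2))  # [(0, 4, 4), (2, 6, 2), (4, 8, 2), (6, 10, 2)]
```

In both cases the target lengths sum to seq_len, so the two settings differ only in how much left context each scored token sees, not in which tokens are scored.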

Accuracy Calculation

calculate_accuracy

calculate_accuracy(model=None, tokenizer=None, model_config=None, tasks=None, batch_size=8, num_fewshot=0, display_results=True)

Calculate the accuracy of the model.

Parameters:

  • model (default None): The model to evaluate. If None, model_config must be provided.
  • tokenizer (default None): The tokenizer to use. If None, model_config must be provided.
  • model_config (default None): The model configuration. Used if model or tokenizer is None.
  • tasks (list, default None): The list of tasks to evaluate. Default: ["arc_easy", "arc_challenge", "piqa", "winogrande"].
  • batch_size (int, default 8): The batch size for evaluation.
  • num_fewshot (int, default 0): The number of few-shot examples.
  • display_results (bool, default True): Whether to display the results.
Example

    from onecomp import ModelConfig, calculate_accuracy

    model_config = ModelConfig(model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
    calculate_accuracy(model_config=model_config)

Or with model and tokenizer directly:

    model = model_config.load_model()
    tokenizer = model_config.load_tokenizer()
    calculate_accuracy(model=model, tokenizer=tokenizer)
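Tasks such as arc_easy and piqa are multiple-choice benchmarks, conventionally scored by picking the answer choice to which the model assigns the highest log-likelihood. A toy sketch of that scoring rule (the multiple_choice_accuracy helper is hypothetical and purely illustrative; calculate_accuracy delegates the real scoring to its evaluation backend):

```python
def multiple_choice_accuracy(choice_logliks, gold):
    """Score multiple-choice questions: for each question, predict the
    answer choice with the highest log-likelihood and compare it to the
    gold answer index. Returns the fraction answered correctly."""
    correct = 0
    for lls, g in zip(choice_logliks, gold):
        pred = max(range(len(lls)), key=lambda i: lls[i])
        correct += pred == g
    return correct / len(gold)

# Three questions; one log-likelihood per answer choice, plus gold indices
lls = [[-1.2, -0.3, -2.0], [-0.5, -0.9], [-2.1, -0.4, -0.7, -1.0]]
acc = multiple_choice_accuracy(lls, gold=[1, 0, 2])
print(acc)  # 2 of 3 questions scored correctly
```

Note that choices can have different token lengths, so real harnesses often also report a length-normalized variant of this score.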

Calibration Dataset

prepare_calibration_dataset

prepare_calibration_dataset(tokenizer, device, calibration_dataset=None, max_length=512, num_calibration_samples=128, strategy='drop_rand', seed=0, logger=None)

Prepare calibration data for quantization methods such as GPTQ.

Processing flow
  1. Obtain data source: use C4 if calibration_dataset is None.
  2. Chunk the data according to the chosen strategy.
strategy
  • "concat_chunk": concatenate all texts -> tokenize at once -> equal-length chunks. Creates as many chunks as possible.
  • "concat_chunk_align": concatenate all texts -> tokenize at once -> equal-length chunks. Fixes the number of chunks to num_calibration_samples.
  • "drop_head": no cross-document, extract the first max_length tokens. Documents with token length < max_length are discarded.
  • "drop_rand": no cross-document, extract a random window of max_length tokens. Documents with token length < max_length are discarded.

About concat_chunk / concat_chunk_align:
  • No padding needed: all chunks have the same length (max_length).
  • Data efficiency: even short texts are fully utilized.
  • Compute efficiency: batch processing is possible.
  • Caveat: document boundaries become ambiguous (different documents may coexist in a single chunk).
  • For GPTQ calibration the goal is to collect input-activation statistics (the Hessian matrix), so semantic coherence across sentences is not important; these strategies therefore work well in practice.
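The "concat_chunk" flow can be sketched with a toy whitespace tokenizer (hypothetical helper; the real function tokenizes with the Hugging Face tokenizer and returns tensors):

```python
def concat_chunk(texts, tokenize, max_length):
    """Sketch of the "concat_chunk" strategy: concatenate all texts,
    tokenize once, and split into as many equal-length chunks as
    possible. A trailing remainder shorter than max_length is dropped."""
    ids = tokenize(" ".join(texts))    # tokenize the concatenation in one pass
    n_chunks = len(ids) // max_length  # as many full chunks as possible
    return [ids[i * max_length:(i + 1) * max_length] for i in range(n_chunks)]

# Toy whitespace "tokenizer" standing in for a real HF tokenizer
toy_tokenize = lambda s: s.split()
docs = ["a b c", "d e", "f g h i j"]
chunks = concat_chunk(docs, toy_tokenize, max_length=4)
print(chunks)  # [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]; remainder 'i j' dropped
```

Note how the first chunk mixes tokens from the first and second documents: that is the boundary-ambiguity caveat, harmless for activation-statistics collection.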

Parameters:

  • tokenizer (required): Tokenizer.
  • device (device, required): Device to place tensors on (CPU or GPU).
  • calibration_dataset (list, default None): List of texts for calibration. If omitted, the AllenAI C4 dataset is used.
  • max_length (int, default 512): Maximum length of each chunk.
  • num_calibration_samples (int, default 128): Number of samples when using the default dataset.
  • strategy (str, default 'drop_rand'): Calibration data preparation strategy (see above).
  • seed (int, default 0): Random seed for strategy="drop_rand".
  • logger (default None): Logger (optional).

Returns:

  • dict: Model input dictionary.
      • "input_ids": tensor of shape (num_chunks, max_length).
      • "attention_mask": tensor of shape (num_chunks, max_length).

VRAM Estimation

estimate_wbits_from_vram

estimate_wbits_from_vram(model_id: str, vram_ratio: float = 0.8, *, total_vram_gb: float = None, group_size: int = 128, wbits: int = 4, logger=None) -> VRAMBitwidthEstimation

Lightweight VRAM-based bitwidth estimation from a model identifier.

Instantiates the model architecture on a meta device (no weight data, no GPU/CPU memory) to obtain accurate parameter counts, then delegates to estimate_target_bitwidth.

This is designed to be called before the full model is loaded, e.g. in Runner.auto_run, so that the estimated bitwidth can be used for output directory naming and passed directly to AutoBitQuantizer(target_bit=...).

Parameters:

  • model_id (str, required): Hugging Face model ID or local path.
  • vram_ratio (float, default 0.8): Fraction of total VRAM to use (0.0–1.0).
  • total_vram_gb (float, default None): Override GPU VRAM in GB (reads from CUDA if None).
  • group_size (int, default 128): Quantization group size for metadata calculation.
  • wbits (int, default 4): Representative bit-width for zero-point metadata estimation.
  • logger (default None): Optional logger for diagnostics.

Returns:

  • VRAMBitwidthEstimation: use result.target_bitwidth as the raw bpw value (suitable for display and for passing as target_bit to AutoBitQuantizer).

VRAMBitwidthEstimation dataclass

VRAMBitwidthEstimation(target_bitwidth: float, total_vram_gb: float, budget_gb: float, non_quant_weight_gb: float, available_for_quant_gb: float, total_params: int, quantizable_params: int, meta_bits_per_param: float)

Result of VRAM-based bitwidth estimation.
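The dataclass fields above suggest the shape of the estimation arithmetic. A back-of-the-envelope sketch with hypothetical numbers (the estimate_bpw helper and its exact formula, including how metadata overhead is charged, are assumptions; the real estimate_target_bitwidth may differ):

```python
def estimate_bpw(total_vram_gb, vram_ratio, non_quant_weight_gb,
                 quantizable_params, meta_bits_per_param=0.25):
    """Back-of-the-envelope bits-per-weight estimate (hypothetical formula,
    not onecomp's actual implementation): take vram_ratio of total VRAM as
    the budget, subtract weights kept in full precision (embeddings, norms,
    ...), and convert the remaining byte budget into bits per quantizable
    parameter, charging a flat per-parameter metadata cost for group-wise
    scales and zero-points."""
    budget_gb = total_vram_gb * vram_ratio
    available_gb = budget_gb - non_quant_weight_gb  # left for quantized weights
    available_bits = available_gb * 1e9 * 8         # GB -> bits (1 GB = 1e9 bytes here)
    return available_bits / quantizable_params - meta_bits_per_param

# e.g. an 8 GB card at an 80% budget, 0.6 GB of full-precision weights,
# and a 7B-parameter quantizable body
bpw = estimate_bpw(8.0, 0.8, 0.6, 7_000_000_000)
print(f"{bpw:.2f} bits per weight")
```

A production estimator would additionally clamp the result to supported bit-widths and reserve headroom for activations and the KV cache; those details are out of scope for this sketch.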