Utilities¶
Logging¶
setup_logger ¶
Set up the logger.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `level` | `str` | Logging level for the 'onecomp' logger. | `'INFO'` |
| `basic` | `str` | Logging level for the root logger. | `'WARNING'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |
| `logfile` | | Name of the log file to write logs to. | *required* |
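The parameters above map onto the standard library's logging configuration. A minimal sketch of what a setup like this presumably does, using only the stdlib `logging` module (the real `setup_logger` lives in `onecomp` and may differ):

```python
import logging

def setup_logger(level="INFO", basic="WARNING", logfile=None, **kwargs):
    # Root logger gets the coarser 'basic' level via basicConfig.
    logging.basicConfig(level=basic)
    # The package logger gets its own, typically more verbose, level.
    logger = logging.getLogger("onecomp")
    logger.setLevel(level)
    if logfile is not None:
        # Optionally mirror log records to a file.
        logger.addHandler(logging.FileHandler(logfile))
    return logger
```

With this split, third-party libraries stay quiet at `WARNING` while `onecomp` itself can log at `DEBUG`.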
Perplexity Calculation¶
calculate_perplexity ¶
`calculate_perplexity(model=None, tokenizer=None, model_config=None, dataset_name='wikitext', dataset_config='wikitext-2-raw-v1', split='test', max_samples=None, max_length=2048, stride=2048)`
Calculate perplexity of a Hugging Face Transformers model.
Based on https://huggingface.co/docs/transformers/perplexity
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | Dataset name (e.g. `"wikitext"`, `"allenai/c4"`). | `'wikitext'` |
| `dataset_config` | `str` | Dataset configuration. For WikiText: `"wikitext-2-raw-v1"`; for C4: `"en/c4-train.00001-of-01024.json.gz"` (treated as `data_files`). | `'wikitext-2-raw-v1'` |
| `split` | `str` | Dataset split (e.g. `"test"`, `"train"`, `"validation"`). | `'test'` |
| `max_samples` | `int` | Maximum number of samples to use. If None, all samples are used. 128 or 512 is recommended for C4. | `None` |
| `max_length` | `int` | Maximum length of the sliding window, aligned with the standard setting in quantization papers. If None, `model.config.max_position_embeddings` is used (following the Hugging Face official guide). | `2048` |
| `stride` | `int` | Stride of the sliding window. `stride < max_length` enables overlapping evaluation (yields lower PPL). If None, set to the same value as `max_length` (no overlap). The Hugging Face official guide uses `stride=512`. | `2048` |
Note
The Hugging Face official guide parameters (which yield a lower PPL) are `max_length=model.config.max_position_embeddings` and `stride=512`: https://huggingface.co/docs/transformers/perplexity
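The interaction of `max_length` and `stride` can be illustrated with the window/target bookkeeping from the Hugging Face guide (a schematic sketch of the loop bounds only; `trg_len` is the number of tokens newly scored in each window):

```python
def sliding_windows(seq_len, max_length, stride):
    # Yield (begin, end, trg_len) for each evaluation window.
    # Only the last trg_len tokens of a window contribute to the
    # loss; the preceding tokens serve as context.
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break
```

With `stride == max_length` (the default here) the windows tile the sequence with no overlap; with `stride < max_length` windows overlap, so every scored token gets more context and the measured PPL drops.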
Example

Evaluate on WikiText-2 (default: standard quantization-paper setting)¶

```python
ppl = calculate_perplexity(model=model, tokenizer=tokenizer)
```

Evaluate following the Hugging Face official guide¶

```python
ppl = calculate_perplexity(
    model=model,
    tokenizer=tokenizer,
    max_length=model.config.max_position_embeddings,
    stride=512,
)
```

Evaluate on C4 (using 128 samples)¶

```python
ppl = calculate_perplexity(
    model=model,
    tokenizer=tokenizer,
    dataset_name="allenai/c4",
    dataset_config="en/c4-train.00001-of-01024.json.gz",
    split="train",
    max_samples=128,
)
```
Accuracy Calculation¶
calculate_accuracy ¶
`calculate_accuracy(model=None, tokenizer=None, model_config=None, tasks=None, batch_size=8, num_fewshot=0, display_results=True)`
Calculate the accuracy of the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | | The model to evaluate. If None, `model_config` must be provided. | `None` |
| `tokenizer` | | The tokenizer to use. If None, `model_config` must be provided. | `None` |
| `model_config` | | The model configuration. Used if `model` or `tokenizer` is None. | `None` |
| `tasks` | `list` | The list of tasks to evaluate. Default: `["arc_easy", "arc_challenge", "piqa", "winogrande"]` | `None` |
| `batch_size` | `int` | The batch size for evaluation. | `8` |
| `num_fewshot` | `int` | The number of few-shot examples. | `0` |
| `display_results` | `bool` | Whether to display the results. | `True` |
Example

```python
from onecomp import ModelConfig, calculate_accuracy

model_config = ModelConfig(model_id="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
calculate_accuracy(model_config=model_config)
```

Or with model and tokenizer directly¶

```python
model = model_config.load_model()
tokenizer = model_config.load_tokenizer()
calculate_accuracy(model=model, tokenizer=tokenizer)
```
Calibration Dataset¶
prepare_calibration_dataset ¶
`prepare_calibration_dataset(tokenizer, device, calibration_dataset=None, max_length=512, num_calibration_samples=128, strategy='drop_rand', seed=0, logger=None)`
Prepare calibration data for quantization methods such as GPTQ.
Processing flow
- Obtain data source: use C4 if calibration_dataset is None.
- Chunk the data according to the chosen strategy.
strategy
- "concat_chunk": concatenate all texts -> tokenize at once -> split into equal-length chunks. Creates as many chunks as possible.
- "concat_chunk_align": same as "concat_chunk", but fixes the number of chunks to num_calibration_samples.
- "drop_head": no cross-document chunking; extract the first max_length tokens of each document. Documents shorter than max_length tokens are discarded.
- "drop_rand": no cross-document chunking; extract a random window of max_length tokens from each document. Documents shorter than max_length tokens are discarded.
About concat_chunk / concat_chunk_align:
- No padding needed: all chunks have the same length (max_length).
- Data efficiency: even short texts are fully utilized.
- Compute efficiency: batch processing is possible.
- Caveat: document boundaries become ambiguous (different documents may coexist in a single chunk).
- For GPTQ calibration the goal is to collect input-activation statistics (the Hessian matrix), so semantic coherence across sentences is not important; this method therefore works well in practice.
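The strategies can be sketched on plain lists of token ids (an illustration of the chunking logic only; the real implementation works on tokenizer output and also builds attention masks):

```python
import random

def concat_chunk(token_lists, max_length):
    # Concatenate all tokenized documents, then split into
    # equal-length chunks; the trailing remainder is dropped.
    flat = [t for doc in token_lists for t in doc]
    n = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n)]

def drop_rand(token_lists, max_length, seed=0):
    # Per-document random window; documents shorter than
    # max_length are discarded (no cross-document chunks).
    rng = random.Random(seed)
    chunks = []
    for doc in token_lists:
        if len(doc) < max_length:
            continue
        start = rng.randint(0, len(doc) - max_length)
        chunks.append(doc[start:start + max_length])
    return chunks
```

Note how `concat_chunk` keeps every token of a short document while `drop_rand` throws short documents away entirely, which is the data-efficiency trade-off described above.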
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | | Tokenizer. | *required* |
| `device` | `device` | Device to place tensors on (CPU or GPU). | *required* |
| `calibration_dataset` | `list` | List of texts for calibration. If omitted, the AllenAI C4 dataset is used. | `None` |
| `max_length` | `int` | Maximum length of each chunk. | `512` |
| `num_calibration_samples` | `int` | Number of samples when using the default dataset. | `128` |
| `strategy` | `str` | Calibration data preparation strategy (see above). | `'drop_rand'` |
| `seed` | `int` | Random seed for `strategy="drop_rand"`. | `0` |
| `logger` | | Logger (optional). | `None` |
Returns:

| Name | Type | Description |
|---|---|---|
| | `dict` | Model input dictionary. `"input_ids"`: tensor of shape (num_chunks, max_length); `"attention_mask"`: tensor of shape (num_chunks, max_length). |
VRAM Estimation¶
estimate_wbits_from_vram ¶
`estimate_wbits_from_vram(model_id: str, vram_ratio: float = 0.8, *, total_vram_gb: float = None, group_size: int = 128, wbits: int = 4, logger=None) -> VRAMBitwidthEstimation`
Lightweight VRAM-based bitwidth estimation from a model identifier.

Instantiates the model architecture on a meta device (no weight data, no GPU/CPU memory) to obtain accurate parameter counts, then delegates to `estimate_target_bitwidth`.

This is designed to be called before the full model is loaded, e.g. in `Runner.auto_run`, so that the estimated bitwidth can be used for output directory naming and passed directly to `AutoBitQuantizer(target_bit=...)`.
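The fields of `VRAMBitwidthEstimation` suggest the underlying arithmetic, which can be sketched as follows. This is an illustration of the presumed budget calculation, not the library's exact formula; `sketch_target_bitwidth` is a hypothetical helper:

```python
def sketch_target_bitwidth(total_vram_gb, vram_ratio,
                           non_quant_weight_gb, quantizable_params,
                           meta_bits_per_param):
    # Budget: the fraction of total VRAM the weights may occupy.
    budget_gb = total_vram_gb * vram_ratio
    # Subtract weights that stay unquantized (embeddings, norms, ...).
    available_for_quant_gb = budget_gb - non_quant_weight_gb
    # Convert the remaining budget to bits per quantizable parameter,
    # then subtract per-parameter metadata overhead (scales and
    # zero-points, amortized over group_size).
    bits = available_for_quant_gb * (1024 ** 3) * 8 / quantizable_params
    return bits - meta_bits_per_param
```

For example, an 8 GiB effective budget spread over 8 Gi quantizable parameters gives 8 bits per weight before metadata overhead is subtracted.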
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Hugging Face model ID or local path. | *required* |
| `vram_ratio` | `float` | Fraction of total VRAM to use (0.0–1.0). | `0.8` |
| `total_vram_gb` | `float` | Override GPU VRAM in GB (reads from CUDA if None). | `None` |
| `group_size` | `int` | Quantization group size for metadata calculation. | `128` |
| `wbits` | `int` | Representative bit-width for zero-point metadata estimation. | `4` |
| `logger` | | Optional logger for diagnostics. | `None` |
Returns:

| Type | Description |
|---|---|
| `VRAMBitwidthEstimation` | Estimation result; `target_bitwidth` holds the raw bpw value (suitable for display and for passing as `AutoBitQuantizer(target_bit=...)`). |
VRAMBitwidthEstimation
dataclass
¶
`VRAMBitwidthEstimation(target_bitwidth: float, total_vram_gb: float, budget_gb: float, non_quant_weight_gb: float, available_for_quant_gb: float, total_params: int, quantizable_params: int, meta_bits_per_param: float)`
Result of VRAM-based bitwidth estimation.