Evaluation¶

Hydra-driven evaluation harness for models served through vLLM (onecomp-eval CLI).

For installation, MT-Bench data download, judge API keys, and CLI examples, see the Evaluation user guide.

ModelConfig here is eval-specific

onecomp.eval.ModelConfig only holds the model path/name for evaluation. It is not the same as onecomp.ModelConfig used for quantization.

Package exports¶

eval ¶

OneComp Evaluation Harness.

End-to-end evaluation pipeline:

run_evaluate.main is the Hydra-driven entry point (CLI: onecomp-eval). It manages vLLM server lifecycle, dispatches each enabled evaluator as a subprocess, and aggregates the per-evaluator result files into a unified summary.
Evaluators live under onecomp.eval.evals and each exposes a child run.py that the orchestrator launches via python -m. Result files conform to TaskResult.

Currently supported evaluators:

mt_bench -- MT-Bench (default: English), full pipeline

EvalConfig `dataclass` ¶

EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')

Root configuration consumed by run_evaluate.main.

EvalsConfig `dataclass` ¶

EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())

Container for all evaluator-specific configs.

Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.

ModelConfig `dataclass` ¶

ModelConfig(path: str = '???', name: Optional[str] = None)

Model under evaluation.
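The default `'???'` for path is OmegaConf's marker for a mandatory value, so a run needs an explicit model path. A minimal sketch of that check, using a stand-in dataclass rather than the real onecomp.eval.ModelConfig; the names ModelConfigSketch and require_model_path are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

MISSING = "???"  # OmegaConf's marker for a mandatory value


@dataclass
class ModelConfigSketch:
    """Stand-in mirroring the documented fields; not the real class."""

    path: str = MISSING
    name: Optional[str] = None


def require_model_path(cfg: ModelConfigSketch) -> str:
    """Return the model path, or raise if it was never set (hypothetical helper)."""
    if not cfg.path or cfg.path == MISSING:
        raise ValueError("model.path is required")
    return cfg.path
```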

MtBenchConfig `dataclass` ¶

MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)

MT-Bench evaluator settings (kept simple; categories are dataset-driven).

ThroughputConfig `dataclass` ¶

ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)

vLLM serving throughput via Chat Completions (streaming).

Uses a fixed-length synthetic user prompt and measures TTFT, ITL (TPOT), TPS per user, TPS decode, and aggregate TPS/RPS over the measured window. Independent of MT-Bench max_new_tokens.
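The metrics named above can be derived from the arrival times of streamed tokens. The sketch below shows one common set of definitions; the exact formulas used by the harness are not stated here, so the function name, the dictionary keys, and the definitions themselves are assumptions:

```python
from statistics import mean


def streaming_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency and throughput figures from token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    # Time to first token: delay between sending the request and the first chunk.
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    total = token_times[-1] - request_start
    return {
        "ttft_sec": ttft,
        "itl_sec": mean(gaps),
        # Decode rate excludes the first token, which is dominated by prefill.
        "tps_decode": (len(token_times) - 1) / decode_window,
        # Per-user rate counts every token over the whole request.
        "tps_per_user": len(token_times) / total,
    }
```

Under these definitions, a short completion makes the averages noisy, which is one plausible reason for a floor such as min_completion_tokens.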

InferenceConfig `dataclass` ¶

InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

vLLM OpenAI-compatible HTTP server launch parameters.

VllmServerConfig `dataclass` ¶

VllmServerConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

Bases: InferenceConfig

Alias kept for the public API.

SummaryConfig `dataclass` ¶

SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = (lambda: ['json', 'csv'])())

Aggregator settings.

include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".

TaskResult `dataclass` ¶

TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')

Per-evaluator result written by each subprocess.

The aggregator picks up files at <output_dir>/<eval_name>/result.json that conform to this schema and rolls them into a summary.

VllmServerManager ¶

VllmServerManager(cfg: InferenceConfig, model_path: str | Path, log_dir: str | Path)

Context manager that owns a vLLM HTTP server lifecycle.

start ¶

start() -> None

Spawn the vLLM HTTP server and wait until /health is 200.

stop ¶

stop() -> None

Shut the server down gracefully, then force-kill if needed.
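The "wait until /health is 200" step in start can be pictured as a polling loop bounded by startup_timeout_sec. This is a sketch of the idea only, not the VllmServerManager implementation; http_health_ok, wait_until_healthy, and the poll interval are hypothetical:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_ok(url: str, timeout_sec: float = 2.0) -> bool:
    """Return True when a GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    startup_timeout_sec: float,
    poll_interval_sec: float = 1.0,
) -> None:
    """Poll `probe` until it succeeds or the startup deadline passes."""
    deadline = time.monotonic() + startup_timeout_sec
    while time.monotonic() < deadline:
        if probe():
            return
        time.sleep(poll_interval_sec)
    raise TimeoutError(f"server not healthy after {startup_timeout_sec}s")
```

A caller would pass something like `lambda: http_health_ok("http://127.0.0.1:8000/health")`, where port 8000 is only an illustrative value.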

run_pipeline ¶

run_pipeline(cfg: DictConfig) -> dict

Execute the configured evaluation pipeline.

Parameters:

Name	Type	Description	Default
`cfg`	`DictConfig`	Resolved Hydra config (root EvalConfig).	required

Returns:

Type	Description
`dict`	The aggregator summary dict (also written to summary.json).

aggregate_results ¶

aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict

Collate per-evaluator results into a summary dict (and write files).

Parameters:

Name	Type	Description	Default
`output_dir`	`Path`	Run-level output directory.	required
`results`	`Iterable[TaskResult]`	Iterable of TaskResult.	required
`include`	`list[str] \| None`	Optional whitelist of eval_name to fold into the summary; None keeps every successful result.	`None`
`formats`	`list[str] \| None`	Output formats ("json", "csv"). Defaults to both.	`None`

Returns:

Type	Description
`dict`	The summary dict that is also written to disk.
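The include and formats semantics can be illustrated with plain dicts standing in for TaskResult. This is a sketch under assumptions: the summary layout, the summary.csv file name, the CSV columns, and the choice to drop non-successful results even when they are whitelisted are all hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, Optional


def aggregate_sketch(
    output_dir: Path,
    results: Iterable[dict],
    include: Optional[list[str]] = None,
    formats: Optional[list[str]] = None,
) -> dict:
    """Fold successful results into one summary dict and write it to disk."""
    formats = ["json", "csv"] if formats is None else formats
    kept = [
        r for r in results
        if r["status"] == "success"
        and (include is None or r["eval_name"] in include)
    ]
    summary = {r["eval_name"]: r["scores"] for r in kept}
    output_dir.mkdir(parents=True, exist_ok=True)
    if "json" in formats:
        (output_dir / "summary.json").write_text(json.dumps(summary, indent=2))
    if "csv" in formats:
        # One row per (evaluator, metric) pair; the column names are assumptions.
        with open(output_dir / "summary.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["eval_name", "metric", "value"])
            for name, scores in summary.items():
                for metric, value in scores.items():
                    writer.writerow([name, metric, value])
    return summary
```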

run_subprocess_eval ¶

run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult

Run one evaluator as a subprocess and return its TaskResult.

Parameters:

Name	Type	Description	Default
`eval_name`	`str`	Short identifier (e.g. "mt_bench"). Used for the per-eval output directory.	required
`module`	`str`	Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run").	required
`task_cfg`	`Any`	Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works.	required
`model_name`	`str`	Model identifier (forwarded to the child for output file naming).	required
`output_root`	`Path`	Parent output directory; the child writes under <output_root>/<eval_name>/.	required
`env_overrides`	`dict[str, str] \| None`	Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint).	`None`
`timeout_sec`	`int`	Wall-clock limit. -1 disables.	`7200`

Returns:

Type	Description
`TaskResult`	The evaluator's TaskResult. status is either the child's reported value or "failed" if the child crashed.
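The contract described above (take the child's reported status, fall back to "failed" on a crash, treat -1 as "no timeout") can be sketched with the standard library. This is an illustration rather than the real run_subprocess_eval; run_child and its error messages are hypothetical, and a plain dict stands in for TaskResult:

```python
import json
import subprocess
import sys
from pathlib import Path


def run_child(argv: list[str], result_path: Path, timeout_sec: int = 7200) -> dict:
    """Run one child process and return its reported result, or a failed stub."""
    # -1 disables the wall-clock limit; subprocess expresses that as None.
    timeout = None if timeout_sec == -1 else timeout_sec
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"status": "failed", "error": f"timed out after {timeout_sec}s"}
    if proc.returncode != 0:
        detail = proc.stderr.strip() or f"exit code {proc.returncode}"
        return {"status": "failed", "error": detail}
    if not result_path.exists():
        return {"status": "failed", "error": "child wrote no result file"}
    # The child's own status ("success", "failed", or "skipped") is kept as-is.
    return json.loads(result_path.read_text())
```

For a real evaluator the argument list would look like `[sys.executable, "-m", "onecomp.eval.evals.mt_bench.run", ...]`; the arguments the child actually accepts are not documented here, so they are left out.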

Configuration schema¶

EvalConfig `dataclass` ¶

EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')

Root configuration consumed by run_evaluate.main.

InferenceConfig `dataclass` ¶

InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

vLLM OpenAI-compatible HTTP server launch parameters.

MtBenchConfig `dataclass` ¶

MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)

MT-Bench evaluator settings (kept simple; categories are dataset-driven).

ThroughputConfig `dataclass` ¶

ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)

vLLM serving throughput via Chat Completions (streaming).

Uses a fixed-length synthetic user prompt and measures TTFT, ITL (TPOT), TPS per user, TPS decode, and aggregate TPS/RPS over the measured window. Independent of MT-Bench max_new_tokens.

EvalsConfig `dataclass` ¶

EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())

Container for all evaluator-specific configs.

Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.

SummaryConfig `dataclass` ¶

SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = (lambda: ['json', 'csv'])())

Aggregator settings.

include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".

TaskResult¶

Per-evaluator subprocess output written to <output_dir>/<eval_name>/result.json.

TaskResult `dataclass` ¶

TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')

Per-evaluator result written by each subprocess.

The aggregator picks up files at <output_dir>/<eval_name>/result.json that conform to this schema and rolls them into a summary.

create `classmethod` ¶

create(eval_name: str, model: str, **kwargs: Any) -> 'TaskResult'

save ¶

save(path: str | Path) -> Path

load `classmethod` ¶

load(path: str | Path) -> 'TaskResult'
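The save and load signatures suggest a JSON round trip through result.json. The stand-in below mirrors the documented fields to show how that could work; it is a sketch, not the real TaskResult, and the JSON layout and the score values in the usage example are assumptions:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Union


@dataclass
class TaskResultSketch:
    """Stand-in with the documented TaskResult fields."""

    eval_name: str
    status: str = "success"
    model: str = ""
    timestamp: str = ""
    scores: dict[str, Any] = field(default_factory=dict)
    artifacts: dict[str, str] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)
    error: str = ""

    def save(self, path: Union[str, Path]) -> Path:
        """Write the result as JSON, creating parent directories as needed."""
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
        return path

    @classmethod
    def load(cls, path: Union[str, Path]) -> "TaskResultSketch":
        """Rebuild a result from a JSON file written by save."""
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

A child evaluator would then call something like `TaskResultSketch("mt_bench", scores={"overall": 7.5}).save(output_dir / "mt_bench" / "result.json")`, and the parent would read it back with load.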

Orchestration¶

run_pipeline ¶

run_pipeline(cfg: DictConfig) -> dict

Execute the configured evaluation pipeline.

Parameters:

Name	Type	Description	Default
`cfg`	`DictConfig`	Resolved Hydra config (root EvalConfig).	required

Returns:

Type	Description
`dict`	The aggregator summary dict (also written to summary.json).

VllmServerManager ¶

VllmServerManager(cfg: InferenceConfig, model_path: str | Path, log_dir: str | Path)

Context manager that owns a vLLM HTTP server lifecycle.

start ¶

start() -> None

Spawn the vLLM HTTP server and wait until /health is 200.

stop ¶

stop() -> None

Shut the server down gracefully, then force-kill if needed.

run_subprocess_eval ¶

run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult

Run one evaluator as a subprocess and return its TaskResult.

Parameters:

Name	Type	Description	Default
`eval_name`	`str`	Short identifier (e.g. "mt_bench"). Used for the per-eval output directory.	required
`module`	`str`	Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run").	required
`task_cfg`	`Any`	Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works.	required
`model_name`	`str`	Model identifier (forwarded to the child for output file naming).	required
`output_root`	`Path`	Parent output directory; the child writes under <output_root>/<eval_name>/.	required
`env_overrides`	`dict[str, str] \| None`	Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint).	`None`
`timeout_sec`	`int`	Wall-clock limit. -1 disables.	`7200`

Returns:

Type	Description
`TaskResult`	The evaluator's TaskResult. status is either the child's reported value or "failed" if the child crashed.

aggregate_results ¶

aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict

Collate per-evaluator results into a summary dict (and write files).

Parameters:

Name	Type	Description	Default
`output_dir`	`Path`	Run-level output directory.	required
`results`	`Iterable[TaskResult]`	Iterable of TaskResult.	required
`include`	`list[str] \| None`	Optional whitelist of eval_name to fold into the summary; None keeps every successful result.	`None`
`formats`	`list[str] \| None`	Output formats ("json", "csv"). Defaults to both.	`None`

Returns:

Type	Description
`dict`	The summary dict that is also written to disk.
