Evaluation

Hydra-driven evaluation harness for models served through vLLM (onecomp-eval CLI).

For installation, MT-Bench data download, judge API keys, and CLI examples, see the Evaluation user guide.

ModelConfig here is eval-specific

onecomp.eval.ModelConfig only holds the model path/name for evaluation. It is not the same as onecomp.ModelConfig used for quantization.

Package exports

eval

OneComp Evaluation Harness.

End-to-end evaluation pipeline:

  • run_evaluate.main is the Hydra-driven entry point (CLI: onecomp-eval). It manages vLLM server lifecycle, dispatches each enabled evaluator as a subprocess, and aggregates the per-evaluator result files into a unified summary.
  • Evaluators live under onecomp.eval.evals and each exposes a child run.py that the orchestrator launches via python -m. Result files conform to TaskResult.
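The dispatch step described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: enabled_eval_modules is a hypothetical helper, the plain dict stands in for the evals subtree of the Hydra config, and the "<name>.run" module layout is inferred from the documented example onecomp.eval.evals.mt_bench.run.

```python
from typing import Any, Mapping

# Package prefix taken from the text above; the "<name>.run" layout follows
# the documented example "onecomp.eval.evals.mt_bench.run".
EVALS_PACKAGE = "onecomp.eval.evals"


def enabled_eval_modules(evals_cfg: Mapping[str, Mapping[str, Any]]) -> dict:
    """Map each enabled evaluator name to the module launched via `python -m`.

    Hypothetical helper: the real orchestrator may select evaluators differently.
    """
    modules = {}
    for name, task_cfg in evals_cfg.items():
        if task_cfg.get("enabled", False):
            modules[name] = f"{EVALS_PACKAGE}.{name}.run"
    return modules


# Stand-in for the `evals` subtree, using the documented default `enabled` flags.
evals_cfg = {
    "mt_bench": {"enabled": True},
    "throughput": {"enabled": False},
}
selected = enabled_eval_modules(evals_cfg)
print(selected)
```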

Currently supported evaluators:

  • mt_bench -- MT-Bench (default: English), full pipeline

Copyright 2025-2026 Fujitsu Ltd.

EvalConfig dataclass

EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')

Root configuration consumed by run_evaluate.main.

EvalsConfig dataclass

EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())

Container for all evaluator-specific configs.

Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.
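A minimal sketch of that extension step, using local stand-in dataclasses rather than the real onecomp.eval classes. PerplexityConfig and its dataset field are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MtBenchConfig:  # local stand-in, reduced to one field
    enabled: bool = True


@dataclass
class ThroughputConfig:  # local stand-in, reduced to one field
    enabled: bool = False


@dataclass
class PerplexityConfig:  # hypothetical new evaluator, not part of OneComp
    enabled: bool = False
    dataset: str = "wikitext"  # hypothetical setting


@dataclass
class EvalsConfig:
    mt_bench: MtBenchConfig = field(default_factory=MtBenchConfig)
    throughput: ThroughputConfig = field(default_factory=ThroughputConfig)
    # The appended field; a matching `perplexity:` entry would also go under
    # `evals` in conf/eval_config.yaml.
    perplexity: PerplexityConfig = field(default_factory=PerplexityConfig)


cfg = EvalsConfig()
evaluator_names = [f.name for f in fields(cfg)]
print(evaluator_names)
```

Using default_factory gives every EvalsConfig instance its own nested config objects instead of sharing one mutable default.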

ModelConfig dataclass

ModelConfig(path: str = '???', name: Optional[str] = None)

Model under evaluation. path defaults to '???', OmegaConf's marker for a mandatory value, so it must be supplied.

MtBenchConfig dataclass

MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)

MT-Bench evaluator settings (kept simple; categories are dataset-driven).

ThroughputConfig dataclass

ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)

vLLM serving throughput via Chat Completions (streaming).

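Uses a fixed-length synthetic user prompt and measures time to first token (TTFT), inter-token latency (ITL, also called TPOT), tokens per second (TPS) per user, decode TPS, and aggregate TPS and requests per second (RPS) over the measured window. Generation length is set here via max_tokens, independently of MT-Bench's max_new_tokens.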
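To make the per-request metrics concrete, here is a sketch that derives them from token arrival times. The formulas are common definitions and are assumptions here; the evaluator's exact definitions may differ. stream_metrics is a hypothetical helper.

```python
from statistics import mean


def stream_metrics(request_start: float, token_times: list) -> dict:
    """Derive latency and throughput metrics from one streamed response.

    token_times holds the arrival time (in seconds) of each generated token.
    Assumed definitions:
      TTFT          = first token arrival - request start
      ITL           = mean gap between consecutive tokens
      decode TPS    = (n - 1) / (last arrival - first arrival)
      TPS per user  = n / (last arrival - request start)
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    first, last = token_times[0], token_times[-1]
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    n = len(token_times)
    return {
        "ttft_sec": first - request_start,
        "itl_sec": mean(gaps),
        "tps_decode": (n - 1) / (last - first),
        "tps_per_user": n / (last - request_start),
    }


# Synthetic timings: first token after 0.5 s, then one token every 0.25 s.
metrics = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```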

InferenceConfig dataclass

InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

vLLM OpenAI-compatible HTTP server launch parameters.

VllmServerConfig dataclass

VllmServerConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

Bases: InferenceConfig

Alias kept for the public API.

SummaryConfig dataclass

SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = ['json', 'csv'])

Aggregator settings.

include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".

TaskResult dataclass

TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')

Per-evaluator result written by each subprocess.

The aggregator picks up files at <output_dir>/<eval_name>/result.json that conform to this schema and rolls them into a summary.

VllmServerManager

VllmServerManager(cfg: InferenceConfig, model_path: str | Path, log_dir: str | Path)

Context manager that owns a vLLM HTTP server lifecycle.

start

start() -> None

Spawn the vLLM HTTP server and wait until /health is 200.

stop

stop() -> None

Shut the server down gracefully, then force-kill if needed.

run_pipeline

run_pipeline(cfg: DictConfig) -> dict

Execute the configured evaluation pipeline.

Parameters:

  • cfg (DictConfig, required): Resolved Hydra config (root EvalConfig).

Returns:

  • dict: The aggregator summary dict (also written to summary.json).

aggregate_results

aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict

Collate per-evaluator results into a summary dict (and write files).

Parameters:

  • output_dir (Path, required): Run-level output directory.
  • results (Iterable[TaskResult], required): Iterable of TaskResult objects.
  • include (list[str] | None, default None): Optional whitelist of eval_name values to fold into the summary; None keeps every successful result.
  • formats (list[str] | None, default None): Output formats ("json", "csv"); None selects both.

Returns:

  • dict: The summary dict that is also written to disk.

run_subprocess_eval

run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult

Run one evaluator as a subprocess and return its TaskResult.

Parameters:

  • eval_name (str, required): Short identifier (e.g. "mt_bench"). Used for the per-eval output directory.
  • module (str, required): Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run").
  • task_cfg (Any, required): Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works.
  • model_name (str, required): Model identifier (forwarded to the child for output file naming).
  • output_root (Path, required): Parent output directory; the child writes under <output_root>/<eval_name>/.
  • env_overrides (dict[str, str] | None, default None): Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint).
  • timeout_sec (int, default 7200): Wall-clock limit in seconds; -1 disables it.

Returns:

  • TaskResult: The evaluator's result. status is either the child's reported value or "failed" if the child crashed.

Configuration schema

EvalConfig dataclass

EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')

Root configuration consumed by run_evaluate.main.

InferenceConfig dataclass

InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())

vLLM OpenAI-compatible HTTP server launch parameters.
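As an illustration of how these fields could map onto a server launch, the sketch below builds a `vllm serve` argument list. build_vllm_command is a hypothetical helper and the stand-in dataclass keeps only the launch-related fields; how the harness actually assembles its command, and how it resolves port = 0, is not documented here, so the port is passed in explicitly.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InferenceConfig:  # local stand-in with the documented defaults
    dtype: str = "auto"
    trust_remote_code: bool = True
    host: str = "127.0.0.1"
    api_key: str = "EMPTY"
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.85
    max_model_len: int = 4096
    quantization: Optional[str] = None
    enforce_eager: bool = False
    extra_args: list = field(default_factory=list)


def build_vllm_command(cfg: InferenceConfig, model_path: str, port: int) -> list:
    """Translate config fields into `vllm serve` arguments (hypothetical helper)."""
    cmd = [
        "vllm", "serve", model_path,
        "--host", cfg.host,
        "--port", str(port),
        "--api-key", cfg.api_key,
        "--dtype", cfg.dtype,
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
        "--max-model-len", str(cfg.max_model_len),
    ]
    # Boolean and optional fields only add a flag when they are set.
    if cfg.trust_remote_code:
        cmd.append("--trust-remote-code")
    if cfg.quantization is not None:
        cmd += ["--quantization", cfg.quantization]
    if cfg.enforce_eager:
        cmd.append("--enforce-eager")
    # extra_args are appended verbatim, after everything else.
    return cmd + list(cfg.extra_args)


# "/models/example" is a placeholder path, not a real checkpoint.
cmd = build_vllm_command(InferenceConfig(quantization="gptq"), "/models/example", port=8000)
print(" ".join(cmd))
```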

MtBenchConfig dataclass

MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)

MT-Bench evaluator settings (kept simple; categories are dataset-driven).

ThroughputConfig dataclass

ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)

vLLM serving throughput via Chat Completions (streaming).

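Uses a fixed-length synthetic user prompt and measures time to first token (TTFT), inter-token latency (ITL, also called TPOT), tokens per second (TPS) per user, decode TPS, and aggregate TPS and requests per second (RPS) over the measured window. Generation length is set here via max_tokens, independently of MT-Bench's max_new_tokens.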

EvalsConfig dataclass

EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())

Container for all evaluator-specific configs.

Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.

SummaryConfig dataclass

SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = ['json', 'csv'])

Aggregator settings.

include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".

TaskResult

Per-evaluator subprocess output written to <output_dir>/<eval_name>/result.json.

TaskResult dataclass

TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')

Per-evaluator result written by each subprocess.

The aggregator picks up files at <output_dir>/<eval_name>/result.json that conform to this schema and rolls them into a summary.

create classmethod

create(eval_name: str, model: str, **kwargs: Any) -> 'TaskResult'

save

save(path: str | Path) -> Path

load classmethod

load(path: str | Path) -> 'TaskResult'
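The three methods above are undocumented, so the following is only a plausible sketch of a JSON round trip using a local stand-in class with the documented fields. The method bodies are assumptions: in particular, that create() stamps a UTC timestamp and that save() writes indented JSON.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Union


@dataclass
class TaskResult:  # local stand-in mirroring the documented fields
    eval_name: str
    status: str = "success"  # one of "success", "failed", "skipped"
    model: str = ""
    timestamp: str = ""
    scores: dict = field(default_factory=dict)
    artifacts: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    error: str = ""

    @classmethod
    def create(cls, eval_name: str, model: str, **kwargs: Any) -> "TaskResult":
        # Assumption: fill in the current UTC time unless the caller gives one.
        kwargs.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
        return cls(eval_name=eval_name, model=model, **kwargs)

    def save(self, path: Union[str, Path]) -> Path:
        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")
        return path

    @classmethod
    def load(cls, path: Union[str, Path]) -> "TaskResult":
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    # The score is an illustrative value, not a measured result.
    original = TaskResult.create("mt_bench", "my-model", scores={"overall": 7.5})
    saved_path = original.save(Path(tmp) / "mt_bench" / "result.json")
    restored = TaskResult.load(saved_path)

print(restored == original)
```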

Orchestration

run_pipeline

run_pipeline(cfg: DictConfig) -> dict

Execute the configured evaluation pipeline.

Parameters:

  • cfg (DictConfig, required): Resolved Hydra config (root EvalConfig).

Returns:

  • dict: The aggregator summary dict (also written to summary.json).

VllmServerManager

VllmServerManager(cfg: InferenceConfig, model_path: str | Path, log_dir: str | Path)

Context manager that owns a vLLM HTTP server lifecycle.

start

start() -> None

Spawn the vLLM HTTP server and wait until /health is 200.
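The readiness wait can be pictured as a polling loop like the one below. It is a sketch using only the standard library; wait_for_health and its poll_interval_sec parameter are hypothetical and not part of VllmServerManager.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url: str, timeout_sec: float, poll_interval_sec: float = 0.5) -> bool:
    """Poll <base_url>/health until it answers HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet, or returned an error
        time.sleep(poll_interval_sec)
    return False
```

A caller would pass something like wait_for_health("http://127.0.0.1:8000", timeout_sec=600), where 600 matches the documented startup_timeout_sec default and the port is a placeholder.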

stop

stop() -> None

Shut the server down gracefully, then force-kill if needed.
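A graceful-then-forced shutdown is commonly written as terminate, wait, then kill. The sketch below shows that pattern on a stand-in child process; stop_process and grace_sec are hypothetical names, and the real stop() may use a different grace period or signal.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_sec: float = 10.0) -> int:
    """Terminate proc, escalating to kill() if it does not exit in time."""
    if proc.poll() is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_sec)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()
    return proc.returncode


# Stand-in for the server: a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child, grace_sec=5.0)
print(exit_code)
```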

run_subprocess_eval

run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult

Run one evaluator as a subprocess and return its TaskResult.

Parameters:

  • eval_name (str, required): Short identifier (e.g. "mt_bench"). Used for the per-eval output directory.
  • module (str, required): Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run").
  • task_cfg (Any, required): Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works.
  • model_name (str, required): Model identifier (forwarded to the child for output file naming).
  • output_root (Path, required): Parent output directory; the child writes under <output_root>/<eval_name>/.
  • env_overrides (dict[str, str] | None, default None): Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint).
  • timeout_sec (int, default 7200): Wall-clock limit in seconds; -1 disables it.

Returns:

  • TaskResult: The evaluator's result. status is either the child's reported value or "failed" if the child crashed.
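The launch-and-timeout behaviour can be sketched with subprocess.run. run_child_module is a hypothetical helper that only reports whether the child crashed or timed out; reading the child's result.json on a clean exit, and passing task_cfg and env_overrides, are left out.

```python
import subprocess
import sys
from typing import Optional


def run_child_module(module: str, args: list, timeout_sec: int = 7200) -> Optional[str]:
    """Run `python -m <module>`; return a failure reason, or None on a clean exit.

    On None, a caller would go on to read the status the child itself
    reported in its result file.
    """
    # -1 disables the wall-clock limit, as documented for timeout_sec.
    timeout = None if timeout_sec == -1 else timeout_sec
    try:
        proc = subprocess.run(
            [sys.executable, "-m", module, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"timed out after {timeout_sec}s"
    if proc.returncode != 0:
        return f"exit code {proc.returncode}: {proc.stderr.strip()[-500:]}"
    return None


# Standard-library modules stand in for an evaluator's run module.
clean = run_child_module("platform", [])
missing = run_child_module("no_such_module_for_demo", [])
print(clean, missing)
```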

aggregate_results

aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict

Collate per-evaluator results into a summary dict (and write files).

Parameters:

  • output_dir (Path, required): Run-level output directory.
  • results (Iterable[TaskResult], required): Iterable of TaskResult objects.
  • include (list[str] | None, default None): Optional whitelist of eval_name values to fold into the summary; None keeps every successful result.
  • formats (list[str] | None, default None): Output formats ("json", "csv"); None selects both.

Returns:

  • dict: The summary dict that is also written to disk.
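A rough sketch of the filtering and writing steps, using plain dicts in place of TaskResult. Several details are assumptions of this sketch rather than documented behaviour: the shape of the summary dict, the summary.csv file name and its columns, and that failed results are dropped even when listed in include.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def aggregate(output_dir: Path, results: Iterable[dict],
              include: Optional[list] = None,
              formats: Optional[list] = None) -> dict:
    """Fold successful results into one summary and write it in each format."""
    formats = ["json", "csv"] if formats is None else formats
    kept = [
        r for r in results
        if r["status"] == "success" and (include is None or r["eval_name"] in include)
    ]
    summary = {r["eval_name"]: r["scores"] for r in kept}  # assumed layout
    output_dir.mkdir(parents=True, exist_ok=True)
    if "json" in formats:
        (output_dir / "summary.json").write_text(
            json.dumps(summary, indent=2), encoding="utf-8"
        )
    if "csv" in formats:
        # File name and column layout are assumptions of this sketch.
        with open(output_dir / "summary.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["eval_name", "metric", "value"])
            for name, scores in summary.items():
                for metric, value in scores.items():
                    writer.writerow([name, metric, value])
    return summary


# Illustrative inputs; the score is not a measured result.
results = [
    {"eval_name": "mt_bench", "status": "success", "scores": {"overall": 7.5}},
    {"eval_name": "throughput", "status": "failed", "scores": {}},
]
with tempfile.TemporaryDirectory() as tmp:
    summary = aggregate(Path(tmp), results)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(summary, written)
```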