Evaluation
Hydra-driven evaluation harness for models served through vLLM (onecomp-eval CLI).
For installation, MT-Bench data download, judge API keys, and CLI examples, see the Evaluation user guide.
Note: ModelConfig here is eval-specific.
onecomp.eval.ModelConfig holds only the model path/name for evaluation. It is not the same as the onecomp.ModelConfig used for quantization.
Package exports
eval
OneComp Evaluation Harness.
End-to-end evaluation pipeline:
- run_evaluate.main is the Hydra-driven entry point (CLI: onecomp-eval). It manages vLLM server lifecycle, dispatches each enabled evaluator as a subprocess, and aggregates the per-evaluator result files into a unified summary.
- Evaluators live under onecomp.eval.evals and each exposes a child run.py that the orchestrator launches via python -m. Result files conform to TaskResult.
Currently supported evaluators:
- mt_bench -- MT-Bench (default: English), full pipeline
Copyright 2025-2026 Fujitsu Ltd.
EvalConfig (dataclass)
EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')
Root configuration consumed by run_evaluate.main.
EvalsConfig (dataclass)
EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())
Container for all evaluator-specific configs.
Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.
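As a rough illustration of the extension step described above, the sketch below uses stand-in dataclasses rather than the real onecomp.eval classes. `PerplexityConfig` and the `enabled_evaluators` helper are hypothetical names introduced only for this example; the matching entry under evals in conf/eval_config.yaml is not shown.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class MtBenchConfig:  # stand-in: only one of the documented fields is kept
    enabled: bool = True


@dataclass
class ThroughputConfig:  # stand-in
    enabled: bool = False


@dataclass
class PerplexityConfig:  # hypothetical new evaluator, not part of OneComp
    enabled: bool = False
    subprocess_timeout_sec: int = 3600


@dataclass
class EvalsConfig:
    mt_bench: MtBenchConfig = field(default_factory=MtBenchConfig)
    throughput: ThroughputConfig = field(default_factory=ThroughputConfig)
    # The appended field for the new evaluator.
    perplexity: PerplexityConfig = field(default_factory=PerplexityConfig)


def enabled_evaluators(cfg: EvalsConfig) -> list:
    """Return the names of evaluators whose `enabled` flag is set."""
    return [name for name, sub in asdict(cfg).items() if sub.get("enabled")]


print(enabled_evaluators(EvalsConfig()))
```

With the defaults shown here, only mt_bench is reported as enabled; the new evaluator stays off until its flag is switched on.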
ModelConfig (dataclass)
Model under evaluation.
MtBenchConfig (dataclass)
MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)
MT-Bench evaluator settings (kept simple; categories are dataset-driven).
ThroughputConfig (dataclass)
ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)
vLLM serving throughput via Chat Completions (streaming).
Uses a fixed-length synthetic user prompt and measures TTFT, ITL (TPOT), TPS per user, TPS decode, and aggregate TPS/RPS over the measured window. Independent of MT-Bench max_new_tokens.
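The source does not spell out how each metric is computed, so the following is only a sketch under commonly used definitions: TTFT as the delay until the first streamed token, ITL as the mean gap between consecutive tokens, decode TPS over the window after the first token, and per-user TPS over the whole request. The function name and the returned keys are assumptions, not the harness's own.

```python
from statistics import mean


def stream_metrics(request_start: float, token_times: list) -> dict:
    """Per-request metrics from the arrival time (in seconds) of each streamed token."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute inter-token latency")
    ttft = token_times[0] - request_start
    # Gaps between consecutive token arrivals.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    total = token_times[-1] - request_start
    return {
        "ttft_sec": ttft,
        "itl_sec": mean(gaps),
        # Tokens produced after the first one, per second of decoding.
        "tps_decode": (len(token_times) - 1) / decode_window,
        # All tokens over the full request duration, as seen by one client.
        "tps_per_user": len(token_times) / total,
    }


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
metrics = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

For these illustrative timestamps, TTFT is 0.5 s, ITL is 0.25 s, and decode TPS is 4.0.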
InferenceConfig (dataclass)
InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())
vLLM OpenAI-compatible HTTP server launch parameters.
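To make the launch parameters more concrete, here is a minimal sketch of how such fields could map onto a `vllm serve` command line. It is not the harness's actual launcher: `build_vllm_command` and `pick_free_port` are hypothetical helpers, and reading `port=0` as "let the OS choose a free port" is an assumption the source does not confirm.

```python
import socket
from typing import Optional, Sequence


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_vllm_command(
    model: str,
    *,
    host: str = "127.0.0.1",
    port: int = 0,
    dtype: str = "auto",
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.85,
    max_model_len: int = 4096,
    quantization: Optional[str] = None,
    enforce_eager: bool = False,
    trust_remote_code: bool = True,
    api_key: str = "EMPTY",
    extra_args: Sequence[str] = (),
) -> list:
    if port == 0:  # assumption: 0 means "choose a free port"
        port = pick_free_port(host)
    cmd = [
        "vllm", "serve", model,
        "--host", host,
        "--port", str(port),
        "--dtype", dtype,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--api-key", api_key,
    ]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    if enforce_eager:
        cmd.append("--enforce-eager")
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    return cmd + list(extra_args)


# "my-org/my-model" is a placeholder model name.
print(build_vllm_command("my-org/my-model", port=8000))
```

Optional flags are only emitted when set, so the resulting command stays close to vLLM's own defaults.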
VllmServerConfig (dataclass)
VllmServerConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())
SummaryConfig (dataclass)
SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = (lambda: ['json', 'csv'])())
Aggregator settings.
include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".
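The effect of include can be sketched with plain dicts standing in for per-evaluator results. `select_for_summary` is a hypothetical helper, and applying the whitelist only to successful results is an assumption about behavior the source leaves open.

```python
from __future__ import annotations


def select_for_summary(results: list[dict], include: list[str] | None) -> list[dict]:
    """Pick the results that would be rolled into the summary."""
    successful = [r for r in results if r["status"] == "success"]
    if include is None:
        # None: keep every evaluator that ran successfully.
        return successful
    return [r for r in successful if r["eval_name"] in include]


results = [
    {"eval_name": "mt_bench", "status": "success"},
    {"eval_name": "throughput", "status": "failed"},
]
print([r["eval_name"] for r in select_for_summary(results, None)])
print([r["eval_name"] for r in select_for_summary(results, ["throughput"])])
```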
TaskResult (dataclass)
TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')
Per-evaluator result written by each subprocess.
Aggregator picks up files at
VllmServerManager
run_pipeline
Execute the configured evaluation pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | DictConfig | Resolved Hydra config (root EvalConfig). | required |
Returns:
| Type | Description |
|---|---|
| dict | The aggregator summary dict (also written to summary.json). |
aggregate_results
aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict
Collate per-evaluator results into a summary dict (and write files).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | Path | Run-level output directory. | required |
| results | Iterable[TaskResult] | Iterable of TaskResult. | required |
| include | list[str] \| None | Optional whitelist of eval_name to fold into the summary; None keeps every successful result. | None |
| formats | list[str] \| None | Output formats ("json", "csv"). Defaults to both. | None |
Returns:
| Type | Description |
|---|---|
| dict | The summary dict that is also written to disk. |
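A minimal sketch of the collation step, with plain dicts standing in for TaskResult objects. The `write_summary` name, the layout of the summary dict, the CSV columns, and the summary.csv file name are all assumptions made for illustration; only summary.json is named in the source.

```python
from __future__ import annotations

import csv
import json
import tempfile
from pathlib import Path


def write_summary(output_dir: Path, results: list[dict], formats: list[str] | None = None) -> dict:
    """Build a summary dict and write it in the requested formats."""
    formats = ["json", "csv"] if formats is None else formats  # None means both
    summary = {
        r["eval_name"]: {"status": r["status"], "scores": r["scores"]} for r in results
    }
    output_dir.mkdir(parents=True, exist_ok=True)
    if "json" in formats:
        (output_dir / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    if "csv" in formats:
        with (output_dir / "summary.csv").open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["eval_name", "metric", "value"])
            # One row per (evaluator, metric) pair.
            for r in results:
                for metric, value in r["scores"].items():
                    writer.writerow([r["eval_name"], metric, value])
    return summary


# The score below is a made-up placeholder, not a measured result.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "eval"
    summary = write_summary(out, [{"eval_name": "mt_bench", "status": "success", "scores": {"overall": 7.5}}])
    print(sorted(p.name for p in out.iterdir()))
```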
run_subprocess_eval
run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult
Run one evaluator as a subprocess and return its TaskResult.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| eval_name | str | Short identifier (e.g. "mt_bench"). Used for the per-eval output directory. | required |
| module | str | Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run"). | required |
| task_cfg | Any | Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works. | required |
| model_name | str | Model identifier (forwarded to the child for output file naming). | required |
| output_root | Path | Parent output directory; the child writes under | required |
| env_overrides | dict[str, str] \| None | Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint). | None |
| timeout_sec | int | Wall-clock limit. -1 disables. | 7200 |
Returns:
| Type | Description |
|---|---|
| TaskResult | TaskResult. status is either the child's reported value or "failed" if the child crashed. |
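The launch-and-report pattern can be sketched with the standard library alone. This is a simplification, not the harness's implementation: `run_module` is a hypothetical helper that derives the status from the exit code instead of reading the child's result file, and treating a timeout as "failed" is an assumption.

```python
import os
import subprocess
import sys
from typing import Optional


def run_module(module: str, *, env_overrides: Optional[dict] = None, timeout_sec: int = 7200) -> str:
    """Run `python -m <module>` as a child process and return a status string."""
    # Child environment: the parent's variables plus any overrides.
    env = {**os.environ, **(env_overrides or {})}
    # -1 disables the wall-clock limit, mirroring the documented parameter.
    timeout = None if timeout_sec == -1 else timeout_sec
    try:
        proc = subprocess.run(
            [sys.executable, "-m", module],
            env=env,
            timeout=timeout,
            capture_output=True,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return "failed"  # assumption: a timeout counts as a failure
    return "success" if proc.returncode == 0 else "failed"


# `platform` is a standard-library module that exits cleanly when run with -m.
print(run_module("platform"))
```

Running each evaluator in its own process keeps a crash in one evaluator from taking down the orchestrator or the other evaluators.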
Configuration schema
EvalConfig (dataclass)
EvalConfig(inference: InferenceConfig = InferenceConfig(), model: ModelConfig = ModelConfig(), evals: EvalsConfig = EvalsConfig(), summary: SummaryConfig = SummaryConfig(), output_dir: str = './output/eval', log_level: str = 'INFO')
Root configuration consumed by run_evaluate.main.
InferenceConfig (dataclass)
InferenceConfig(mode: str = 'vllm_server', dtype: str = 'auto', trust_remote_code: bool = True, request_timeout_sec: int = 600, host: str = '127.0.0.1', port: int = 0, api_key: str = 'EMPTY', tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.85, max_model_len: int = 4096, quantization: Optional[str] = None, enforce_eager: bool = False, startup_timeout_sec: int = 600, extra_args: list[str] = list())
vLLM OpenAI-compatible HTTP server launch parameters.
MtBenchConfig (dataclass)
MtBenchConfig(enabled: bool = True, data_dir: str = '', judge_model: str = 'gpt-4o-2024-08-06', judge_api_base: str = '', max_new_tokens: int = 1024, request_timeout_sec: int = 600, plot: bool = True, chart_path: str = '', subprocess_timeout_sec: int = 7200)
MT-Bench evaluator settings (kept simple; categories are dataset-driven).
ThroughputConfig (dataclass)
ThroughputConfig(enabled: bool = False, prompt_tokens: int = 512, max_tokens: int = 512, num_warmup: int = 2, num_trials: int = 5, temperature: float = 0.0, prompt_seed_text: str = 'This is a fixed prompt for throughput benchmarking. It compares decode performance of quantized models under the same conditions.', save_responses: bool = True, save_warmup_responses: bool = False, min_completion_tokens: int = 32, request_timeout_sec: int = 600, subprocess_timeout_sec: int = 1800)
vLLM serving throughput via Chat Completions (streaming).
Uses a fixed-length synthetic user prompt and measures TTFT, ITL (TPOT), TPS per user, TPS decode, and aggregate TPS/RPS over the measured window. Independent of MT-Bench max_new_tokens.
EvalsConfig (dataclass)
EvalsConfig(mt_bench: MtBenchConfig = MtBenchConfig(), throughput: ThroughputConfig = ThroughputConfig())
Container for all evaluator-specific configs.
Add new evaluators by appending a field here and an entry under evals in conf/eval_config.yaml.
SummaryConfig (dataclass)
SummaryConfig(include: Optional[list[str]] = None, formats: list[str] = (lambda: ['json', 'csv'])())
Aggregator settings.
include controls which evaluators are rolled into the final summary file; None means "include every evaluator that ran successfully".
TaskResult
Per-evaluator subprocess output written to <output_dir>/<eval_name>/result.json.
TaskResult (dataclass)
TaskResult(eval_name: str, status: Literal['success', 'failed', 'skipped'] = 'success', model: str = '', timestamp: str = '', scores: dict[str, Any] = dict(), artifacts: dict[str, str] = dict(), metadata: dict[str, Any] = dict(), error: str = '')
Per-evaluator result written by each subprocess.
Aggregator picks up files at
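A round trip through result.json might look like the sketch below. The dataclass is a stand-in that mirrors the documented signature rather than an import of the real class; `write_result` and `read_result` are hypothetical helpers, and the model name and score are made-up placeholders.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class TaskResult:  # stand-in mirroring the documented fields
    eval_name: str
    status: str = "success"  # one of "success", "failed", "skipped"
    model: str = ""
    timestamp: str = ""
    scores: Dict[str, Any] = field(default_factory=dict)
    artifacts: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    error: str = ""


def write_result(output_dir: Path, result: TaskResult) -> Path:
    """Write the result to <output_dir>/<eval_name>/result.json."""
    path = output_dir / result.eval_name / "result.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(result), indent=2), encoding="utf-8")
    return path


def read_result(path: Path) -> TaskResult:
    """Load a result file back into the dataclass."""
    return TaskResult(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    written = write_result(Path(tmp), TaskResult("mt_bench", model="my-model", scores={"overall": 7.5}))
    print(read_result(written).status)
```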
Orchestration
run_pipeline
Execute the configured evaluation pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cfg | DictConfig | Resolved Hydra config (root EvalConfig). | required |
Returns:
| Type | Description |
|---|---|
| dict | The aggregator summary dict (also written to summary.json). |
VllmServerManager
run_subprocess_eval
run_subprocess_eval(*, eval_name: str, module: str, task_cfg: Any, model_name: str, output_root: Path, env_overrides: dict[str, str] | None = None, timeout_sec: int = 7200) -> TaskResult
Run one evaluator as a subprocess and return its TaskResult.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| eval_name | str | Short identifier (e.g. "mt_bench"). Used for the per-eval output directory. | required |
| module | str | Dotted module path of the child entrypoint (e.g. "onecomp.eval.evals.mt_bench.run"). | required |
| task_cfg | Any | Subtree of the Hydra config for this evaluator. Any omegaconf.DictConfig or plain dataclass works. | required |
| model_name | str | Model identifier (forwarded to the child for output file naming). | required |
| output_root | Path | Parent output directory; the child writes under | required |
| env_overrides | dict[str, str] \| None | Additional environment variables passed to the child (e.g. OPENAI_BASE_URL for the vLLM HTTP endpoint). | None |
| timeout_sec | int | Wall-clock limit. -1 disables. | 7200 |
Returns:
| Type | Description |
|---|---|
| TaskResult | TaskResult. status is either the child's reported value or "failed" if the child crashed. |
aggregate_results
aggregate_results(*, output_dir: Path, results: Iterable[TaskResult], include: list[str] | None = None, formats: list[str] | None = None) -> dict
Collate per-evaluator results into a summary dict (and write files).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | Path | Run-level output directory. | required |
| results | Iterable[TaskResult] | Iterable of TaskResult. | required |
| include | list[str] \| None | Optional whitelist of eval_name to fold into the summary; None keeps every successful result. | None |
| formats | list[str] \| None | Output formats ("json", "csv"). Defaults to both. | None |
Returns:
| Type | Description |
|---|---|
| dict | The summary dict that is also written to disk. |