Experiments
AgentV eval files are the only runnable authoring artifact. Use top-level
experiment: inside eval.yaml for runtime choices: targets, workers,
timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.
name: support-regression
experiment: targets: [codex-gpt5, claude-sonnet] workers: 2 timeout_seconds: 720 repeat: count: 4 strategy: pass_at_k cost_limit_usd: 2.00
workspace: hooks: before_all: command: ["bash", "-lc", "bun install && bun run build"]
tests: - id: refund-eligibility input: Can this customer get a refund? criteria: Applies the refund policy correctlyexecution: is accepted only as a legacy top-level alias for existing eval
files. Do not use both experiment: and execution: in the same eval.
Tests Imports
Section titled “Tests Imports”Use tests[] for composition, imports, and selection.
tests: - include: evals/support/*.eval.yaml type: suite select: test_ids: - refund-* - missing-order-date tags: regression metadata: priority: high run: threshold: 1.0 repeat: count: 2 strategy: pass_all - include: cases/*.cases.yaml type: tests - include: cases/regression.jsonl type: tests - cases/smoke/*.cases.yamltype: suite preserves the imported suite’s task contract: metadata,
workspace, shared input, shared assertions, and tests. The parent eval
still owns the single run bundle. Runtime defaults from an imported suite apply
only where they can be scoped to that suite’s tests: threshold, repeat or
runs, timeout_seconds, and budget_usd. If the parent eval supplies one of
those defaults, the parent value wins for imported tests. Fields that cannot be
scoped inside one parent run, such as target, targets, workers,
workspace, agent, model, agent_options, and sandbox, must be supplied
by the parent experiment when importing the suite.
type: tests imports only raw test entries. It intentionally drops shared
context from an imported eval suite, so parent suite fields apply to those raw
cases.
tests[].select.test_ids filters imported test IDs with glob patterns.
tests[].select.tags filters each imported case’s effective metadata.tags.
Effective case tags are suite-first and deduped:
suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags
still remain suite identity metadata for discovery and reporting; selection reads
the merged case metadata view. tests[].select.metadata filters case metadata by
key/value, where selector values may be scalars or lists. Globbed include paths
are resolved in deterministic path order, then test order.
String-valued tests and string entries inside tests[] are raw-case import
shorthand. They are equivalent to include with type: tests and may point at
raw case files, directories, or globs. Importing another eval suite must use
object form with include: and type: suite.
Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does
not recursively load suite runtime blocks.
Imported suite rows keep their source suite metadata in index.jsonl. Use each
row’s result_dir as the authoritative path to generated artifacts inside the
run directory; do not infer layout from suite names.
Scoped Run Overrides
Section titled “Scoped Run Overrides”Use scoped run: blocks for result interpretation and scheduling policies that
vary by include group or test case. Precedence is:
test.run > tests[].run > parent experiment > imported suite experiment defaultsexperiment: target: agent threshold: 0.8 repeat: count: 3 strategy: pass_at_k
tests: - include: ./evals/flaky-agentic/**/*.eval.yaml type: suite select: tags: [agentic] run: repeat: count: 3 strategy: pass_at_k
- include: ./evals/regression/**/*.eval.yaml type: suite select: tags: [must-pass] run: threshold: 1.0 repeat: count: 2 strategy: pass_all
- id: critical-case input: "..." criteria: Must pass exactly run: threshold: 1.0 repeat: count: 1Scoped run: supports threshold, repeat, timeout_seconds, and
budget_usd. Candidate-changing fields such as target and targets stay
parent-level under experiment:. Workspace mutation belongs in
workspace.hooks, and runner-specific setup belongs in targets[].hooks.
Lifecycle Ownership
Section titled “Lifecycle Ownership”experiment: configures evaluation policy. It does not own commands that
prepare files, dependencies, repos, or target-specific runner state.
| Need | Put it in |
|---|---|
| Install dependencies, build the repo, seed files | workspace.hooks.before_all |
| Reset or apply per-case state | workspace.hooks.before_each / workspace.hooks.after_each |
| Configure an agent runner or provider variant | targets[].hooks |
| Choose targets, repeats, pass policy, budget, threshold | experiment |
workspace: hooks: before_all: command: ["bash", "-lc", "bun install && bun run build"]
targets: - name: agent-with-skills provider: codex hooks: before_each: command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
experiment: target: agent-with-skills repeat: count: 3 strategy: pass_at_kRepeat Runs
Section titled “Repeat Runs”repeat supports the same core strategies as repeated attempts:
experiment: repeat: count: 3 strategy: mean cost_limit_usd: 1.50Supported strategies:
| Strategy | Behavior |
|---|---|
pass_at_k | Uses the best passing attempt; early-exits by default unless early_exit: false is set |
pass_all | Uses the weakest attempt score, so every repeated attempt must meet the threshold |
mean | Aggregates repeated attempt scores by mean |
confidence_interval | Uses the lower bound of a 95% confidence interval as the conservative score |
AgentV also accepts runs and early_exit under experiment: as shorthand for
repeat-run policy:
experiment: runs: 4 early_exit: trueDo not set both repeat and runs in the same runtime block.
Result Layout
Section titled “Result Layout”Default eval runs write to:
.agentv/results/<eval-name>/<timestamp>/Imported source suite metadata appears in index.jsonl rows and manifests.
AgentV does not add a redundant suite directory when the result group is already
the eval name.