Research-grade benchmark and verification platform for LLM agents, RAG systems, and tool-using workflows.
OpenAgentBench evaluates agents as stateful control systems, not as transcript generators. It scores final outcomes, environment-state correctness, tool-selection optimality, privilege safety, memory hygiene, grounding faithfulness, recovery behavior, multi-agent coordination quality, and efficiency under failure and adversarial conditions.
- Overview
- Why OpenAgentBench
- What Makes It Different
- Core Capabilities
- Evaluation Dimensions
- System Architecture
- Protocol Stack
- Key Design Decisions
- Execution Model
- Retrieval and Evidence Model
- Memory Model
- Tool Governance Model
- Repository Layout
- BenchSpec Contract
- Quick Start
- Example Scenario
- Run Outputs
- Scoring Model
- Reliability, Safety, and Performance
- CI/CD and Regression Workflow
- Roadmap
- Contributing
- License
OpenAgentBench is an open benchmark and verification stack for evaluating:
- LLM agents
- RAG pipelines
- tool-calling systems
- browser and desktop agents
- code agents
- multi-agent orchestration systems
The project is designed to fill the control-plane evaluation gap in modern agentic systems. Most current stacks score outputs, provide observability, or offer custom evaluators. OpenAgentBench goes further: it verifies whether the agent changed the environment state correctly, chose the right tool, respected privilege boundaries, maintained clean memory, recovered safely from failures, and remained grounded in admissible evidence.
OpenAgentBench is not a prompt library and not a generic agent framework. It is a benchmark, verification, and failure-injection platform for testing agentic systems rigorously under typed, reproducible contracts.
The hardest agent failures are increasingly control failures rather than answer-only failures:
- choosing the wrong tool despite having a better admissible option
- mutating the environment incorrectly while producing a plausible transcript
- escalating privileges unnecessarily
- leaking or reusing stale memory across sessions
- answering correctly but without evidentiary grounding
- failing to recover after timeouts, malformed tool responses, or partial outages
- collapsing under multi-agent coordination complexity
- accepting adversarial tool descriptions or prompt injections
Existing evaluation stacks cover portions of this problem space. Very few unify the following into a single open system:
- outcome-state grading
- tool-governance evaluation
- memory contamination testing
- security adversary packs
- chaos engineering for agents
- multi-agent coordination scoring
- cost/latency/tool-use Pareto analysis
OpenAgentBench is designed to become the open standard for that missing layer.
| Capability | Conventional LLM Evals | OpenAgentBench |
|---|---|---|
| Final-answer scoring | Yes | Yes |
| Transcript rubric evaluation | Yes | Yes |
| Real environment-state verification | Limited | Native |
| Tool-selection optimality scoring | Rare | Native |
| Privilege misuse and escalation checks | Rare | Native |
| Memory contamination benchmarking | Emerging | Native |
| RAG faithfulness vs answer correctness separation | Partial | Native |
| Failure injection / chaos testing | Ad hoc | Native |
| Multi-agent delegation and merge scoring | Rare | Native |
| Security adversary suites for agent control flows | Rare | Native |
| Pareto reporting across quality, cost, and latency | Partial | Native |
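The Pareto reporting row can be made concrete with a small example. The following is a minimal sketch, not OpenAgentBench's actual evaluator: the `RunResult` fields, the `dominates` and `pareto_frontier` helpers, and the sample numbers are all hypothetical and chosen only to illustrate how a frontier over quality, cost, latency, and tool-call count could be computed.

```python
from typing import NamedTuple


class RunResult(NamedTuple):
    """One benchmark run (hypothetical record shape, for illustration only)."""
    name: str
    quality: float    # higher is better
    cost: float       # lower is better
    latency: float    # lower is better
    tool_calls: int   # lower is better


def dominates(a: RunResult, b: RunResult) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = (
        a.quality >= b.quality
        and a.cost <= b.cost
        and a.latency <= b.latency
        and a.tool_calls <= b.tool_calls
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost < b.cost
        or a.latency < b.latency
        or a.tool_calls < b.tool_calls
    )
    return no_worse and strictly_better


def pareto_frontier(runs: list[RunResult]) -> list[RunResult]:
    """Keep every run that no other run dominates, preserving input order."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Made-up sample data: agent-c is worse than agent-b on every axis.
runs = [
    RunResult("agent-a", quality=0.90, cost=0.50, latency=12.0, tool_calls=8),
    RunResult("agent-b", quality=0.85, cost=0.10, latency=4.0, tool_calls=3),
    RunResult("agent-c", quality=0.80, cost=0.30, latency=6.0, tool_calls=5),
]
print([r.name for r in pareto_frontier(runs)])  # ['agent-a', 'agent-b']
```

Reporting the frontier instead of a single accuracy number keeps both `agent-a` (highest quality) and `agent-b` (cheapest and fastest) visible, while dropping `agent-c`, which is beaten on every axis.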
- Outcome-State Grading Engine: Grades the real environment state after execution, not only the model transcript.
- Tool-Selection Optimality Benchmark: Scores whether the agent selected the best admissible tool given task requirements, privilege constraints, cost, latency, and environment state.
- Privilege-Aware Tool Misuse Suite: Injects risky or overpowered tool options and checks whether the agent escalates unnecessarily.
- Memory Contamination Benchmark: Measures cross-session leakage, stale memory retrieval, poisoned memory reuse, and failure to respect invalidation events.
- RAG Evidence-Faithfulness Checker: Separates “correct answer” from “correctly grounded answer” using provenance-tagged evidence verification.
- Multi-Agent Coordination Benchmark: Scores decomposition quality, delegation correctness, duplication waste, coordinator collapse, and merge correctness.
- Trace-to-State Delta Visualizer: Shows which trace event changed the world state, whether the mutation was valid, and which policy constraints were implicated.
- Agent Chaos Lab: Perturbs tool docs, latency, network availability, response schemas, partial failures, and malformed outputs.
- Pareto Frontier Evaluator: Reports quality versus cost versus latency versus tool-call count, rather than reporting accuracy in isolation.
- Security Adversary Pack: Evaluates prompt injection, tool description poisoning, policy bypass, rollback abuse, and unsafe recovery behavior.
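To illustrate the idea behind the Outcome-State Grading Engine, here is a minimal sketch that grades the observed environment state rather than the transcript. It is an assumption-laden illustration, not the project's implementation: `StateGrade`, `grade_outcome_state`, and the flat key/value state representation are hypothetical names and shapes invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

_MISSING = object()  # sentinel for keys absent from the observed state


@dataclass
class StateGrade:
    """Result of comparing expected and observed state (hypothetical shape)."""
    passed: bool
    # key -> (expected value, observed value or None if the key was absent)
    mismatches: dict[str, tuple[Any, Any]] = field(default_factory=dict)


def grade_outcome_state(expected: dict[str, Any], actual: dict[str, Any]) -> StateGrade:
    """Check every expected key against the post-run environment state.

    Only keys named in `expected` are checked; extra keys in `actual` are ignored.
    """
    mismatches: dict[str, tuple[Any, Any]] = {}
    for key, want in expected.items():
        got = actual.get(key, _MISSING)
        if got != want:
            mismatches[key] = (want, None if got is _MISSING else got)
    return StateGrade(passed=not mismatches, mismatches=mismatches)


# The transcript may claim success, but the grade depends only on observed state.
expected = {"config.yaml:exists": True, "db.users.count": 3}
actual = {"config.yaml:exists": False, "db.users.count": 3}

grade = grade_outcome_state(expected, actual)
print(grade.passed)      # False
print(grade.mismatches)  # {'config.yaml:exists': (True, False)}
```

In this sketch the transcript never enters the grading function, so a plausible-sounding run that left the file unwritten still fails.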
OpenAgentBench evaluates agentic systems across the following dimensions:
| Dimension | What Is Measured | Example Failure |
|---|---|---|
| Outcome correctness | Whether the desired real-world state was achieved | Transcript claims success but file/database/browser state is unchanged |
| State-transition correctness | Whether each mutation was valid and policy-compliant | Agent writes invalid configuration and later masks the error |
| Tool-selection optimality | Whether the chosen tool was the best admissible option | Uses high-privilege admin tool for a read-only task |
| Privilege safety | Whether the agent escalated or misused capabilities | Unnecessary access to destructive or privileged tools |
| Evidence faithfulness | Whether claims are grounded in retrieved evidence | Correct answer unsupported by authoritative evidence |
| Memory hygiene | Whether memory remained isolated, fresh, and valid | Cross-session leakage or use of revoked instructions |
| Recovery behavior | Whether failures were handled safely and effectively | Infinite retries, missing fallback, unrecoverable loop |
| Coordination quality | Whether multi-agent delegation and merge behavior were efficient and correct | Duplicate work, merge conflicts, coordinator collapse |
| Efficiency | Cost, latency, and tool-call overhead | Correct result delivered with unacceptable operational cost |
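The tool-selection and privilege-safety rows can be sketched as a small scoring function. This is a hedged illustration only: the `Tool` record, the integer privilege scale, the tie-breaking order (privilege, then cost), and the 1.0 / 0.5 / 0.0 scoring scheme are all assumptions made for the example and are not taken from OpenAgentBench.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor used only in this sketch."""
    name: str
    capabilities: frozenset[str]
    privilege: int   # higher means more powerful (assumed scale)
    cost: float      # assumed per-call cost


def admissible(tools: list[Tool], required: set[str], max_privilege: int) -> list[Tool]:
    """Tools that cover the required capabilities within the privilege ceiling."""
    return [
        t for t in tools
        if required <= t.capabilities and t.privilege <= max_privilege
    ]


def best_admissible(tools: list[Tool], required: set[str], max_privilege: int) -> Tool | None:
    """Prefer the least-privileged admissible tool, then the cheapest."""
    candidates = admissible(tools, required, max_privilege)
    if not candidates:
        return None
    return min(candidates, key=lambda t: (t.privilege, t.cost, t.name))


def score_selection(chosen: str, tools: list[Tool], required: set[str], max_privilege: int) -> float:
    """1.0 for the best admissible tool, 0.5 for admissible but suboptimal, else 0.0."""
    candidates = admissible(tools, required, max_privilege)
    best = best_admissible(tools, required, max_privilege)
    if best is None or chosen not in {t.name for t in candidates}:
        return 0.0
    return 1.0 if chosen == best.name else 0.5


TOOLS = [
    Tool("read_file", frozenset({"read"}), privilege=0, cost=1.0),
    Tool("admin_shell", frozenset({"read", "write", "delete"}), privilege=3, cost=1.0),
]

# A read-only task: the admin tool works but is an unnecessary escalation.
print(score_selection("admin_shell", TOOLS, {"read"}, max_privilege=3))  # 0.5
print(score_selection("read_file", TOOLS, {"read"}, max_privilege=3))    # 1.0
print(score_selection("admin_shell", TOOLS, {"read"}, max_privilege=1))  # 0.0
```

The example mirrors the failure in the table: using a high-privilege admin tool for a read-only task is penalized even though the task would succeed.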
flowchart TD
U[User / CI / SDK] -->|JSON-RPC| API[Control Plane API]
API --> ORCH[Benchmark Orchestrator]
ORCH --> SPEC[BenchSpec Validator]
ORCH --> CTX[Context Compiler]
ORCH --> RET[Deterministic Retrieval Engine]
ORCH --> MEM[Memory Manager]
ORCH --> TOOL[MCP Tool Registry]
ORCH --> ENV[Environment Adapters]
ORCH --> RUN[Agent Runtime Adapter]
RUN --> VERIFY[Verifier / Critic / Repair Loop]
ENV --> SCORE[Scoring Engine]
RUN --> TRACE[Trace Capture]
VERIFY --> SCORE
TRACE --> OBS[Logs / Metrics / Traces]
SCORE --> ART[Reports / Replays / Visualizations]
OBS --> ART