generalaimodels / OpenAgentBench

Research-grade evaluation & verification platform for LLM agents, RAG pipelines, and tool-using workflows. It goes beyond answer scoring to grade tool-choice optimality, state-transition correctness, memory hygiene, privilege safety, recovery behavior, and multi-agent coordination.

Topics: multi-agent-systems, ai-safety, memory-benchmark, chaos-engineering, rag-evaluation, agentic-ai, agent-evaluation, llm-benchmark, tool-use-verification, prompt-injection-testing, tool-selection-benchmark, memory-contamination-benchmark, agent-security-red-teaming, multi-agent-benchmark, privilege-aware-tooling, agent-chaos-engineering, provenance-based-verification

Updated Mar 18, 2026