Adoit/open-data-agent

Open Data Agent

A local CLI tool (oda) that lets developers, data analysts, data scientists, and AI agents query databases safely — with auto-generated schema context, read-only enforcement, and self-healing diagnostics.


What is Open Data Agent?

Open Data Agent (oda) is a local command-line tool that bridges the gap between natural language questions and SQL databases. It is designed to be used by:

  • Data analysts and data scientists doing exploratory data analysis (EDA) — query any connected database without memorising schema details
  • Developers running ad-hoc SQL safely, with automatic LIMIT injection and query history
  • AI coding agents (primarily OpenCode) — oda acts as a structured, read-only database interface that an agent can call as a tool, using auto-generated schema documentation as context

Key properties:

  • Read-only by default — write operations are hard-blocked with no bypass mechanism
  • Schema docs as context — human- and agent-readable markdown catalog generated from your live schema
  • Self-healing diagnostics — zero-row results and errors emit structured diagnostic context (row counts, sample values, NULL counts) so an AI agent can self-correct and retry
  • Local-first — no server, no network service, all data on disk
  • Multi-dialect — supports PostgreSQL, MySQL, and SQLite

How It Works

End-to-End Flow

Question (natural language or SQL)
  │
  ├─ [AI agent] reads .opencode/rules/data-agent.md       ← generated by `oda connect`
  ├─ [AI agent] reads docs/data-catalog/_index.md         ← generated by `oda docs generate`
  ├─ [AI agent] navigates to relevant table docs
  ├─ [AI agent] checks memory/ for known data quirks
  │
  └─ oda query "SELECT ..."
        │
        ├─ SafetyChecker       read-only whitelist, dangerous pattern detection
        ├─ LIMIT injection     auto-appends or clamps LIMIT (default 1000, max 10000)
        ├─ Timeout             server-side (PostgreSQL/MySQL) or thread-based (SQLite)
        ├─ Execution           results printed to stdout
        └─ History log         appended to ~/.config/open-data-agent/history.jsonl
              │
              └─ (zero rows or error)
                    └─ DiagnosticEngine → structured context to stderr → agent retries

Core Components

1. Schema Catalog

oda docs generate introspects your connected database and produces a hierarchical markdown catalog under docs/data-catalog/. Each table gets its own file with columns, types, nullability, sample rows, and (optionally) column statistics. This catalog is the primary context source — neither you nor the agent needs to run live introspection queries.
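For illustration, a per-table file such as docs/data-catalog/orders.md might contain something like the following (layout and field names are assumptions, not the exact generated format):

```markdown
# orders

| Column     | Type      | Nullable | Sample values       |
|------------|-----------|----------|---------------------|
| order_id   | integer   | no       | 1001, 1002          |
| status     | text      | no       | active, pending     |
| created_at | timestamp | no       | 2025-01-04 09:12:00 |
```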

2. Memory Store

memory/ contains curated markdown files (with YAML frontmatter) that capture data quirks, business logic, known anomalies, and tribal knowledge about your data. The agent checks memory before constructing a query.
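For illustration, a memory entry might look like this (file name and frontmatter fields are assumptions, modelled on the oda memory add options shown later in this README):

```markdown
---
title: Revenue column
category: data_quality
---

Use net_item_price, not item_price: item_price includes tax.
```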

3. Query Engine

Every query passes through a safety pipeline:

  • Whitelist check (primary guard) — only explicitly allowed prefixes pass: SELECT, WITH, EXPLAIN, and dialect-specific commands (PRAGMA for SQLite; SHOW for PostgreSQL and MySQL; TABLE for MySQL). Any statement that does not begin with an allowed prefix is blocked outright.
  • Blacklist check (secondary guard) — catches injection patterns inside otherwise-valid SQL: INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE, REPLACE, MERGE, GRANT, REVOKE, CALL, EXEC, EXECUTE, and dialect-specific patterns such as COPY, LOAD DATA, ATTACH DATABASE
  • LIMIT injection — if no LIMIT is present, one is appended automatically; if the limit exceeds the configured maximum, it is clamped
  • Timeout — queries are killed after a configurable timeout (default 30 seconds)
  • History — every query is appended to a local JSONL log with timing and metadata
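The first three stages can be sketched in a few lines of Python. This is an illustrative reimplementation of the rules above, not oda's actual code; only the core keywords are shown, and dialect-specific blacklist patterns such as COPY or ATTACH DATABASE are omitted:

```python
import re

# Illustrative safety-pipeline sketch (not the actual oda implementation).
ALLOWED_PREFIXES = ("SELECT", "WITH", "EXPLAIN", "PRAGMA", "SHOW", "TABLE")
BLOCKED_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE|REPLACE|MERGE|"
    r"GRANT|REVOKE|CALL|EXEC|EXECUTE)\b",
    re.IGNORECASE,
)
DEFAULT_LIMIT = 1000   # injected when no LIMIT is present
MAX_LIMIT = 10000      # hard ceiling; larger limits are clamped

def check_and_rewrite(sql: str) -> str:
    """Whitelist check, blacklist check, then LIMIT injection/clamping."""
    stripped = sql.strip().rstrip(";")
    # Whitelist (primary guard): statement must begin with an allowed prefix.
    if not stripped.upper().startswith(ALLOWED_PREFIXES):
        raise ValueError("blocked: statement prefix not whitelisted")
    # Blacklist (secondary guard): write keywords anywhere in the statement.
    if BLOCKED_KEYWORDS.search(stripped):
        raise ValueError("blocked: write keyword detected")
    # LIMIT injection: append a default, or clamp an oversized trailing LIMIT.
    match = re.search(r"\bLIMIT\s+(\d+)\s*$", stripped, re.IGNORECASE)
    if match is None:
        return f"{stripped} LIMIT {DEFAULT_LIMIT}"
    if int(match.group(1)) > MAX_LIMIT:
        return stripped[: match.start()] + f"LIMIT {MAX_LIMIT}"
    return stripped
```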

4. Diagnostics

When a query returns zero rows or errors, DiagnosticEngine automatically emits structured context to stderr: table row counts, sample column values, and NULL counts for relevant columns. This gives an AI agent (or a developer) enough signal to self-correct and retry without running additional queries manually.
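The lookups DiagnosticEngine performs can be sketched with plain DB-API calls. This is an illustrative reimplementation, not oda's actual code, and `diagnose` is a hypothetical helper:

```python
def diagnose(conn, table: str, column: str) -> dict:
    """Gather row count, sample values, and NULL count for one filter column.

    Illustrative sketch of the kind of context emitted on zero-row results.
    The table/column names are assumed to come from parsing the failed query,
    not from untrusted input (hence the plain string interpolation).
    """
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    row_count = cur.fetchone()[0]
    cur.execute(f"SELECT DISTINCT {column} FROM {table} LIMIT 5")
    samples = [r[0] for r in cur.fetchall()]
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    null_count = cur.fetchone()[0]
    return {"row_count": row_count, "sample_values": samples, "null_count": null_count}
```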

Using with OpenCode

oda is designed as a first-class tool for the OpenCode AI coding agent. When you run oda connect <name>, it renders a rules file at .opencode/rules/data-agent.md that tells OpenCode:

  • Which database is active and how to explore it
  • How to navigate the schema catalog
  • How to call oda query, oda memory, and oda docs commands
  • How to interpret diagnostic output and self-heal on failures

This means no extra LLM setup is required for natural language queries — OpenCode reads your schema docs, understands your data, constructs the SQL, and calls oda query as a tool. The full NL→SQL→results loop works out of the box.

See Asking Questions in Natural Language with OpenCode for a step-by-step walkthrough.


How to Use It

Prerequisites

  • Python 3.12+
  • uv — install with curl -LsSf https://astral.sh/uv/install.sh | sh

Install

git clone <repository-url>
cd open-data-agent
uv sync

Initialise

uv run oda init

Creates ~/.config/open-data-agent/ with default config files.

Add a Connection

uv run oda connections add

Prompts for connection name, database type (postgresql / mysql / sqlite), host, port, database name, username, and password. Passwords are stored in the OS keychain where available (macOS Keychain, GNOME Keyring, Windows Credential Manager). On headless or CI environments without a keychain backend, passwords fall back to plaintext in ~/.config/open-data-agent/connections.yaml with a warning.

Activate a Connection

uv run oda connect <name>

Sets the active connection and renders .opencode/rules/data-agent.md for OpenCode.

Note: oda connect writes the rules file relative to the current working directory. Always run this command from your project root, otherwise the file will be created in the wrong location.

Generate Schema Docs

uv run oda docs generate

# Include column statistics (null counts, distinct counts, min/max):
uv run oda docs generate --enrich

Run a Query

uv run oda query "SELECT * FROM orders LIMIT 10"

# Output as JSON or CSV:
uv run oda query "SELECT * FROM orders" --format json
uv run oda query "SELECT * FROM orders" --format csv
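As an illustration only (the exact envelope is not documented here), --format json output might look like a list of row objects:

```json
[
  {"order_id": 1001, "customer_id": 42, "status": "active"},
  {"order_id": 1002, "customer_id": 17, "status": "pending"}
]
```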

Command Reference

Command                        Description
oda init                       First-run setup: create config directories and default files
oda connect <name>             Activate a connection; render OpenCode rules file
oda connections list           List all configured connections
oda connections add            Add a new connection (interactive)
oda connections remove <name>  Remove a connection
oda connections test <name>    Test live connectivity
oda schemas                    List schemas in the active database
oda tables [<schema>]          List tables (optional schema filter)
oda describe <table>           Show columns and types
oda sample <table> [--n N]     Show N sample rows (default 5)
oda profile <table>            Column statistics: null count, distinct, min, max
oda query "<sql>"              Execute SQL; auto-logged to history
oda docs generate [--enrich]   Generate schema documentation catalog
oda docs status                Show freshness report for schema docs
oda memory list                List memory entries
oda memory add                 Add a memory entry
oda memory search <term>       Search memory entries
oda history list [--n N]       Show most recent N query history entries
oda history search <term>      Search query history
oda history stats              Query history statistics
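The history commands read the JSONL log at ~/.config/open-data-agent/history.jsonl, one JSON object per line. A minimal sketch of tailing such a log (the entry fields are whatever oda writes; `tail_history` is a hypothetical helper, not part of oda):

```python
import json
from pathlib import Path

def tail_history(path: Path, n: int = 10) -> list[dict]:
    """Return the last n entries of a JSONL log (one JSON object per line)."""
    entries = []
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries[-n:]
```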

Configuration

Global config lives at ~/.config/open-data-agent/config.yaml:

row_limit: 1000            # default LIMIT auto-injected if absent
max_row_limit: 10000       # hard ceiling; never exceeded
query_timeout_seconds: 30  # query execution timeout
docs_staleness_days: 7     # warn if schema docs older than this
log_level: INFO
strict_mode: false         # if true: block queries when docs are stale
                           # equivalent to passing --strict on every oda query call

Testing with Sample Data

A ready-to-run guide using the Northwind database is available at docs/testing-with-northwind.md. It covers spinning up a local PostgreSQL container, registering the connection, generating schema docs, and running sample queries that exercise every major oda feature.

# One command to get started:
bash scripts/setup-northwind.sh

Development

# Unit tests (no external database required)
uv run pytest tests/unit/ -q

# Lint and format
uv run ruff check .
uv run ruff format --check .

# Type check
uv run mypy src/open_data_agent

# Integration tests (requires Docker)
docker compose up -d
uv run pytest -m integration
docker compose down

Asking Questions in Natural Language with OpenCode

OpenCode is an AI coding agent that runs in your terminal. When paired with oda, it acts as a natural language interface to your database — you ask a question in plain English, OpenCode reads your schema docs, writes the SQL, and calls oda query to execute it. No extra LLM configuration or API keys are needed beyond your OpenCode setup.

Prerequisites

  • OpenCode installed and configured (opencode available in your PATH)
  • oda set up with a connection and schema docs generated (see How to Use It)

Step 1 — Connect and generate schema docs

# Run from your project root — oda connect writes .opencode/rules/data-agent.md
# relative to the current working directory
uv run oda connect my-db

# Generate the schema catalog OpenCode will use as context
uv run oda docs generate

After this, docs/data-catalog/ contains a markdown file for every table in your database. OpenCode reads these files to understand your schema without running any live introspection queries.

Step 2 — Open OpenCode in your project

opencode

OpenCode automatically loads .opencode/rules/data-agent.md on startup. This file tells it which database is active, how to navigate the schema catalog, and how to use oda commands.

Step 3 — Ask a question in natural language

Type your question directly in the OpenCode chat. Examples:

How many orders were placed last month, broken down by status?
Which customers have spent more than $10,000 in total?
Show me the top 10 products by revenue in Q1 2025.

OpenCode will:

  1. Read the relevant table docs from docs/data-catalog/
  2. Check memory/ for any known data quirks affecting those tables
  3. Construct a safe, read-only SQL query
  4. Call oda query "..." to execute it
  5. Present the results — and if zero rows are returned, use the diagnostic output to self-correct and retry

Step 4 — Refine with follow-up questions

You can ask follow-up questions in the same session:

Filter that to just the 'enterprise' customer segment.
Now group by region instead of status.
Export that as CSV.

OpenCode maintains conversation context, so follow-ups build on the previous query.

Step 5 — Save data knowledge to memory

When you or OpenCode discovers something important about your data (a quirky column, a misleading field name, a known data quality issue), save it so future sessions benefit:

uv run oda memory add --title "Revenue column" \
  --category data_quality \
  --content "Use net_item_price not item_price — item_price includes tax"

Or ask OpenCode to do it:

Remember that the revenue column to use is net_item_price, not item_price.

How OpenCode self-heals

If a query returns zero rows, oda query automatically emits structured diagnostics to stderr:

  • Row counts for each table referenced in the SQL
  • Sample values for filter columns (e.g. Column 'status' sample values: ['active', 'pending'])
  • NULL counts for filter columns

OpenCode reads this output and retries with a corrected query, without any input from you.
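Put together, a zero-row diagnostic might read along these lines (wording illustrative, not verbatim oda output):

```text
[diagnostic] Query returned 0 rows
Table 'orders': 48,211 rows
Column 'status' sample values: ['active', 'pending', 'shipped']
Column 'status' NULL count: 0
```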

Keeping schema docs fresh

If your database schema changes, regenerate the catalog before asking questions:

uv run oda docs generate

# Check freshness at any time:
uv run oda docs status

Use --strict mode to block queries against stale docs:

uv run oda query "SELECT ..." --strict

License

MIT License

Copyright (c) 2026 Adoit

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
