Cognitive Triad Simulation (CTS)

A 30-day AI agent experiment that tests whether three LLM agents with distinct cognitive personalities can learn, share knowledge, and outperform a single model answering cold.

Three agents --- Alpha, Beta, and Gamma --- are given a 29-day university-grade curriculum spanning mathematics, physics, chemistry, biology, computer science, philosophy, economics, and more. They learn from lectures, ask follow-up questions, have peer conversations, share knowledge, and manage their memory stores. On Day 30, they face a 30-question elimination test. Score below 60% and you are permanently eliminated.

A solo baseline (same model, no knowledge stores, no persona) answers the same questions, serving as the control group. The experiment measures whether 29 days of structured learning adds anything over the model's raw training knowledge.


The Hypothesis

LLM agents with specialized cognitive roles and constrained memory stores, forced to learn collaboratively under elimination pressure, will develop richer knowledge representations than a single model answering from its training data alone.

The experiment also tests:

  • Whether different knowledge storage strategies (fast recall vs deep reasoning vs axiomatic truths) produce different outcomes
  • Whether peer conversation and knowledge sharing between agents improves learning
  • Whether memory overflow and forced forgetting create interesting knowledge management behaviors
  • Whether a structured multi-agent system adds measurable value over a monolithic model

The Three Agents

Alpha --- The Impulsive Mind

Fast, instinctive, pattern-matching. Stores almost everything as impulse memory (quick-recall facts). Only goes deep when something is genuinely paradigm-shifting. Alpha fires first, remembers broadly, and relies on gut-level pattern recognition.

  • Primary store: impulse
  • Personality file: agents/alpha.py
  • System prompt defined in: ALPHA_SYSTEM_PROMPT at the top of that file

Beta --- The Deep Thinker

Methodical, analytical, thorough. Stores almost everything in deep thinking memory with reasoning chains, confidence scores, and cross-domain connections. Beta doesn't memorize --- it understands. Slower but deeper.

  • Primary store: deep_thinking
  • Personality file: agents/beta.py
  • System prompt defined in: BETA_SYSTEM_PROMPT at the top of that file

Gamma --- The Axiom Guardian

Rigorous, principled, conservative. Evaluates every piece of knowledge for axiom-worthiness: is this a universal truth, always true, everywhere, without exception? If yes, it becomes an axiom. If not, it gets categorized as deep or impulse. Gamma also serves as the final validator for all axiom proposals from the other agents --- even its own candidates face the same scrutiny.

  • Primary store: axiom
  • Personality file: agents/gamma.py
  • System prompt defined in: GAMMA_SYSTEM_PROMPT at the top of that file
  • Axiom validation pipeline: validate_axiom() method in agents/gamma.py

Knowledge Architecture

Each agent maintains three separate knowledge stores:

| Store | Max Entries | Max Tokens/Entry | Purpose |
|-------|-------------|------------------|---------|
| Impulse | 50 | 100 | Quick-recall facts, definitions, constants |
| Deep Thinking | 200 | 500 | Reasoning chains, multi-step logic, cross-domain connections |
| Axiom | 100 | 250 | Universal truths, foundational principles, validated theorems |

Overflow and Silent Deletion

When a store reaches capacity, the entry with the lowest utility score (combination of access frequency, confidence, and recency) is silently deleted. The agent receives no notification --- it simply won't find that knowledge on next recall. This simulates how memories degrade under information overload.

Every deletion is logged to the overflow_events database table for post-hoc analysis.
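The eviction step can be sketched in Python. The utility formula matches the one documented later in this README (from knowledge/store.py); the surrounding function names are illustrative, not the store's actual API:

```python
def utility_score(entry: dict, day: int) -> float:
    # Combination of access frequency, confidence, and recency
    access = entry.get("access_count", 0)
    confidence = entry.get("confidence", 0.5)
    age = max(1, day - entry.get("created_day", 0))
    return access * 0.4 + confidence * 0.3 + (1.0 / age) * 0.3

def evict_if_full(store: list, max_entries: int, day: int, deletions: list) -> None:
    # Silently drop the lowest-utility entry; the agent is never told,
    # but each deletion is appended to the overflow log
    while len(store) > max_entries:
        victim = min(store, key=lambda e: utility_score(e, day))
        store.remove(victim)
        deletions.append(victim)
```

A rarely-accessed, low-confidence, old entry loses to a fresh or frequently-recalled one, which is what produces the "degrades under overload" effect described above.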

Deduplication

Before adding any entry, the system computes a normalized fingerprint (sorted unique tokens) and checks Jaccard similarity against all existing entries. Entries with similarity >= 0.7 are rejected as duplicates.
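A minimal sketch of that check (function names here are illustrative rather than the actual API in knowledge/store.py):

```python
def fingerprint(text: str) -> frozenset:
    # Normalized fingerprint: lowercase, unique tokens
    return frozenset(text.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    # |intersection| / |union|; 1.0 means identical token sets
    if not (a | b):
        return 1.0
    return len(a & b) / len(a | b)

DEDUP_THRESHOLD = 0.7

def is_duplicate(candidate: str, existing: list[str]) -> bool:
    fp = fingerprint(candidate)
    return any(jaccard(fp, fingerprint(e)) >= DEDUP_THRESHOLD for e in existing)
```

Because the fingerprint is a token *set*, reorderings and small wording changes of the same fact still collide, which is the point of the check.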

Knowledge Retrieval

Retrieval uses a custom TF-IDF implementation with unigram + bigram matching and cosine similarity. No external ML libraries required.
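A self-contained sketch of that retrieval scheme, with unigram + bigram terms and cosine similarity over TF-IDF vectors. The smoothing and weighting details here are illustrative, not necessarily those of knowledge/store.py:

```python
import math
from collections import Counter

def ngrams(text: str) -> Counter:
    # Unigram + bigram term counts
    toks = text.lower().split()
    return Counter(toks + [" ".join(p) for p in zip(toks, toks[1:])])

def tfidf_retrieve(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    doc_terms = [ngrams(d) for d in docs]
    n = len(docs)
    # Smoothed inverse document frequency per term
    df = Counter(t for terms in doc_terms for t in set(terms))
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    default_idf = math.log(n + 1) + 1.0  # for terms unseen in any doc

    def vec(terms: Counter) -> dict:
        return {t: f * idf.get(t, default_idf) for t, f in terms.items()}

    def cosine(a: dict, b: dict) -> float:
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(ngrams(query))
    return sorted(docs, key=lambda d: cosine(q, vec(ngrams(d))), reverse=True)[:top_k]
```

Bigrams let phrase-level matches ("laws of motion") outrank documents that merely share individual common words.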


Daily Simulation Loop

Each of the 29 learning days follows this phase cycle:

WAKE --> TEACHING --> LEARNING --> PEER CONVERSATION --> KNOWLEDGE SHARING --> KNOWLEDGE MANAGEMENT --> SLEEP

Phase 1: WAKE

System briefing with elimination countdown. Progressive urgency increases as Day 30 approaches. Agents receive adaptive study guidance based on their weak topics (meta-learning).

Phase 2: TEACHING

An oracle LLM generates comprehensive lectures for the day's curriculum topic, split into subtopics. All agents absorb each lecture according to their cognitive personality --- Alpha stores quick facts, Beta builds reasoning chains, Gamma evaluates for axiom-worthiness.

Phase 3: LEARNING (Q&A)

Each agent asks 5 follow-up questions targeting gaps in their knowledge. The oracle answers each question, and agents store the answers per their persona.

Phase 4: PEER CONVERSATION

Agents pair up (Alpha-Beta, Beta-Gamma, Alpha-Gamma) for 8-12 message exchanges per pair. They share insights, challenge each other's understanding, and learn from opposing cognitive perspectives.

Phase 5: KNOWLEDGE SHARING

Smart inter-agent knowledge transfer. The system checks fingerprint similarity before sharing --- only genuinely new knowledge gets transferred. High-access impulse entries get promoted to deep thinking. High-confidence deep entries become axiom candidates.
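The two promotion rules can be sketched as a filter pass. The threshold constants below are hypothetical stand-ins (the actual values are not documented here), and the real promotion logic may differ:

```python
# Hypothetical thresholds, for illustration only
PROMOTE_ACCESS_MIN = 5      # impulse -> deep_thinking promotion
AXIOM_CONFIDENCE_MIN = 0.9  # deep_thinking -> axiom candidate

def promotion_pass(impulse: list[dict], deep: list[dict]) -> tuple[list[dict], list[dict]]:
    # High-access impulse entries are promoted to deep thinking;
    # high-confidence deep entries become axiom candidates
    promoted = [e for e in impulse if e.get("access_count", 0) >= PROMOTE_ACCESS_MIN]
    candidates = [e for e in deep if e.get("confidence", 0.0) >= AXIOM_CONFIDENCE_MIN]
    return promoted, candidates
```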

Phase 6: KNOWLEDGE MANAGEMENT

Agents review and manage their stores. Axiom proposals from all agents are collected and sent through Gamma's validation pipeline:

  1. The candidate axiom is evaluated for universality
  2. Gamma checks for conflicts with existing axioms
  3. On conflict, Alpha and Beta are consulted
  4. Gamma makes the final ruling (accept/reject)

All axioms --- including Gamma's own proposals --- must pass this pipeline.
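The pipeline's control flow can be sketched as below. In the real system the three checks are LLM judgments made by Gamma (and, on conflict, by Alpha and Beta), so they are injected here as callables; all names are illustrative:

```python
def validate_axiom(candidate: dict, existing_axioms: list,
                   is_universal, conflicts_with, consult_peers) -> bool:
    # 1. Universality: must hold always, everywhere, without exception
    if not is_universal(candidate):
        return False
    # 2. Conflict check against the current axiom store
    conflicts = [a for a in existing_axioms if conflicts_with(candidate, a)]
    # 3. On conflict, Alpha and Beta are consulted before the ruling
    if conflicts and not consult_peers(candidate, conflicts):
        return False
    # 4. Final ruling: accept
    return True
```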

Phase 7: SLEEP

Conversation histories are cleared (simulating sleep). Knowledge is consolidated every 5 days (clusters of 3+ similar entries are merged into distilled summaries). All stores are persisted to disk and snapshotted to SQLite.
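Structurally, the every-5-days consolidation could look like the greedy token-overlap clustering below. In the actual simulation the distillation itself is presumably LLM-driven, so summarize is injected as a callable; the similarity threshold and grouping strategy here are illustrative:

```python
def token_set(text: str) -> frozenset:
    return frozenset(text.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def consolidate(entries: list[str], summarize, min_cluster: int = 3,
                threshold: float = 0.5) -> list[str]:
    # Greedily group entries whose token overlap with a cluster's first
    # member meets the threshold; clusters of min_cluster+ entries are
    # replaced by a single distilled summary
    clusters: list[list[str]] = []
    for e in entries:
        for c in clusters:
            if jaccard(token_set(e), token_set(c[0])) >= threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    out: list[str] = []
    for c in clusters:
        out.extend([summarize(c)] if len(c) >= min_cluster else c)
    return out
```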


Day 30: Elimination Test

On Day 30, the simulation generates 30 test questions from the 29-day curriculum:

  • 10 impulse questions: Quick-recall factual questions (1-2 sentence answers)
  • 10 deep questions: Multi-step analytical questions requiring reasoning across concepts
  • 10 axiom questions: True/false claims about fundamental principles, requiring justification

Each agent answers all 30 questions individually using their accumulated knowledge stores.

A solo baseline answers the same 30 questions using the same underlying model but with no knowledge stores and no agent persona. This is the control group.

Scoring

An evaluator LLM grades each answer on a 1-10 scale using a strict rubric:

| Score | Meaning |
|-------|---------|
| 1-2 | Wrong, irrelevant, or nonsensical |
| 3-4 | Partially correct but major errors or critical omissions |
| 5-6 | Correct core idea but shallow, vague, or missing important details |
| 7-8 | Mostly correct and well-reasoned, minor gaps |
| 9 | Excellent --- accurate, thorough, and well-structured |
| 10 | Perfect --- flawless, comprehensive, demonstrates deep mastery |

Agents scoring below 60% overall (fewer than 180 of the 300 possible points across the 30 questions) are permanently eliminated.
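Assuming the overall percentage is total points divided by maximum possible points (10 per question), the survival check reduces to:

```python
PASS_THRESHOLD = 0.6  # from config.py

def survives(scores: list[float]) -> bool:
    # Overall percentage = total points / max possible (10 per question)
    pct = sum(scores) / (len(scores) * 10.0)
    return pct >= PASS_THRESHOLD
```

An average of 6/10 across all 30 questions is exactly on the line; anything below it is elimination.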


29-Day Curriculum

The curriculum covers university-grade material across all major academic domains:

| Days | Domain | Topics |
|------|--------|--------|
| 1-3 | Pure Mathematics | Foundations/Logic/Number Theory, Algebra/Linear Algebra, Analysis/Topology |
| 4-6 | Physics | Classical Mechanics/Thermo, Electromagnetism/Relativity, Quantum/Particle |
| 7 | Chemistry | Atomic Structure through Biochemistry |
| 8-9 | Biology | Molecular Bio/Genetics/Evolution, Physiology/Ecology/Earth Sciences |
| 10-11 | Computer Science | Algorithms/Complexity, Systems/Networking/Databases |
| 12 | Information Theory | Probability, Statistics, ML Theory |
| 13-14 | Philosophy | Ancient through Modern, Analytic/Continental/Contemporary |
| 15 | Ethics & Law | Metaethics, Normative Ethics, Political Philosophy, Jurisprudence |
| 16 | Philosophy of Mind | Consciousness, Cognitive Science, Free Will |
| 17-18 | Economics | Micro/Macro, Game Theory, Behavioral Economics |
| 19-20 | Psychology & Linguistics | Cognitive/Social Psychology, Syntax/Semantics/Pragmatics |
| 21-22 | History & Religion | World History, Comparative Religion |
| 23-24 | Arts & Media | Music Theory/Visual Arts, Film/Media/Digital Culture |
| 25-26 | Advanced STEM | Causal Inference, Unsolved Problems (P=NP, Riemann, etc.) |
| 27 | Interdisciplinary | Complex Systems, Network Science, Chaos Theory |
| 28 | Applied | Engineering, Medicine, Agriculture, Urban Planning |
| 29 | Meta & Integration | Epistemology, Research Methods, Cross-Domain Synthesis |
| 30 | FINAL TEST | 30 questions across all domains |

Each day's topic contains 6-7 detailed subtopics, each generating a comprehensive lecture.


Project Structure

Trifecta/
  main.py                          # Entry point --- run the full simulation
  config.py                        # All configuration: models, limits, curriculum, parameters
  requirements.txt                 # Python dependencies
  run.sh                           # Shell helper script
  .env.example                     # Template for API keys
  .gitignore

  agents/
    base_agent.py                  # Abstract base: LLM calls, knowledge injection, test answers
    alpha.py                       # Alpha: impulsive, pattern-matching
    beta.py                        # Beta: deep analytical thinker
    gamma.py                       # Gamma: axiom guardian and validator

  knowledge/
    store.py                       # KnowledgeStore: add, evict, retrieve (TF-IDF), deduplicate

  simulation/
    environment.py                 # Day loop orchestrator: phases, agent coordination
    question_oracle.py             # Oracle LLM: generates lectures and answers questions
    communication.py               # Peer conversation bus: agent-to-agent exchanges
    curriculum.py                  # Topic parser: splits curriculum entries into subtopics
    curriculum_test.py             # Day 30 test: generates questions, tests agents, baseline
    evaluator.py                   # Scoring LLM: generates questions, grades answers 1-10

  sim_logging/
    db.py                          # SQLite logger: interactions, mutations, conversations, snapshots
    export.py                      # Post-sim export: summary stats, knowledge flow, survival report

  data/                            # Generated at runtime (gitignored)
    alpha/                         #   impulse.json, deep_thinking.json, axiom.json
    beta/                          #   impulse.json, deep_thinking.json, axiom.json
    gamma/                         #   impulse.json, deep_thinking.json, axiom.json
    simulation.db                  #   SQLite database with all logged events

  analysis/                        # Generated after simulation (gitignored)
    summary_stats.json             #   Per-agent token usage, store sizes, growth curves
    knowledge_flow.json            #   Knowledge mutation flow data
    survival_report.md             #   Final test results and survival verdicts

Setup

Prerequisites

  • Python 3.10+
  • An OpenAI API key (or any OpenAI-compatible API endpoint)

Installation

git clone <repo-url>
cd Trifecta

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate      # Linux/Mac
# or: venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Configuration

  1. Copy the example environment file and add your API key:
cp .env.example .env
  2. Edit .env:
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini

The simulation uses this model for all three agents, the oracle (lectures + Q&A), and the evaluator (test scoring). gpt-4o-mini is recommended for cost efficiency.


Running the Simulation

Full 30-Day Run

python main.py --days 30 --seed 42

This runs the complete experiment: 29 days of learning followed by the Day 30 elimination test. Results are exported to analysis/ automatically.

Quick Test (Dry Run)

To verify everything is wired up without spending API credits:

python main.py --days 2 --dry-run --seed 42

This runs 2 simulated days with placeholder LLM responses. No API calls are made. Useful for checking that the pipeline doesn't crash.

Fast Mode

Reduces peer conversation exchanges from 8-12 per pair down to 4-6. Cuts runtime significantly while keeping the structure intact:

python main.py --days 30 --seed 42 --speed fast

Resume from a Specific Day

If the simulation crashes or you interrupt it, resume from the last completed day. Knowledge stores are persisted to disk at the end of every day, so nothing is lost:

python main.py --days 30 --seed 42 --start-day 15

This picks up from Day 15 and runs through Day 30.

Override the Model

Test with a different model without changing .env:

python main.py --days 30 --seed 42 --model-override gpt-4o

Verbose Logging

See every LLM call, every knowledge mutation, every phase transition:

python main.py --days 30 --seed 42 --log-level INFO

Use DEBUG for even more detail (very noisy).

Shell Script

On Linux/Mac, the helper script installs dependencies and runs the simulation:

chmod +x run.sh
./run.sh --days 30 --seed 42

All Command-Line Options

| Flag | Default | Description |
|------|---------|-------------|
| --days N | 30 | Number of days to simulate |
| --start-day N | 1 | Resume from a specific day |
| --seed N | 42 | Random seed for reproducibility |
| --speed fast | normal | Reduces peer exchanges (4-6 instead of 8-12) |
| --model-override MODEL | (from .env) | Override all agent models |
| --dry-run | off | Run without making LLM calls (placeholder responses) |
| --no-export | off | Skip exporting analysis files after simulation |
| --log-level LEVEL | WARNING | Logging verbosity: DEBUG, INFO, WARNING, ERROR |
| --data-dir PATH | ./data | Override data directory |

Before Re-Running

Always clean previous data before starting a fresh simulation. Old data will contaminate new results:

rm -rf data/ analysis/ csv_export/

Ablation Testing

Ablation flags let you disable individual components to measure their contribution. Run the full simulation with one component turned off, then compare results against a full run.

| Flag | What It Disables |
|------|------------------|
| --ablation-no-knowledge | Knowledge injection at test time (agents answer from raw model only) |
| --ablation-no-peers | Peer conversation phase (no inter-agent dialogue) |
| --ablation-no-axioms | Axiom validation pipeline (proposals are never evaluated) |
| --ablation-no-consolidation | Periodic knowledge consolidation (no cluster merging) |
| --ablation-no-sharing | Inter-agent knowledge sharing (agents learn in isolation) |

Example: Measuring the Value of Peer Conversations

# Full run (control)
rm -rf data/ analysis/
python main.py --days 30 --seed 42 --data-dir data_full

# No peer conversations (experiment)
rm -rf data/ analysis/
python main.py --days 30 --seed 42 --ablation-no-peers --data-dir data_no_peers

Then compare data_full/simulation.db and data_no_peers/simulation.db test scores.


Modifying the Experiment

Changing Knowledge Store Limits

Edit the capacity constants at the top of config.py:

# Tiny stores (aggressive overflow, maximum forgetting pressure)
IMPULSE_MAX_ENTRIES = 20
DEEP_MAX_ENTRIES = 50
AXIOM_MAX_ENTRIES = 30

# Large stores (minimal overflow, tests learning quality without forgetting)
IMPULSE_MAX_ENTRIES = 500
DEEP_MAX_ENTRIES = 2000
AXIOM_MAX_ENTRIES = 500

The token-per-entry limits control how much text each knowledge entry can hold:

IMPULSE_MAX_TOKENS = 100    # Word count (not LLM tokens). ~2-3 sentences.
DEEP_MAX_TOKENS = 500       # ~10-15 sentences. Full reasoning chains.
AXIOM_MAX_TOKENS = 250      # ~5-8 sentences.

Changing the Model

Option 1 --- edit .env:

OPENAI_MODEL=gpt-4o

Option 2 --- command line:

python main.py --model-override gpt-4o

Option 3 --- per-agent models (edit config.py to give each agent a different model):

AGENT_CONFIG = {
    "alpha": {"model": "gpt-4o-mini",  "api_key": API_KEY},
    "beta":  {"model": "gpt-4o",       "api_key": API_KEY},
    "gamma": {"model": "gpt-4o",       "api_key": API_KEY},
}

Using a Non-OpenAI Endpoint

Any OpenAI-compatible API works. Change API_BASE in config.py:

# Azure OpenAI
API_BASE = "https://your-resource.openai.azure.com/openai/deployments/your-deployment"

# Local LLM (e.g., Ollama, vLLM, LM Studio)
API_BASE = "http://localhost:11434/v1"

# NVIDIA NIM
API_BASE = "https://integrate.api.nvidia.com/v1"

Changing Agent Personalities

Each agent's behavior is controlled by two things:

  1. System prompt --- the *_SYSTEM_PROMPT string at the top of each agent file (agents/alpha.py, agents/beta.py, agents/gamma.py). This tells the LLM who it is and how to behave. Edit this text to change the agent's personality.

  2. Store logic --- the _default_store_answer() and absorb_lecture() methods in each agent file. These control how the agent decides what to store and where. For example, Alpha's _default_store_answer() asks the LLM to extract quick facts for impulse memory, while Beta's asks for deep reasoning chains.

To create a fundamentally different agent (e.g., a "Skeptic" that doubts everything):

  1. Copy agents/alpha.py to agents/skeptic.py
  2. Rename the class to SkepticAgent
  3. Rewrite the system prompt and storage logic
  4. Register it in main.py and config.py

Adding a Fourth Agent

  1. Create agents/delta.py following the pattern of the other agents (extend BaseAgent, implement build_system_prompt, _default_store_answer, absorb_lecture, manage_knowledge)
  2. Add "delta" to AGENT_CONFIG in config.py
  3. Add "delta" to PEER_PAIRS in config.py (e.g., add ("alpha", "delta"), ("delta", "gamma"))
  4. Import and instantiate DeltaAgent in main.py's agent_classes dict
  5. Create data/delta/ directory (done automatically at runtime)

Changing the Curriculum

The curriculum lives in config.py as the TOPIC_SCHEDULE list. Each entry is a long string describing one day's material with numbered subtopics. Day 30 must be "FINAL_TEST".

To modify a day's topic, edit the corresponding string. To add more days, append entries to the list and increase DEFAULT_DAYS.

Example --- replacing Day 1 with a custom topic:

TOPIC_SCHEDULE = [
    # Day 1 — Your Custom Topic
    "Your Custom Topic\n"
    "1. Subtopic A — detailed description of what to cover\n"
    "2. Subtopic B — detailed description\n"
    "3. Subtopic C — detailed description",

    # Day 2 — keep the rest...
    ...
]

The subtopic format matters: the system splits on numbered lines (1., 2., etc.) to generate individual lectures. Each numbered subtopic becomes a separate oracle lecture call.
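A plausible sketch of that split (the actual parser lives in simulation/curriculum.py and may differ in detail):

```python
import re

def split_subtopics(topic_entry: str) -> tuple[str, list[str]]:
    # First non-empty line is the day's title; lines starting "1.", "2.", ...
    # are subtopics, each of which becomes a separate oracle lecture call
    lines = [ln.strip() for ln in topic_entry.splitlines() if ln.strip()]
    title = lines[0]
    subtopics = [re.sub(r"^\d+\.\s*", "", ln)
                 for ln in lines[1:] if re.match(r"^\d+\.", ln)]
    return title, subtopics
```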

Changing the Elimination Threshold

Edit PASS_THRESHOLD in config.py. Default is 0.6 (60%):

PASS_THRESHOLD = 0.7   # Harder: 70% to survive
PASS_THRESHOLD = 0.5   # Easier: 50% to survive

Changing How Many Questions Per Day / Peer Exchanges

LEARNING_QUESTIONS_PER_AGENT = 5    # Follow-up questions each agent asks per day
PEER_EXCHANGE_RANGE = (8, 12)       # Min/max messages per peer conversation pair
PEER_EXCHANGE_RANGE_FAST = (4, 6)   # Same, in --speed fast mode

Changing the Knowledge Injection Budget

These control how much stored knowledge is injected into prompts:

DAILY_KNOWLEDGE_BUDGET_WORDS = 2500   # During learning phases (~3K LLM tokens)
EXAM_KNOWLEDGE_BUDGET_WORDS = 4000    # During the Day 30 test (~5K LLM tokens)

If using a model with a large context window, you can increase these. If using a small-context model, decrease them to avoid truncation.
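One way such a word budget might be enforced is greedy packing, shown below. The function name and the assumption that entries arrive pre-sorted by relevance are illustrative, not the project's actual implementation:

```python
def pack_knowledge(entries: list[str], budget_words: int) -> list[str]:
    # Include entries (assumed pre-sorted by relevance) until the budget is spent
    packed, used = [], 0
    for text in entries:
        words = len(text.split())  # word-count tokenization, as elsewhere in the project
        if used + words > budget_words:
            break
        packed.append(text)
        used += words
    return packed
```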

Changing the Scoring Rubric

The scoring prompt lives in simulation/evaluator.py in the score_answer() method. Edit the system prompt string to change how strictly or leniently the evaluator grades:

system = (
    "You are a STRICT academic exam grader. Grade the answer 0-10 using this rubric:\n"
    "  1-2: Wrong, irrelevant, or nonsensical answer\n"
    ...
)

Changing Rate Limits

If you hit API rate limits frequently, increase the throttle:

MIN_CALL_INTERVAL_SECONDS = 1.0   # Seconds between calls (per model)
RETRY_BASE_DELAY = 3              # Base delay on retry (multiplied by attempt number)
RETRY_MAX_DELAY = 60              # Max delay cap
MAX_RETRIES = 999                 # Effectively unlimited

Changing the Eviction Strategy

The eviction algorithm lives in knowledge/store.py in the _utility_score() method:

def _utility_score(self, entry: dict, day: int) -> float:
    access = entry.get("access_count", 0)
    confidence = entry.get("confidence", 0.5)
    age = max(1, day - entry.get("created_day", 0))
    recency = 1.0 / age
    return (access * 0.4) + (confidence * 0.3) + (recency * 0.3)

The entry with the lowest utility score gets evicted. Adjust the weights to change what gets forgotten first:

  • Increase the access weight to protect frequently-accessed entries
  • Increase the recency weight to protect newer entries
  • Increase the confidence weight to protect high-confidence entries

Changing the Deduplication Threshold

In knowledge/store.py:

DEDUP_THRESHOLD = 0.7   # Jaccard similarity >= this = duplicate (raise to allow more similar entries)

Analyzing Results

Output Files

After a simulation run, these files contain the results:

| File | Contents |
|------|----------|
| analysis/survival_report.md | Final verdict: who survived, who was eliminated, scores by question type |
| analysis/summary_stats.json | Per-agent token usage, store sizes, overflow counts, growth curves over 30 days |
| analysis/knowledge_flow.json | Every knowledge mutation as a flow event (useful for Sankey diagrams) |
| data/simulation.db | Complete SQLite database with everything logged |
| data/{agent}/impulse.json | Final state of each agent's impulse store |
| data/{agent}/deep_thinking.json | Final state of each agent's deep thinking store |
| data/{agent}/axiom.json | Final state of each agent's axiom store |

Querying the Database

The SQLite database at data/simulation.db contains six tables. You can query it directly:

# How many LLM calls were made?
sqlite3 data/simulation.db "SELECT COUNT(*) FROM interactions;"

# Token usage per agent
sqlite3 data/simulation.db "
  SELECT agent, SUM(tokens_in) as input, SUM(tokens_out) as output
  FROM interactions GROUP BY agent;
"

# Test scores per agent
sqlite3 data/simulation.db "
  SELECT agent, COUNT(*) as questions, SUM(score) as total,
         ROUND(SUM(score) / (COUNT(*) * 10.0) * 100, 1) as pct
  FROM test_results GROUP BY agent;
"

# Test scores by question type
sqlite3 data/simulation.db "
  SELECT agent, question_type, ROUND(AVG(score), 2) as avg_score
  FROM test_results GROUP BY agent, question_type;
"

# How many overflow events per agent per store?
sqlite3 data/simulation.db "
  SELECT agent, store_type, COUNT(*) as deletions
  FROM overflow_events GROUP BY agent, store_type ORDER BY deletions DESC;
"

# Axiom acceptance rate
sqlite3 data/simulation.db "
  SELECT mutation_type, COUNT(*) as count
  FROM knowledge_mutations
  WHERE mutation_type IN ('axiom_accepted', 'axiom_rejected')
  GROUP BY mutation_type;
"

# Knowledge growth: daily store sizes
sqlite3 data/simulation.db "
  SELECT day, agent, store_type, entry_count
  FROM snapshots ORDER BY day, agent, store_type;
"

# Peer conversation exchange counts per day
sqlite3 data/simulation.db "
  SELECT day, agent_a, agent_b, num_exchanges
  FROM conversations ORDER BY day;
"

# What knowledge was lost to overflow on a specific day?
sqlite3 data/simulation.db "
  SELECT agent, store_type, deleted_content_preview
  FROM overflow_events WHERE day = 10;
"

Reading Knowledge Stores

The JSON knowledge stores are human-readable:

# How many entries does Alpha have in impulse?
python -c "import json; d=json.load(open('data/alpha/impulse.json')); print(len(d))"

# Print Alpha's axioms
python -c "
import json
axioms = json.load(open('data/alpha/axiom.json'))
for a in axioms:
    print(f\"Day {a['created_day']}: {a['content'][:100]}\")
"

Comparing Two Runs

Run the simulation twice with different settings, using --data-dir to keep them separate:

rm -rf data_run1 data_run2

python main.py --days 30 --seed 42 --data-dir data_run1
python main.py --days 30 --seed 42 --ablation-no-peers --data-dir data_run2

Then compare test results:

echo "=== Run 1 (full) ==="
sqlite3 data_run1/simulation.db "SELECT agent, SUM(score), ROUND(SUM(score)/(COUNT(*)*10.0)*100,1) FROM test_results GROUP BY agent;"

echo "=== Run 2 (no peers) ==="
sqlite3 data_run2/simulation.db "SELECT agent, SUM(score), ROUND(SUM(score)/(COUNT(*)*10.0)*100,1) FROM test_results GROUP BY agent;"

Database Schema

All events are logged to data/simulation.db (SQLite):

interactions

Every LLM call made during the simulation.

| Column | Type | Description |
|--------|------|-------------|
| day | INTEGER | Simulation day (1-30) |
| phase | TEXT | WAKE, TEACHING, LEARNING, PEER_CONVERSATION, etc. |
| agent | TEXT | alpha, beta, gamma, or oracle |
| action | TEXT | What the call was for (e.g., ask_q3, reply_beta, lecture) |
| prompt_preview | TEXT | First 500 chars of the system prompt |
| response_preview | TEXT | First 500 chars of the LLM response |
| tokens_in | INTEGER | Prompt tokens consumed |
| tokens_out | INTEGER | Completion tokens generated |
| latency_ms | INTEGER | Response time in milliseconds |
| model | TEXT | Model used for this call |

knowledge_mutations

Every knowledge store change.

| Column | Type | Description |
|--------|------|-------------|
| day | INTEGER | When the mutation occurred |
| agent | TEXT | Which agent's store changed |
| store_type | TEXT | impulse, deep_thinking, or axiom |
| mutation_type | TEXT | add, discard, promote, axiom_accepted, axiom_rejected |
| entry_id | TEXT | UUID of the affected entry |
| content_preview | TEXT | First 200 chars of the entry content |

conversations

Full peer conversation transcripts.

| Column | Type | Description |
|--------|------|-------------|
| day | INTEGER | Simulation day |
| agent_a / agent_b | TEXT | The two agents in the conversation |
| topic | TEXT | The day's curriculum topic |
| transcript_json | TEXT | Full conversation as JSON array |
| num_exchanges | INTEGER | Number of back-and-forth messages |

snapshots

Daily knowledge store snapshots.

| Column | Type | Description |
|--------|------|-------------|
| day | INTEGER | Simulation day |
| agent | TEXT | Agent name |
| store_type | TEXT | impulse, deep_thinking, or axiom |
| entry_count | INTEGER | Number of entries at end of day |
| entries_json | TEXT | Full dump of all entries as JSON |

test_results

Day 30 test answers and scores.

| Column | Type | Description |
|--------|------|-------------|
| agent | TEXT | alpha, beta, gamma, or solo_baseline |
| question_number | INTEGER | Question number (1-30) |
| question_type | TEXT | impulse, deep, or axiom |
| question | TEXT | The full question text |
| answer | TEXT | The agent's full answer |
| score | REAL | Score from evaluator (0-10) |
| score_reasoning | TEXT | Evaluator's explanation for the score |

overflow_events

Every silent knowledge deletion due to store overflow.

| Column | Type | Description |
|--------|------|-------------|
| day | INTEGER | When the deletion happened |
| agent | TEXT | Which agent lost knowledge |
| store_type | TEXT | Which store overflowed |
| deleted_entry_id | TEXT | UUID of the deleted entry |
| deleted_content_preview | TEXT | First 200 chars of what was lost |

What Success Looks Like

The experiment succeeds if:

  1. Agents with knowledge outscore the solo baseline --- proving the 29-day learning process added value beyond raw model capabilities
  2. Different cognitive strategies produce measurably different results --- proving the personality architecture matters
  3. Overflow events create genuine knowledge management pressure --- proving the constrained stores force meaningful trade-offs
  4. Axiom validation filters out low-quality content --- proving Gamma's gatekeeping adds quality control
  5. Peer conversations transfer useful knowledge --- proving multi-agent interaction beats isolated learning

The experiment also produces a rich dataset for analysis: token usage patterns, knowledge growth curves, overflow rates, axiom acceptance rates, conversation transcripts, and per-question scoring breakdowns.


Cost Estimate

With gpt-4o-mini at ~$0.15/1M input tokens and ~$0.60/1M output tokens:

  • A full 30-day run generates approximately 10-13M tokens
  • Estimated cost: $3-8 USD depending on conversation length and retries
  • Runtime: 2-5 hours (single-threaded, rate-limited)

With --speed fast: roughly half the runtime, similar cost (peer conversations are a small fraction of total tokens).

Using gpt-4o or other frontier models will cost 10-50x more but may produce higher-quality learning and more discriminating test scores.


Design Decisions

Why Custom TF-IDF (No scikit-learn)

The knowledge stores use a hand-rolled TF-IDF cosine similarity implementation (~50 lines in knowledge/store.py). This avoids adding numpy/scipy as dependencies and is adequate for the store sizes used.

Why Word-Count Tokenization

Token limits are approximated using len(content.split()). This is intentionally simple and model-agnostic. What matters is consistent enforcement of relative limits (impulse < axiom < deep), not exact token counts.

Why Synchronous Single-Threaded

The entire simulation runs synchronously to ensure deterministic phase ordering, reproducible results given the same seed, and no race conditions on knowledge stores. The tradeoff is slower execution.

Why No Meta-Agent Controller

Alpha, Beta, and Gamma are true peers. The SimulationEnvironment class orchestrates phase timing but never reads or interprets agent outputs. It simply routes messages.

Why Silent Deletion

When a store overflows, entries are deleted without notification. The agent discovers knowledge loss organically through failed recall. This is a core simulation mechanic, not a bug.

Reproducibility Caveats

The random seed controls structural decisions (which entries to evict, how many peer exchanges, which topics to sample for the test). LLM outputs are not seeded --- two runs with the same seed will have the same structure but different LLM-generated content.


Troubleshooting

"No API key configured"

Make sure .env exists in the project root with OPENAI_API_KEY=sk-.... The key is loaded at import time by config.py.

Rate limit errors

The simulation retries automatically with exponential backoff (up to 60s between retries, effectively unlimited attempts). If you're hitting limits constantly, increase MIN_CALL_INTERVAL_SECONDS in config.py.

Simulation crashed mid-run

Knowledge stores are saved to disk at the end of every day. Use --start-day N to resume from the last completed day. The database will contain data from both the original and resumed runs.

Results look wrong after re-running

Old data in data/simulation.db accumulates across runs. Always rm -rf data/ analysis/ before a fresh run.

Memory usage growing

The simulation keeps all knowledge stores in memory. With the default limits (50/200/100 entries), this is negligible. If you increase limits to thousands of entries, memory usage will grow proportionally.


License

This project is an experimental research tool. Use it however you want.