A 30-day AI agent experiment that tests whether three LLM agents with distinct cognitive personalities can learn, share knowledge, and outperform a single model answering cold.
Three agents --- Alpha, Beta, and Gamma --- are given a 29-day university-grade curriculum spanning mathematics, physics, chemistry, biology, computer science, philosophy, economics, and more. They learn from lectures, ask follow-up questions, have peer conversations, share knowledge, and manage their memory stores. On Day 30, they face a 30-question elimination test. Score below 60% and you are permanently eliminated.
A solo baseline (same model, no knowledge stores, no persona) answers the same questions and serves as the control group. The experiment measures whether 29 days of structured learning adds anything over the model's raw training knowledge.
LLM agents with specialized cognitive roles and constrained memory stores, forced to learn collaboratively under elimination pressure, will develop richer knowledge representations than a single model answering from its training data alone.
The experiment also tests:
- Whether different knowledge storage strategies (fast recall vs deep reasoning vs axiomatic truths) produce different outcomes
- Whether peer conversation and knowledge sharing between agents improves learning
- Whether memory overflow and forced forgetting create interesting knowledge management behaviors
- Whether a structured multi-agent system adds measurable value over a monolithic model
Fast, instinctive, pattern-matching. Stores almost everything as impulse memory (quick-recall facts). Only goes deep when something is genuinely paradigm-shifting. Alpha fires first, remembers broadly, and relies on gut-level pattern recognition.
- Primary store: impulse
- Personality file: agents/alpha.py
- System prompt: ALPHA_SYSTEM_PROMPT at the top of that file
Methodical, analytical, thorough. Stores almost everything in deep thinking memory with reasoning chains, confidence scores, and cross-domain connections. Beta doesn't memorize --- it understands. Slower but deeper.
- Primary store: deep_thinking
- Personality file: agents/beta.py
- System prompt: BETA_SYSTEM_PROMPT at the top of that file
Rigorous, principled, conservative. Evaluates every piece of knowledge for axiom-worthiness: is this a universal truth, always true, everywhere, without exception? If yes, it becomes an axiom. If not, it gets categorized as deep or impulse. Gamma also serves as the final validator for all axiom proposals from the other agents --- even its own candidates face the same scrutiny.
- Primary store: axiom
- Personality file: agents/gamma.py
- System prompt: GAMMA_SYSTEM_PROMPT at the top of that file
- Axiom validation pipeline: the validate_axiom() method in agents/gamma.py
Each agent maintains three separate knowledge stores:
| Store | Max Entries | Max Tokens/Entry | Purpose |
|---|---|---|---|
| Impulse | 50 | 100 | Quick-recall facts, definitions, constants |
| Deep Thinking | 200 | 500 | Reasoning chains, multi-step logic, cross-domain connections |
| Axiom | 100 | 250 | Universal truths, foundational principles, validated theorems |
When a store reaches capacity, the entry with the lowest utility score (combination of access frequency, confidence, and recency) is silently deleted. The agent receives no notification --- it simply won't find that knowledge on next recall. This simulates how memories degrade under information overload.
Every deletion is logged to the overflow_events database table for post-hoc analysis.
Before adding any entry, the system computes a normalized fingerprint (sorted unique tokens) and checks Jaccard similarity against all existing entries. Entries with similarity >= 0.7 are rejected as duplicates.
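The fingerprint-and-Jaccard check can be sketched as follows (an illustration of the mechanism described above, not the exact knowledge/store.py code):

```python
def fingerprint(text: str) -> frozenset:
    # Normalized fingerprint: the set of unique lowercase tokens
    return frozenset(text.lower().split())

def jaccard(a: frozenset, b: frozenset) -> float:
    # Jaccard similarity: |intersection| / |union|
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_duplicate(new_text: str, existing: list[str], threshold: float = 0.7) -> bool:
    # Reject the new entry if it is >= threshold similar to any existing entry
    fp = fingerprint(new_text)
    return any(jaccard(fp, fingerprint(e)) >= threshold for e in existing)
```

Because the fingerprint discards word order and duplicates, a rephrased near-copy of an existing fact is rejected even when the wording differs slightly.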
Retrieval uses a custom TF-IDF implementation with unigram + bigram matching and cosine similarity. No external ML libraries required.
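A self-contained sketch of that retrieval scheme (illustrative; the actual implementation in knowledge/store.py may normalize or weight terms differently):

```python
import math
from collections import Counter

def terms(text: str) -> list[str]:
    # Unigrams plus adjacent-word bigrams
    words = text.lower().split()
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    n = len(docs)
    doc_terms = [Counter(terms(d)) for d in docs]
    # Document frequency per term, with smoothed IDF to avoid division by zero
    df = Counter(t for dt in doc_terms for t in dt)
    def weight(tf: Counter) -> Counter:
        return Counter({t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()})
    qv = weight(Counter(terms(query)))
    ranked = sorted(docs, key=lambda d: -cosine(qv, weight(Counter(terms(d)))))
    return ranked[:k]
```

No numpy or scipy required; for stores capped at a few hundred entries, plain dicts are fast enough.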
Each of the 29 learning days follows this phase cycle:
WAKE --> TEACHING --> LEARNING --> PEER CONVERSATION --> KNOWLEDGE SHARING --> KNOWLEDGE MANAGEMENT --> SLEEP
System briefing with elimination countdown. Progressive urgency increases as Day 30 approaches. Agents receive adaptive study guidance based on their weak topics (meta-learning).
An oracle LLM generates comprehensive lectures for the day's curriculum topic, split into subtopics. All agents absorb each lecture according to their cognitive personality --- Alpha stores quick facts, Beta builds reasoning chains, Gamma evaluates for axiom-worthiness.
Each agent asks 5 follow-up questions targeting gaps in their knowledge. The oracle answers each question, and agents store the answers per their persona.
Agents pair up (Alpha-Beta, Beta-Gamma, Alpha-Gamma) for 8-12 message exchanges per pair. They share insights, challenge each other's understanding, and learn from opposing cognitive perspectives.
Smart inter-agent knowledge transfer. The system checks fingerprint similarity before sharing --- only genuinely new knowledge gets transferred. High-access impulse entries get promoted to deep thinking. High-confidence deep entries become axiom candidates.
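The promotion rules described above can be sketched as follows; the thresholds and field names here are hypothetical placeholders, not the actual values in the codebase:

```python
def classify_transfer(entry: dict, access_threshold: int = 5, conf_threshold: float = 0.9) -> str:
    """Illustrative promotion rules: high-access impulse entries move to
    deep thinking; high-confidence deep entries become axiom candidates.
    Thresholds and keys are assumptions, not the real config values."""
    store = entry["store_type"]
    if store == "impulse" and entry.get("access_count", 0) >= access_threshold:
        return "promote_to_deep"
    if store == "deep_thinking" and entry.get("confidence", 0.0) >= conf_threshold:
        return "axiom_candidate"
    return "keep"
```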
Agents review and manage their stores. Axiom proposals from all agents are collected and sent through Gamma's validation pipeline:
- The candidate axiom is evaluated for universality
- Gamma checks for conflicts with existing axioms
- On conflict, Alpha and Beta are consulted
- Gamma makes the final ruling (accept/reject)
All axioms --- including Gamma's own proposals --- must pass this pipeline.
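A minimal sketch of that ruling flow (the callables stand in for LLM-backed checks; the signature is illustrative, not the actual validate_axiom() API in agents/gamma.py):

```python
def validate_axiom(candidate: str, existing_axioms: list[str],
                   is_universal, conflicts_with, consult_peers) -> bool:
    """Four-step ruling: universality check, conflict scan, peer
    consultation on conflict, final accept/reject."""
    # Step 1: is this a universal truth, always true, without exception?
    if not is_universal(candidate):
        return False
    # Step 2: scan for conflicts with existing axioms
    conflicting = [a for a in existing_axioms if conflicts_with(candidate, a)]
    # Steps 3-4: on conflict, consult Alpha and Beta, then make the final ruling
    if conflicting and not consult_peers(candidate, conflicting):
        return False
    return True
```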
Conversation histories are cleared (simulating sleep). Knowledge is consolidated every 5 days (clusters of 3+ similar entries are merged into distilled summaries). All stores are persisted to disk and snapshotted to SQLite.
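The every-5-days consolidation can be sketched as greedy similarity clustering; the 0.5 threshold and the string-join "summary" below are placeholders for the actual LLM-driven distillation:

```python
def consolidate(entries: list[str], min_cluster: int = 3, threshold: float = 0.5) -> list[str]:
    """Group similar entries; clusters of min_cluster+ are merged into one
    distilled entry (joined here as a stand-in for an LLM summary)."""
    def sim(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0
    clusters: list[list[str]] = []
    for e in entries:
        for c in clusters:
            if sim(e, c[0]) >= threshold:
                c.append(e)
                break
        else:
            clusters.append([e])
    result: list[str] = []
    for c in clusters:
        if len(c) >= min_cluster:
            result.append(" | ".join(c))  # stand-in for a distilled summary
        else:
            result.extend(c)
    return result
```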
On Day 30, the simulation generates 30 test questions from the 29-day curriculum:
- 10 impulse questions: Quick-recall factual questions (1-2 sentence answers)
- 10 deep questions: Multi-step analytical questions requiring reasoning across concepts
- 10 axiom questions: True/false claims about fundamental principles, requiring justification
Each agent answers all 30 questions individually using their accumulated knowledge stores.
A solo baseline answers the same 30 questions using the same underlying model but with no knowledge stores and no agent persona. This is the control group.
An evaluator LLM grades each answer on a 0-10 scale using a strict rubric:
| Score | Meaning |
|---|---|
| 1-2 | Wrong, irrelevant, or nonsensical |
| 3-4 | Partially correct but major errors or critical omissions |
| 5-6 | Correct core idea but shallow, vague, or missing important details |
| 7-8 | Mostly correct and well-reasoned, minor gaps |
| 9 | Excellent --- accurate, thorough, and well-structured |
| 10 | Perfect --- flawless, comprehensive, demonstrates deep mastery |
Agents scoring below 60% overall (fewer than 180 of the 300 possible points across 30 questions) are permanently eliminated.
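The survival arithmetic is simple: 30 questions graded out of 10 points each gives 300 possible points, so the 60% threshold means 180 points. As a one-function sketch:

```python
def survives(scores: list[float], pass_threshold: float = 0.6, max_per_q: float = 10.0) -> bool:
    # Overall percentage across all questions, each graded out of max_per_q
    pct = sum(scores) / (len(scores) * max_per_q)
    return pct >= pass_threshold
```

An agent averaging 6/10 across all 30 questions survives exactly on the line.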
The curriculum covers university-grade material across all major academic domains:
| Days | Domain | Topics |
|---|---|---|
| 1-3 | Pure Mathematics | Foundations/Logic/Number Theory, Algebra/Linear Algebra, Analysis/Topology |
| 4-6 | Physics | Classical Mechanics/Thermo, Electromagnetism/Relativity, Quantum/Particle |
| 7 | Chemistry | Atomic Structure through Biochemistry |
| 8-9 | Biology | Molecular Bio/Genetics/Evolution, Physiology/Ecology/Earth Sciences |
| 10-11 | Computer Science | Algorithms/Complexity, Systems/Networking/Databases |
| 12 | Information Theory | Probability, Statistics, ML Theory |
| 13-14 | Philosophy | Ancient through Modern, Analytic/Continental/Contemporary |
| 15 | Ethics & Law | Metaethics, Normative Ethics, Political Philosophy, Jurisprudence |
| 16 | Philosophy of Mind | Consciousness, Cognitive Science, Free Will |
| 17-18 | Economics | Micro/Macro, Game Theory, Behavioral Economics |
| 19-20 | Psychology & Linguistics | Cognitive/Social Psychology, Syntax/Semantics/Pragmatics |
| 21-22 | History & Religion | World History, Comparative Religion |
| 23-24 | Arts & Media | Music Theory/Visual Arts, Film/Media/Digital Culture |
| 25-26 | Advanced STEM | Causal Inference, Unsolved Problems (P=NP, Riemann, etc.) |
| 27 | Interdisciplinary | Complex Systems, Network Science, Chaos Theory |
| 28 | Applied | Engineering, Medicine, Agriculture, Urban Planning |
| 29 | Meta & Integration | Epistemology, Research Methods, Cross-Domain Synthesis |
| 30 | FINAL TEST | 30 questions across all domains |
Each day's topic contains 6-7 detailed subtopics, each generating a comprehensive lecture.
Trifecta/
  main.py              # Entry point --- run the full simulation
  config.py            # All configuration: models, limits, curriculum, parameters
  requirements.txt     # Python dependencies
  run.sh               # Shell helper script
  .env.example         # Template for API keys
  .gitignore
  agents/
    base_agent.py      # Abstract base: LLM calls, knowledge injection, test answers
    alpha.py           # Alpha: impulsive, pattern-matching
    beta.py            # Beta: deep analytical thinker
    gamma.py           # Gamma: axiom guardian and validator
  knowledge/
    store.py           # KnowledgeStore: add, evict, retrieve (TF-IDF), deduplicate
  simulation/
    environment.py     # Day loop orchestrator: phases, agent coordination
    question_oracle.py # Oracle LLM: generates lectures and answers questions
    communication.py   # Peer conversation bus: agent-to-agent exchanges
    curriculum.py      # Topic parser: splits curriculum entries into subtopics
    curriculum_test.py # Day 30 test: generates questions, tests agents, baseline
    evaluator.py       # Scoring LLM: generates questions, grades answers 1-10
  sim_logging/
    db.py              # SQLite logger: interactions, mutations, conversations, snapshots
    export.py          # Post-sim export: summary stats, knowledge flow, survival report
  data/                # Generated at runtime (gitignored)
    alpha/             # impulse.json, deep_thinking.json, axiom.json
    beta/              # impulse.json, deep_thinking.json, axiom.json
    gamma/             # impulse.json, deep_thinking.json, axiom.json
    simulation.db      # SQLite database with all logged events
  analysis/            # Generated after simulation (gitignored)
    summary_stats.json # Per-agent token usage, store sizes, growth curves
    knowledge_flow.json # Knowledge mutation flow data
    survival_report.md # Final test results and survival verdicts
- Python 3.10+
- An OpenAI API key (or any OpenAI-compatible API endpoint)
git clone <repo-url>
cd Trifecta
# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

Copy the example environment file and add your API key:

cp .env.example .env

Edit .env:
OPENAI_API_KEY=sk-your-key-here
OPENAI_MODEL=gpt-4o-mini
The simulation uses this model for all three agents, the oracle (lectures + Q&A), and the evaluator (test scoring). gpt-4o-mini is recommended for cost efficiency.
python main.py --days 30 --seed 42

This runs the complete experiment: 29 days of learning followed by the Day 30 elimination test. Results are exported to analysis/ automatically.
To verify everything is wired up without spending API credits:
python main.py --days 2 --dry-run --seed 42

This runs 2 simulated days with placeholder LLM responses. No API calls are made. Useful for checking that the pipeline doesn't crash.
Reduces peer conversation exchanges from 8-12 per pair down to 4-6. Cuts runtime significantly while keeping the structure intact:
python main.py --days 30 --seed 42 --speed fast

If the simulation crashes or you interrupt it, resume from the last completed day. Knowledge stores are persisted to disk at the end of every day, so nothing is lost:
python main.py --days 30 --seed 42 --start-day 15

This picks up from Day 15 and runs through Day 30.
Test with a different model without changing .env:
python main.py --days 30 --seed 42 --model-override gpt-4o

See every LLM call, every knowledge mutation, every phase transition:
python main.py --days 30 --seed 42 --log-level INFO

Use DEBUG for even more detail (very noisy).
On Linux/Mac, the helper script installs dependencies and runs the simulation:
chmod +x run.sh
./run.sh --days 30 --seed 42

| Flag | Default | Description |
|---|---|---|
| --days N | 30 | Number of days to simulate |
| --start-day N | 1 | Resume from a specific day |
| --seed N | 42 | Random seed for reproducibility |
| --speed fast | normal | Reduces peer exchanges (4-6 instead of 8-12) |
| --model-override MODEL | (from .env) | Override all agent models |
| --dry-run | off | Run without making LLM calls (placeholder responses) |
| --no-export | off | Skip exporting analysis files after simulation |
| --log-level LEVEL | WARNING | Logging verbosity: DEBUG, INFO, WARNING, ERROR |
| --data-dir PATH | ./data | Override data directory |
Always clean previous data before starting a fresh simulation. Old data will contaminate new results:
rm -rf data/ analysis/ csv_export/

Ablation flags let you disable individual components to measure their contribution. Run the full simulation with one component turned off, then compare results against a full run.
| Flag | What It Disables |
|---|---|
| --ablation-no-knowledge | Knowledge injection at test time (agents answer from raw model only) |
| --ablation-no-peers | Peer conversation phase (no inter-agent dialogue) |
| --ablation-no-axioms | Axiom validation pipeline (proposals are never evaluated) |
| --ablation-no-consolidation | Periodic knowledge consolidation (no cluster merging) |
| --ablation-no-sharing | Inter-agent knowledge sharing (agents learn in isolation) |
# Full run (control)
rm -rf data/ analysis/
python main.py --days 30 --seed 42 --data-dir data_full
# No peer conversations (experiment)
rm -rf data/ analysis/
python main.py --days 30 --seed 42 --ablation-no-peers --data-dir data_no_peers

Then compare data_full/simulation.db and data_no_peers/simulation.db test scores.
Edit the capacity constants at the top of config.py:
# Tiny stores (aggressive overflow, maximum forgetting pressure)
IMPULSE_MAX_ENTRIES = 20
DEEP_MAX_ENTRIES = 50
AXIOM_MAX_ENTRIES = 30
# Large stores (minimal overflow, tests learning quality without forgetting)
IMPULSE_MAX_ENTRIES = 500
DEEP_MAX_ENTRIES = 2000
AXIOM_MAX_ENTRIES = 500

The token-per-entry limits control how much text each knowledge entry can hold:
IMPULSE_MAX_TOKENS = 100 # Word count (not LLM tokens). ~2-3 sentences.
DEEP_MAX_TOKENS = 500 # ~10-15 sentences. Full reasoning chains.
AXIOM_MAX_TOKENS = 250   # ~5-8 sentences.

Option 1 --- edit .env:
OPENAI_MODEL=gpt-4o
Option 2 --- command line:
python main.py --model-override gpt-4o

Option 3 --- per-agent models (edit config.py to give each agent a different model):
AGENT_CONFIG = {
    "alpha": {"model": "gpt-4o-mini", "api_key": API_KEY},
    "beta": {"model": "gpt-4o", "api_key": API_KEY},
    "gamma": {"model": "gpt-4o", "api_key": API_KEY},
}

Any OpenAI-compatible API works. Change API_BASE in config.py:
# Azure OpenAI
API_BASE = "https://your-resource.openai.azure.com/openai/deployments/your-deployment"
# Local LLM (e.g., Ollama, vLLM, LM Studio)
API_BASE = "http://localhost:11434/v1"
# NVIDIA NIM
API_BASE = "https://integrate.api.nvidia.com/v1"

Each agent's behavior is controlled by two things:
- System prompt --- the *_SYSTEM_PROMPT string at the top of each agent file (agents/alpha.py, agents/beta.py, agents/gamma.py). This tells the LLM who it is and how to behave. Edit this text to change the agent's personality.
- Store logic --- the _default_store_answer() and absorb_lecture() methods in each agent file. These control how the agent decides what to store and where. For example, Alpha's _default_store_answer() asks the LLM to extract quick facts for impulse memory, while Beta's asks for deep reasoning chains.
To create a fundamentally different agent (e.g., a "Skeptic" that doubts everything):
- Copy agents/alpha.py to agents/skeptic.py
- Rename the class to SkepticAgent
- Rewrite the system prompt and storage logic
- Register it in main.py and config.py
- Create agents/delta.py following the pattern of the other agents (extend BaseAgent; implement build_system_prompt, _default_store_answer, absorb_lecture, manage_knowledge)
- Add "delta" to AGENT_CONFIG in config.py
- Add "delta" to PEER_PAIRS in config.py (e.g., add ("alpha", "delta") and ("delta", "gamma"))
- Import and instantiate DeltaAgent in main.py's agent_classes dict
- Create the data/delta/ directory (done automatically at runtime)
The curriculum lives in config.py as the TOPIC_SCHEDULE list. Each entry is a long string describing one day's material with numbered subtopics. Day 30 must be "FINAL_TEST".
To modify a day's topic, edit the corresponding string. To add more days, append entries to the list and increase DEFAULT_DAYS.
Example --- replacing Day 1 with a custom topic:
TOPIC_SCHEDULE = [
    # Day 1 — Your Custom Topic
    "Your Custom Topic\n"
    "1. Subtopic A — detailed description of what to cover\n"
    "2. Subtopic B — detailed description\n"
    "3. Subtopic C — detailed description",
    # Day 2 — keep the rest...
    ...
]

The subtopic format matters: the system splits on numbered lines (1., 2., etc.) to generate individual lectures. Each numbered subtopic becomes a separate oracle lecture call.
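That splitting can be sketched with a small regex (illustrative; the real parser lives in simulation/curriculum.py and may differ):

```python
import re

def split_subtopics(topic_entry: str) -> tuple[str, list[str]]:
    """Split a TOPIC_SCHEDULE entry into (title, subtopics).
    Lines starting with '1.', '2.', etc. become individual subtopics."""
    lines = topic_entry.splitlines()
    title = lines[0].strip()
    subtopics = [re.sub(r"^\s*\d+\.\s*", "", line).strip()
                 for line in lines[1:] if re.match(r"^\s*\d+\.", line)]
    return title, subtopics
```

Each returned subtopic would drive one oracle lecture call, so a day with 6 numbered lines generates 6 lectures.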
Edit PASS_THRESHOLD in config.py. Default is 0.6 (60%):
PASS_THRESHOLD = 0.7 # Harder: 70% to survive
PASS_THRESHOLD = 0.5  # Easier: 50% to survive

LEARNING_QUESTIONS_PER_AGENT = 5  # Follow-up questions each agent asks per day
PEER_EXCHANGE_RANGE = (8, 12) # Min/max messages per peer conversation pair
PEER_EXCHANGE_RANGE_FAST = (4, 6)  # Same, in --speed fast mode

These control how much stored knowledge is injected into prompts:
DAILY_KNOWLEDGE_BUDGET_WORDS = 2500 # During learning phases (~3K LLM tokens)
EXAM_KNOWLEDGE_BUDGET_WORDS = 4000   # During the Day 30 test (~5K LLM tokens)

If using a model with a large context window, you can increase these. If using a small-context model, decrease them to avoid truncation.
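One plausible way such a word budget is enforced (illustrative only; the utility field and greedy strategy are assumptions, not the actual injection code):

```python
def select_within_budget(entries: list[dict], budget_words: int) -> list[str]:
    """Greedily pick the highest-utility entries that fit the word budget.
    'utility' is a hypothetical per-entry score; oversized entries are skipped."""
    chosen, used = [], 0
    for e in sorted(entries, key=lambda e: -e.get("utility", 0.0)):
        words = len(e["content"].split())
        if used + words > budget_words:
            continue
        chosen.append(e["content"])
        used += words
    return chosen
```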
The scoring prompt lives in simulation/evaluator.py in the score_answer() method. Edit the system prompt string to change how strictly or leniently the evaluator grades:
system = (
    "You are a STRICT academic exam grader. Grade the answer 0-10 using this rubric:\n"
    " 1-2: Wrong, irrelevant, or nonsensical answer\n"
    ...
)
MIN_CALL_INTERVAL_SECONDS = 1.0 # Seconds between calls (per model)
RETRY_BASE_DELAY = 3 # Base delay on retry (multiplied by attempt number)
RETRY_MAX_DELAY = 60 # Max delay cap
MAX_RETRIES = 999                # Effectively unlimited

The eviction algorithm lives in knowledge/store.py in the _utility_score() method:
def _utility_score(self, entry: dict, day: int) -> float:
    access = entry.get("access_count", 0)
    confidence = entry.get("confidence", 0.5)
    age = max(1, day - entry.get("created_day", 0))
    recency = 1.0 / age
    return (access * 0.4) + (confidence * 0.3) + (recency * 0.3)

The entry with the lowest utility score gets evicted. Adjust the weights to change what gets forgotten first:
- Increase the access weight to protect frequently-accessed entries
- Increase the recency weight to protect newer entries
- Increase the confidence weight to protect high-confidence entries
In knowledge/store.py:
DEDUP_THRESHOLD = 0.7  # Jaccard similarity >= this counts as a duplicate (raise it to let more near-duplicate entries coexist)

After a simulation run, these files contain the results:
| File | Contents |
|---|---|
| analysis/survival_report.md | Final verdict: who survived, who was eliminated, scores by question type |
| analysis/summary_stats.json | Per-agent token usage, store sizes, overflow counts, growth curves over 30 days |
| analysis/knowledge_flow.json | Every knowledge mutation as a flow event (useful for Sankey diagrams) |
| data/simulation.db | Complete SQLite database with everything logged |
| data/{agent}/impulse.json | Final state of each agent's impulse store |
| data/{agent}/deep_thinking.json | Final state of each agent's deep thinking store |
| data/{agent}/axiom.json | Final state of each agent's axiom store |
The SQLite database at data/simulation.db contains six tables. You can query it directly:
# How many LLM calls were made?
sqlite3 data/simulation.db "SELECT COUNT(*) FROM interactions;"
# Token usage per agent
sqlite3 data/simulation.db "
SELECT agent, SUM(tokens_in) as input, SUM(tokens_out) as output
FROM interactions GROUP BY agent;
"
# Test scores per agent
sqlite3 data/simulation.db "
SELECT agent, COUNT(*) as questions, SUM(score) as total,
ROUND(SUM(score) / (COUNT(*) * 10.0) * 100, 1) as pct
FROM test_results GROUP BY agent;
"
# Test scores by question type
sqlite3 data/simulation.db "
SELECT agent, question_type, ROUND(AVG(score), 2) as avg_score
FROM test_results GROUP BY agent, question_type;
"
# How many overflow events per agent per store?
sqlite3 data/simulation.db "
SELECT agent, store_type, COUNT(*) as deletions
FROM overflow_events GROUP BY agent, store_type ORDER BY deletions DESC;
"
# Axiom acceptance rate
sqlite3 data/simulation.db "
SELECT mutation_type, COUNT(*) as count
FROM knowledge_mutations
WHERE mutation_type IN ('axiom_accepted', 'axiom_rejected')
GROUP BY mutation_type;
"
# Knowledge growth: daily store sizes
sqlite3 data/simulation.db "
SELECT day, agent, store_type, entry_count
FROM snapshots ORDER BY day, agent, store_type;
"
# Peer conversation exchange counts per day
sqlite3 data/simulation.db "
SELECT day, agent_a, agent_b, num_exchanges
FROM conversations ORDER BY day;
"
# What knowledge was lost to overflow on a specific day?
sqlite3 data/simulation.db "
SELECT agent, store_type, deleted_content_preview
FROM overflow_events WHERE day = 10;
"The JSON knowledge stores are human-readable:
# How many entries does Alpha have in impulse?
python -c "import json; d=json.load(open('data/alpha/impulse.json')); print(len(d))"
# Print Alpha's axioms
python -c "
import json
axioms = json.load(open('data/alpha/axiom.json'))
for a in axioms:
    print(f\"Day {a['created_day']}: {a['content'][:100]}\")
"

Run the simulation twice with different settings, using --data-dir to keep them separate:
rm -rf data_run1 data_run2
python main.py --days 30 --seed 42 --data-dir data_run1
python main.py --days 30 --seed 42 --ablation-no-peers --data-dir data_run2

Then compare test results:
echo "=== Run 1 (full) ==="
sqlite3 data_run1/simulation.db "SELECT agent, SUM(score), ROUND(SUM(score)/(COUNT(*)*10.0)*100,1) FROM test_results GROUP BY agent;"
echo "=== Run 2 (no peers) ==="
sqlite3 data_run2/simulation.db "SELECT agent, SUM(score), ROUND(SUM(score)/(COUNT(*)*10.0)*100,1) FROM test_results GROUP BY agent;"

All events are logged to data/simulation.db (SQLite):
Every LLM call made during the simulation.
| Column | Type | Description |
|---|---|---|
| day | INTEGER | Simulation day (1-30) |
| phase | TEXT | WAKE, TEACHING, LEARNING, PEER_CONVERSATION, etc. |
| agent | TEXT | alpha, beta, gamma, or oracle |
| action | TEXT | What the call was for (e.g., ask_q3, reply_beta, lecture) |
| prompt_preview | TEXT | First 500 chars of the system prompt |
| response_preview | TEXT | First 500 chars of the LLM response |
| tokens_in | INTEGER | Prompt tokens consumed |
| tokens_out | INTEGER | Completion tokens generated |
| latency_ms | INTEGER | Response time in milliseconds |
| model | TEXT | Model used for this call |
Every knowledge store change.
| Column | Type | Description |
|---|---|---|
| day | INTEGER | When the mutation occurred |
| agent | TEXT | Which agent's store changed |
| store_type | TEXT | impulse, deep_thinking, or axiom |
| mutation_type | TEXT | add, discard, promote, axiom_accepted, axiom_rejected |
| entry_id | TEXT | UUID of the affected entry |
| content_preview | TEXT | First 200 chars of the entry content |
Full peer conversation transcripts.
| Column | Type | Description |
|---|---|---|
| day | INTEGER | Simulation day |
| agent_a / agent_b | TEXT | The two agents in the conversation |
| topic | TEXT | The day's curriculum topic |
| transcript_json | TEXT | Full conversation as JSON array |
| num_exchanges | INTEGER | Number of back-and-forth messages |
Daily knowledge store snapshots.
| Column | Type | Description |
|---|---|---|
| day | INTEGER | Simulation day |
| agent | TEXT | Agent name |
| store_type | TEXT | impulse, deep_thinking, or axiom |
| entry_count | INTEGER | Number of entries at end of day |
| entries_json | TEXT | Full dump of all entries as JSON |
Day 30 test answers and scores.
| Column | Type | Description |
|---|---|---|
| agent | TEXT | alpha, beta, gamma, or solo_baseline |
| question_number | INTEGER | Question number (1-30) |
| question_type | TEXT | impulse, deep, or axiom |
| question | TEXT | The full question text |
| answer | TEXT | The agent's full answer |
| score | REAL | Score from evaluator (0-10) |
| score_reasoning | TEXT | Evaluator's explanation for the score |
Every silent knowledge deletion due to store overflow.
| Column | Type | Description |
|---|---|---|
| day | INTEGER | When the deletion happened |
| agent | TEXT | Which agent lost knowledge |
| store_type | TEXT | Which store overflowed |
| deleted_entry_id | TEXT | UUID of the deleted entry |
| deleted_content_preview | TEXT | First 200 chars of what was lost |
The experiment succeeds if:
- Agents with knowledge outscore the solo baseline --- proving the 29-day learning process added value beyond raw model capabilities
- Different cognitive strategies produce measurably different results --- proving the personality architecture matters
- Overflow events create genuine knowledge management pressure --- proving the constrained stores force meaningful trade-offs
- Axiom validation filters out low-quality content --- proving Gamma's gatekeeping adds quality control
- Peer conversations transfer useful knowledge --- proving multi-agent interaction beats isolated learning
The experiment also produces a rich dataset for analysis: token usage patterns, knowledge growth curves, overflow rates, axiom acceptance rates, conversation transcripts, and per-question scoring breakdowns.
With gpt-4o-mini at ~$0.15/1M input tokens and ~$0.60/1M output tokens:
- A full 30-day run generates approximately 10-13M tokens
- Estimated cost: $3-8 USD depending on conversation length and retries
- Runtime: 2-5 hours (single-threaded, rate-limited)
With --speed fast: roughly half the runtime, similar cost (peer conversations are a small fraction of total tokens).
Using gpt-4o or other frontier models will cost 10-50x more but may produce higher-quality learning and more discriminating test scores.
The knowledge stores use a hand-rolled TF-IDF cosine similarity implementation (~50 lines in knowledge/store.py). This avoids adding numpy/scipy as dependencies and is adequate for the store sizes used.
Token limits are approximated using len(content.split()). This is intentionally simple and model-agnostic. What matters is consistent enforcement of relative limits (impulse < axiom < deep), not exact token counts.
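That approximation amounts to something like the following (a sketch of the enforcement, not the exact store code):

```python
def truncate_to_words(content: str, max_words: int) -> str:
    # "Token" limit approximated by whitespace-split word count,
    # per-store: impulse=100, axiom=250, deep_thinking=500
    words = content.split()
    return content if len(words) <= max_words else " ".join(words[:max_words])
```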
The entire simulation runs synchronously to ensure deterministic phase ordering, reproducible results given the same seed, and no race conditions on knowledge stores. The tradeoff is slower execution.
Alpha, Beta, and Gamma are true peers. The SimulationEnvironment class orchestrates phase timing but never reads or interprets agent outputs. It simply routes messages.
When a store overflows, entries are deleted without notification. The agent discovers knowledge loss organically through failed recall. This is a core simulation mechanic, not a bug.
The random seed controls structural decisions (which entries to evict, how many peer exchanges, which topics to sample for the test). LLM outputs are not seeded --- two runs with the same seed will have the same structure but different LLM-generated content.
Make sure .env exists in the project root with OPENAI_API_KEY=sk-.... The key is loaded at import time by config.py.
The simulation retries automatically with exponential backoff (up to 60s between retries, effectively unlimited attempts). If you're hitting limits constantly, increase MIN_CALL_INTERVAL_SECONDS in config.py.
Knowledge stores are saved to disk at the end of every day. Use --start-day N to resume from the last completed day. The database will contain data from both the original and resumed runs.
Old data in data/simulation.db accumulates across runs. Always rm -rf data/ analysis/ before a fresh run.
The simulation keeps all knowledge stores in memory. With the default limits (50/200/100 entries), this is negligible. If you increase limits to thousands of entries, memory usage will grow proportionally.
This project is an experimental research tool. Use it however you want.