# Research Guide
This guide covers scientific methodology for conducting rigorous research with SimAgents.
## Research Philosophy
SimAgents is designed for studying emergent AI behavior in multi-agent environments. Key principles:
- Reproducibility: Every experiment can be replicated with seed + configuration
- Observability: All state changes are logged and queryable
- Comparability: Standardized metrics enable cross-study comparison
- Minimal Imposition: System provides physics, not strategies
## Designing Experiments

### Experiment DSL
Define experiments in YAML:
name: "resource_scarcity_cooperation"
description: "Test cooperation emergence under resource scarcity"
seed: 12345
world:
size: [100, 100]
biomes:
desert: 0.7
plains: 0.2
forest: 0.1
agents:
- type: claude
count: 5
- type: gemini
count: 5
- type: baseline_random
count: 5
duration: 1000 # ticks
metrics:
- gini
- cooperation_index
- survival_rate
- clustering_coefficient
snapshots:
interval: 100 # Save state every 100 ticks
shocks:
- tick: 500
type: economic
params:
currencyChange: -0.5 # 50% currency destruction
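The runner's `--dry-run` flag (see below) is the canonical validator. For ad-hoc checks from analysis scripts, a minimal loader might look like this sketch; the required-key set and the biome-sum rule are assumptions inferred from the example above, not the runner's actual schema:

```python
# Illustrative config sanity check; the real validation lives in
# src/experiments/runner.ts (invoked with --dry-run).
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "seed", "world", "agents", "duration", "metrics"}

def load_experiment(path: str) -> dict:
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    # Biome fractions should partition the world (assumed rule)
    biomes = config["world"]["biomes"]
    if abs(sum(biomes.values()) - 1.0) > 1e-9:
        raise ValueError("Biome fractions must sum to 1.0")
    return config

config = load_experiment("experiments/my-experiment.yaml")
print(config["name"], "with", sum(a["count"] for a in config["agents"]), "agents")
```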
### Running Experiments
```bash
cd apps/server

# Validate configuration
bun run src/experiments/runner.ts --dry-run experiments/my-experiment.yaml

# Run experiment
bun run src/experiments/runner.ts experiments/my-experiment.yaml

# Run with custom output directory
bun run src/experiments/runner.ts experiments/my-experiment.yaml --output results/
```
### Batch Experiments

Run multiple seeds per condition to support statistical analysis:
```bash
for seed in 12345 23456 34567 45678 56789; do
  bun run src/experiments/runner.ts experiments/my-experiment.yaml --seed $seed
done
```
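A sketch of pooling those runs for analysis. It assumes each run writes `results/<seed>/metrics.csv` with one row per agent; the actual output layout depends on the `--output` flag and runner version:

```python
import pandas as pd

seeds = [12345, 23456, 34567, 45678, 56789]
frames = []
for seed in seeds:
    df = pd.read_csv(f"results/{seed}/metrics.csv")  # assumed layout
    df["seed"] = seed
    frames.append(df)

pooled = pd.concat(frames, ignore_index=True)
# Aggregate per seed first: the seed, not the agent, is the replication unit
per_seed = pooled.groupby("seed").cooperation_index.mean()
print(f"cooperation_index: {per_seed.mean():.3f} ± {per_seed.std():.3f} "
      f"across {len(seeds)} seeds")
```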
## Baseline Agents
For valid hypothesis testing, compare LLM agents against baselines:
### Random Walk (Null Hypothesis)
```yaml
agents:
  - type: baseline_random
    count: 10
```
Actions are chosen uniformly at random, establishing a minimum performance baseline.
### Rule-Based (Classical AI)
```yaml
agents:
  - type: baseline_rule
    count: 10
```
Hardcoded heuristics: eat when hungry, sleep when tired, gather when near resources.
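For illustration, a policy in this style might look like the following sketch; the state fields and thresholds are hypothetical, not the actual baseline implementation:

```python
# Hypothetical priority-ordered heuristic policy; the field names
# (hunger, energy, nearby_resources) and thresholds are illustrative.
def rule_based_action(state: dict) -> str:
    if state["hunger"] > 0.7:
        return "eat"
    if state["energy"] < 0.3:
        return "sleep"
    if state["nearby_resources"]:
        return "gather"
    return "move"  # default: explore
```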
### Q-Learning (Reinforcement Learning)
```yaml
agents:
  - type: baseline_qlearning
    count: 10
```
Tabular Q-learning with a survival-based reward, testing LLM agents against traditional RL.
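For reference, the core of a tabular Q-learning baseline is the standard update rule sketched below; the action set, hyperparameters, and reward shaping are illustrative assumptions, not the platform's actual values:

```python
import random
from collections import defaultdict

ACTIONS = ["move", "eat", "sleep", "gather"]  # illustrative action set
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1        # assumed hyperparameters
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state) -> str:
    if random.random() < EPSILON:            # explore
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)   # exploit

def update(state, action, reward, next_state):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a')
    best_next = max(Q[next_state].values())
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```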
## Metrics

### Economic Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| Gini Coefficient | Standard Gini on agent balances | 0 = equality, 1 = one agent has all |
| Wealth Variance | σ² of agent balances | Higher = more inequality |
| Trade Volume | Successful trades per tick | Higher = more activity |
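A standard Gini implementation for reference, using the mean-absolute-difference form; this mirrors the metric's definition above, though the platform's own implementation may differ in edge-case handling:

```python
import numpy as np

def gini(balances: np.ndarray) -> float:
    """Gini coefficient of agent balances: 0 = equality, (n-1)/n = max."""
    x = np.sort(balances.astype(float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    # Equivalent closed form: G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * x) / (n * x.sum()))

print(gini(np.array([10, 10, 10, 10])))  # 0.0: perfect equality
print(gini(np.array([0, 0, 0, 40])))     # 0.75: one agent holds everything
```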
### Social Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| Cooperation Index | f(trades, trust, clustering) | 0-1, higher = more cooperation |
| Clustering Coefficient | Spatial agent grouping | Higher = agents form groups |
| Conflict Rate | Harm/steal actions per tick | Higher = more conflict |
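The exact form of `f(trades, trust, clustering)` is implementation-defined. Purely to illustrate how such an index can stay in [0, 1], one plausible (assumed) form is an equal-weight average of normalized components:

```python
# NOT the platform's actual formula: an assumed equal-weight combination,
# shown only to make the 0-1 range of the index concrete.
def cooperation_index(trade_rate: float, mean_trust: float,
                      clustering: float) -> float:
    # All inputs assumed pre-normalized to [0, 1]
    return (trade_rate + mean_trust + clustering) / 3.0
```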
### Emergence Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| Emergence Index | (systemComplexity - Σ agentComplexity) / systemComplexity | Higher = more emergent behavior |
| Role Crystallization | Consistency of agent roles over time | Higher = stable social roles |
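The Emergence Index formula translates directly to code. Note that the table leaves the complexity measure itself (e.g., entropy or compressed size) to the implementation, so the complexity values below are placeholders:

```python
def emergence_index(system_complexity: float,
                    agent_complexities: list[float]) -> float:
    """(systemComplexity - sum(agentComplexity)) / systemComplexity"""
    if system_complexity == 0:
        return 0.0
    return (system_complexity - sum(agent_complexities)) / system_complexity

# Placeholder numbers: system behavior richer than the sum of its agents
print(emergence_index(10.0, [1.5, 1.5, 2.0]))  # 0.5
```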
### Survival Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| Survival Rate | Alive agents / initial agents | Reported per LLM type |
| Mean Lifetime | Average ticks survived | Longer = better strategies |
## Reproducibility

### Seed Management
Every random operation uses a seeded PRNG:
```yaml
seed: 12345  # In experiment config
```
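To illustrate the principle (the server's PRNG lives in TypeScript, but the idea is identical): a fixed seed yields an identical draw sequence on every run.

```python
import random

rng = random.Random(12345)
run_a = [rng.random() for _ in range(3)]

rng = random.Random(12345)           # re-seed with the same value
run_b = [rng.random() for _ in range(3)]

assert run_a == run_b  # identical sequences from identical seeds
```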
### State Snapshots
Capture complete world state:
```yaml
snapshots:
  interval: 100
  include:
    - agents
    - resources
    - relationships
    - events
```
### Event Sourcing

All state changes are recorded as events, so any tick can be replayed:
```bash
curl http://localhost:3000/api/replay/tick/150
```
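From analysis code, the same endpoint can be consumed like this; the response shape noted in the comment is an assumption, not a documented schema:

```python
import requests

resp = requests.get("http://localhost:3000/api/replay/tick/150")
resp.raise_for_status()
snapshot = resp.json()  # e.g. {"tick": 150, "agents": [...], ...} (assumed)
```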
## Statistical Analysis

### Recommended Approach
- Multiple Seeds: Run 10+ seeds per condition
- Burn-in Period: Discard first 100 ticks
- Steady-State Analysis: Focus on ticks 100-900
- Final State Comparison: Compare end states across conditions
### Example Analysis
```python
import pandas as pd
from scipy import stats

# Per-agent metrics exported by the experiment runner
results = pd.read_csv("results/experiment/metrics.csv")

# Compare cooperation across LLM types
claude = results[results.llm_type == "claude"].cooperation_index
gemini = results[results.llm_type == "gemini"].cooperation_index

# Mann-Whitney U: non-parametric, no normality assumption
stat, pvalue = stats.mannwhitneyu(claude, gemini)
print(f"Mann-Whitney U: {stat}, p={pvalue:.4f}")
```
## Shock Injection
Test system resilience with controlled perturbations:
### Economic Shocks
```yaml
shocks:
  - tick: 500
    type: economic
    params:
      currencyChange: -0.5  # Destroy 50% of currency
```
### Natural Disasters
```yaml
shocks:
  - tick: 500
    type: disaster
    params:
      type: drought
      severity: 0.7
      duration: 100
```
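As a hypothetical illustration of drought semantics, one natural reading of `severity` and `duration` is a temporary scaling of resource regeneration; the real shock logic lives server-side and may differ:

```python
# Hypothetical: regeneration scaled down by `severity` for `duration` ticks.
def regen_multiplier(tick: int, shock_tick: int = 500,
                     severity: float = 0.7, duration: int = 100) -> float:
    if shock_tick <= tick < shock_tick + duration:
        return 1.0 - severity  # regeneration reduced by 70% during the drought
    return 1.0
```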
## Publishing Research

### Required Disclosures
When publishing SimAgents research, include:
- Experiment Configuration: Full YAML config
- Seeds Used: All random seeds
- Software Version: SimAgents commit hash
- LLM Versions: Specific model versions
- Metrics Definitions: Any custom metrics
### Suggested Citation
```bibtex
@software{simagents2026,
  title  = {SimAgents: A Platform for Studying Emergent AI Behavior},
  author = {AgentAuri Team},
  year   = {2026},
  url    = {https://github.com/agentauri/simagents.io}
}
```
## Known Limitations
- LLM Stochasticity: Even with a fixed simulation seed, LLM responses vary between runs
- API Latency: External LLM calls add timing variability
- Scale Limits: Currently tested up to 50 agents
- No Long-term Memory: Agent memory is per-session