Research Guide

This guide covers scientific methodology for conducting rigorous research with SimAgents.

Research Philosophy

SimAgents is designed for studying emergent AI behavior in multi-agent environments. Key principles:

  1. Reproducibility: Every experiment can be replicated with seed + configuration
  2. Observability: All state changes are logged and queryable
  3. Comparability: Standardized metrics enable cross-study comparison
  4. Minimal Imposition: System provides physics, not strategies

Designing Experiments

Experiment DSL

Define experiments in YAML:

name: "resource_scarcity_cooperation"
description: "Test cooperation emergence under resource scarcity"
seed: 12345

world:
  size: [100, 100]
  biomes:
    desert: 0.7
    plains: 0.2
    forest: 0.1

agents:
  - type: claude
    count: 5
  - type: gemini
    count: 5
  - type: baseline_random
    count: 5

duration: 1000  # ticks

metrics:
  - gini
  - cooperation_index
  - survival_rate
  - clustering_coefficient

snapshots:
  interval: 100  # Save state every 100 ticks

shocks:
  - tick: 500
    type: economic
    params:
      currencyChange: -0.5  # 50% currency destruction
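Before running, a config like the one above can be sanity-checked programmatically. The sketch below is illustrative, not the runner's actual validation logic; the field names come from the example config, and the `validate` helper and its rules (required keys, biome fractions summing to 1) are assumptions:

```python
def validate(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passes basic checks."""
    problems = []
    for key in ("name", "seed", "world", "duration"):
        if key not in cfg:
            problems.append(f"missing required key: {key}")
    biomes = cfg.get("world", {}).get("biomes", {})
    if biomes and abs(sum(biomes.values()) - 1.0) > 1e-9:
        problems.append(f"biome fractions sum to {sum(biomes.values())}, expected 1.0")
    return problems

# The experiment config above, parsed into a dict.
cfg = {
    "name": "resource_scarcity_cooperation",
    "seed": 12345,
    "world": {"size": [100, 100], "biomes": {"desert": 0.7, "plains": 0.2, "forest": 0.1}},
    "duration": 1000,
}
print(validate(cfg))  # → []
```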

Running Experiments

cd apps/server

# Validate configuration
bun run src/experiments/runner.ts --dry-run experiments/my-experiment.yaml

# Run experiment
bun run src/experiments/runner.ts experiments/my-experiment.yaml

# Run with custom output
bun run src/experiments/runner.ts experiments/my-experiment.yaml --output results/

Batch Experiments

Run the same experiment under multiple seeds so that differences between conditions can be tested statistically:

for seed in 12345 23456 34567 45678 56789; do
  bun run src/experiments/runner.ts experiments/my-experiment.yaml --seed $seed
done
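Once the batch finishes, per-seed results can be aggregated into a mean and spread. The values below are hypothetical stand-ins for a final metric read from each run's output:

```python
import statistics

# Hypothetical final cooperation_index per seed, read from each run's results.
runs = {12345: 0.61, 23456: 0.58, 34567: 0.66, 45678: 0.55, 56789: 0.63}

mean = statistics.mean(runs.values())
stdev = statistics.stdev(runs.values())
print(f"cooperation_index: {mean:.3f} ± {stdev:.3f} (n={len(runs)})")
```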

Baseline Agents

For valid hypothesis testing, compare LLM agents against baselines:

Random Walk (Null Hypothesis)

agents:
  - type: baseline_random
    count: 10

Actions are chosen uniformly at random, establishing a minimum performance baseline.
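A uniform-random policy amounts to a seeded draw over the action set. The action names below are illustrative, not the platform's actual action vocabulary:

```python
import random

ACTIONS = ["move", "eat", "sleep", "gather", "trade"]  # illustrative action set

def random_policy(rng: random.Random) -> str:
    """Null-hypothesis baseline: every action is equally likely."""
    return rng.choice(ACTIONS)

rng = random.Random(12345)  # seeded so the baseline is reproducible
sample = [random_policy(rng) for _ in range(1000)]
# With 1000 draws over 5 actions, each appears roughly 200 times.
print({a: sample.count(a) for a in ACTIONS})
```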

Rule-Based (Classical AI)

agents:
  - type: baseline_rule
    count: 10

Hardcoded heuristics: eat when hungry, sleep when tired, gather when near resources.
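The heuristics above can be sketched as a fixed priority order. The thresholds and state keys here are assumptions for illustration, not the shipped rule set:

```python
def rule_policy(state: dict) -> str:
    """Hardcoded heuristic baseline: fixed priority order, no learning."""
    if state["hunger"] > 0.7:       # eat when hungry
        return "eat"
    if state["fatigue"] > 0.8:      # sleep when tired
        return "sleep"
    if state["near_resource"]:      # gather when near resources
        return "gather"
    return "move"                   # otherwise explore

print(rule_policy({"hunger": 0.9, "fatigue": 0.2, "near_resource": True}))  # → eat
print(rule_policy({"hunger": 0.1, "fatigue": 0.1, "near_resource": True}))  # → gather
```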

Q-Learning (Reinforcement Learning)

agents:
  - type: baseline_qlearning
    count: 10

Tabular Q-learning with survival reward. Tests LLM vs traditional RL.
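For reference, the core of tabular Q-learning is a single update rule. This is a generic sketch of the standard algorithm with an ε-greedy policy, not the platform's implementation; the hyperparameters and action/state names are illustrative:

```python
import random
from collections import defaultdict

ACTIONS = ["move", "eat", "gather"]   # illustrative action set
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)  # (state, action) -> estimated value

def choose(state, rng: random.Random) -> str:
    """ε-greedy: explore with probability EPSILON, otherwise exploit."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Q(s,a) += α · (r + γ · max_a' Q(s',a') − Q(s,a))"""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

update("hungry", "eat", 1.0, "sated")
print(Q[("hungry", "eat")])  # → 0.1 (first update from zero is α·r)
```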


Metrics

Economic Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Gini Coefficient | Standard Gini on agent balances | 0 = equality, 1 = one agent has all |
| Wealth Variance | σ² of agent balances | Higher = more inequality |
| Trade Volume | Successful trades per tick | Higher = more activity |
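The Gini coefficient can be computed from agent balances via the mean-absolute-difference formulation. A minimal sketch (the `gini` helper is for illustration, not the platform's metrics module):

```python
def gini(balances: list[float]) -> float:
    """Gini coefficient: half the mean absolute difference, normalized by the mean."""
    n = len(balances)
    total = sum(balances)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in balances for y in balances)
    return diff_sum / (2 * n * total)

print(gini([10, 10, 10, 10]))  # → 0.0  (perfect equality)
print(gini([0, 0, 0, 100]))    # → 0.75 (one agent holds everything)
```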

Social Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Cooperation Index | f(trades, trust, clustering) | 0-1, higher = more cooperation |
| Clustering Coefficient | Spatial agent grouping | Higher = agents form groups |
| Conflict Rate | Harm/steal actions per tick | Higher = more conflict |
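The table only specifies the Cooperation Index as some combination f(trades, trust, clustering). One plausible form, shown purely as an assumption, is a weighted mean of the three components after each is normalized to [0, 1]; the actual f used by the platform may differ:

```python
def cooperation_index(trade_rate: float, mean_trust: float, clustering: float,
                      weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical f(trades, trust, clustering): weighted mean of
    three components, each pre-normalized to [0, 1]."""
    score = sum(w * c for w, c in zip(weights, (trade_rate, mean_trust, clustering)))
    return min(1.0, max(0.0, score))  # clamp to the documented 0-1 range

print(cooperation_index(0.6, 0.9, 0.3))  # ≈ 0.6
```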

Emergence Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Emergence Index | (systemComplexity − Σ agentComplexity) / systemComplexity | Higher = more emergent behavior |
| Role Crystallization | Consistency of agent roles over time | Higher = stable social roles |
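The Emergence Index formula translates directly to code. How complexity itself is measured is left open by the table; the numbers below are placeholders:

```python
def emergence_index(system_complexity: float, agent_complexities: list[float]) -> float:
    """(systemComplexity − Σ agentComplexity) / systemComplexity, per the table above.
    Positive when the whole system is more complex than the sum of its parts."""
    if system_complexity == 0:
        return 0.0
    return (system_complexity - sum(agent_complexities)) / system_complexity

print(emergence_index(10.0, [2.0, 2.0, 2.0]))  # → 0.4
```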

Survival Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Survival Rate | Alive agents / initial agents | Reported per LLM type |
| Mean Lifetime | Average ticks survived | Longer = better strategies |

Reproducibility

Seed Management

Every random operation uses a seeded PRNG:

seed: 12345  # In experiment config
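The property this buys is that the same seed always produces the same random stream. A minimal demonstration with Python's standard PRNG, standing in for any seeded subsystem:

```python
import random

def simulate(seed: int, ticks: int = 5) -> list[float]:
    """Stand-in for a seeded subsystem: the same seed yields an identical stream."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(ticks)]

assert simulate(12345) == simulate(12345)  # replay is deterministic
assert simulate(12345) != simulate(54321)  # changing the seed changes the stream
print("deterministic replay OK")
```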

State Snapshots

Capture complete world state:

snapshots:
  interval: 100
  include:
    - agents
    - resources
    - relationships
    - events

Event Sourcing

All state changes are recorded as events, so any moment can be replayed:

curl http://localhost:3000/api/replay/tick/150

Statistical Analysis

  1. Multiple Seeds: Run 10+ seeds per condition
  2. Burn-in Period: Discard first 100 ticks
  3. Steady-State Analysis: Focus on ticks 100-900
  4. Final State Comparison: Compare end states across conditions
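Steps 2 and 3 amount to filtering per-tick records to the steady-state window before summarizing. A sketch on hypothetical per-tick data:

```python
# Hypothetical per-tick metric records: (tick, cooperation_index).
records = [(t, 0.2 + 0.0005 * t) for t in range(0, 1000, 10)]

BURN_IN, STEADY_END = 100, 900  # discard warm-up, analyze ticks 100-900
steady = [v for t, v in records if BURN_IN <= t <= STEADY_END]

mean_steady = sum(steady) / len(steady)
print(f"steady-state mean over ticks {BURN_IN}-{STEADY_END}: {mean_steady:.3f}")
```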

Example Analysis

import pandas as pd
from scipy import stats

results = pd.read_csv("results/experiment/metrics.csv")

claude = results[results.llm_type == "claude"].cooperation_index
gemini = results[results.llm_type == "gemini"].cooperation_index

stat, pvalue = stats.mannwhitneyu(claude, gemini)
print(f"Mann-Whitney U: {stat}, p={pvalue:.4f}")

Shock Injection

Test system resilience with controlled perturbations:

Economic Shocks

shocks:
  - tick: 500
    type: economic
    params:
      currencyChange: -0.5  # Destroy 50% of currency

Natural Disasters

shocks:
  - tick: 500
    type: disaster
    params:
      type: drought
      severity: 0.7
      duration: 100

Publishing Research

Required Disclosures

When publishing SimAgents research, include:

  1. Experiment Configuration: Full YAML config
  2. Seeds Used: All random seeds
  3. Software Version: SimAgents commit hash
  4. LLM Versions: Specific model versions
  5. Metrics Definitions: Any custom metrics

Suggested Citation

@software{simagents2026,
  title  = {SimAgents: A Platform for Studying Emergent AI Behavior},
  author = {AgentAuri Team},
  year   = {2026},
  url    = {https://github.com/agentauri/simagents.io}
}

Known Limitations

  1. LLM Stochasticity: Even with seeds, LLM responses vary
  2. API Latency: External LLM calls add timing variability
  3. Scale Limits: Currently tested up to 50 agents
  4. No Long-term Memory: Agent memory is per-session

Further Reading