Chaos Engineering

Here is a question that separates serious infrastructure from everything else: have you actually tested what happens when things break? Not “have you written unit tests that mock failures,” but “have you killed a running agent mid-task and watched what the system does next?”

Obsidian treats chaos engineering not as a luxury or a quarterly exercise, but as a continuous discipline. This is Constitution Principle 1 made operational: failures are inevitable; design for survival, not prevention. You cannot design for survival if you have never observed a failure. You cannot observe a failure if you have never caused one.

The Five Chaos Levels

Obsidian defines five chaos levels, each escalating in severity, blast radius, and the amount of approval required before you unleash it. Think of them as a controlled escalation ladder — you start with a gentle nudge and work your way up to simulated apocalypse.

◈ Chaos Level Escalation


LEVEL   SEVERITY     APPROVAL               BLAST RADIUS
─────   ────────     ────────               ────────────
  1     Gentle       None                   Single component
  2     Mild         None                   Single agent
  3     Moderate     SRE                    Multiple agents
  4     Severe       SRE + Warden           Major components
  5     Apocalypse   SRE + Warden + Human   Unlimited

Level 1: Gentle. Minor delays. Single request failures. Configuration drift. No approval required, maximum 30 minutes, affects at most one component. This is the chaos equivalent of checking whether the smoke detector has batteries.

Level 2: Mild. Single agent failures. Latency injection. Intermittent errors. Memory pressure. Still no approval needed, but now you are testing whether the system notices a component going dark and recovers from it.

Level 3: Moderate. Network partitions. Resource exhaustion. Multiple simultaneous agent failures. This requires SRE approval because you are now testing whether the system maintains Systemic Integrity when several things go wrong at once.

Level 4: Severe. Major component failures. Extended outages. Data corruption scenarios. Cascade failures. Both SRE and the Warden must approve. You are testing whether the system degrades gracefully under conditions that would take down most architectures entirely.

Level 5: Apocalypse. Multiple catastrophic failures simultaneously. Unlimited blast radius. Requires SRE, Warden, and human approval, plus active supervision. This is the scenario nobody wants to encounter in production — which is precisely why you must rehearse it.
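
One way to picture the escalation ladder is as a lookup from level to required sign-offs. The sketch below is illustrative only: the class names, role strings, and helper functions are assumptions made for this example, not Obsidian's actual schema. Only the approval and blast-radius values come from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class ChaosLevel(IntEnum):
    GENTLE = 1
    MILD = 2
    MODERATE = 3
    SEVERE = 4
    APOCALYPSE = 5


@dataclass(frozen=True)
class LevelPolicy:
    approvers: frozenset          # roles that must sign off before injection
    blast_radius: str             # widest scope the level may touch
    needs_supervision: bool = False


# Values transcribed from the escalation table; the structure is hypothetical.
POLICIES = {
    ChaosLevel.GENTLE: LevelPolicy(frozenset(), "single component"),
    ChaosLevel.MILD: LevelPolicy(frozenset(), "single agent"),
    ChaosLevel.MODERATE: LevelPolicy(frozenset({"sre"}), "multiple agents"),
    ChaosLevel.SEVERE: LevelPolicy(frozenset({"sre", "warden"}), "major components"),
    ChaosLevel.APOCALYPSE: LevelPolicy(
        frozenset({"sre", "warden", "human"}), "unlimited", needs_supervision=True
    ),
}


def missing_approvals(level, granted):
    """Return the roles that still have to approve before the level may run."""
    return POLICIES[ChaosLevel(level)].approvers - set(granted)


def may_run(level, granted):
    """True once every required role has approved."""
    return not missing_approvals(level, granted)
```

Under these assumptions, may_run(2, []) is true while may_run(4, ["sre"]) is not, because the Warden has yet to approve.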

Chaos Type Taxonomy

Every chaos injection falls into one of six categories, each targeting a different failure surface:

Agent chaos — kill, freeze, slow, inject errors, corrupt state, or make an agent consume excessive resources. This tests whether Fractal Delegation hierarchies survive the loss or degradation of individual nodes.

Network chaos — partitions, latency, packet loss, bandwidth limits, DNS failures, connection resets. The distributed systems classics, because the network is always lying to you.

Storage chaos — full disks, I/O latency, I/O errors, data corruption, permission changes, vault unavailability. Tests whether your persistence layer is as reliable as you assume it is.

Memory chaos — pressure, leaks, out-of-memory conditions, fragmentation. The failures that creep up slowly and then arrive all at once.

Time chaos — clock skew, NTP failures, timezone shifts, time jumps. Particularly insidious because most systems silently assume clocks are monotonic and synchronized.

Dependency chaos — LLM API outages, slow responses, error injection, database failures, cache misses, external service failures. Tests the boundaries where Obsidian meets the outside world.
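
The six categories lend themselves to a small closed enumeration. The sketch below assumes chaos types are written as category.action, a format inferred solely from the agent.kill identifier used in the CLI examples later in this section. The enum and the parser are hypothetical helpers, not part of Obsidian.

```python
from enum import Enum


class ChaosCategory(Enum):
    AGENT = "agent"
    NETWORK = "network"
    STORAGE = "storage"
    MEMORY = "memory"
    TIME = "time"
    DEPENDENCY = "dependency"


def parse_chaos_type(type_id):
    """Split 'category.action' and validate the category against the taxonomy."""
    category, sep, action = type_id.partition(".")
    if not sep or not action:
        raise ValueError(f"expected 'category.action', got {type_id!r}")
    try:
        return ChaosCategory(category), action
    except ValueError:
        # Re-raise with a message that names the offending category.
        raise ValueError(f"unknown chaos category: {category!r}") from None
```

Rejecting unknown categories up front keeps a typo in a scheduled experiment from turning into a silent no-op.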

Experiment Structure

Every chaos experiment follows a rigorous structure: define a hypothesis, establish steady-state conditions, inject chaos, observe, and measure. This is not random destruction — it is the scientific method applied to infrastructure.

A well-formed experiment declares what “normal” looks like (all agents healthy, tasks processing, Warden active), specifies exactly what chaos to inject and when, defines abort conditions that automatically halt the experiment if things go truly sideways, and collects metrics that either confirm or refute the hypothesis.
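
The lifecycle described above can be sketched as a small harness. This is a minimal illustration under stated assumptions: the Experiment fields, the run function, and the three verdict strings are invented for this example and say nothing about how Obsidian actually represents experiments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Experiment:
    hypothesis: str
    steady_state: Callable[[], bool]   # True when the system looks "normal"
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # removes the fault
    verify: Callable[[], bool]         # True if the hypothesis held under chaos
    metrics: Dict[str, object] = field(default_factory=dict)


def run(exp):
    """Run one experiment and return 'skipped', 'confirmed', or 'refuted'."""
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped"
    exp.inject()
    try:
        held = exp.verify()
    finally:
        # Always remove the fault, even if verification raises.
        exp.rollback()
    # The hypothesis only counts as confirmed if steady state comes back.
    exp.metrics["recovered"] = exp.steady_state()
    return "confirmed" if held and exp.metrics["recovered"] else "refuted"
```

The steady-state check runs twice on purpose: once as a precondition, so a pre-existing fault is not blamed on the experiment, and once afterwards, so recovery is measured rather than assumed.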

The abort conditions deserve emphasis. Every experiment has automatic safety limits — maximum affected agents, maximum duration, maximum blast radius. If the Warden goes down during a Level 2 experiment, the experiment stops. If more than half the agents become unhealthy, the experiment stops. Chaos engineering is controlled destruction, and “controlled” is the operative word. This is safety through boundaries in its most literal form — not trusting that experiments will behave, but structurally guaranteeing they cannot exceed their mandate.
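
A guard of this kind might look like the following sketch. The two rules named above (Warden down, more than half the agents unhealthy) are taken from the text. The field names, the Snapshot shape, and the warden_is_target escape hatch for scenarios that kill the Warden on purpose are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyLimits:
    max_affected_agents: int
    max_duration_s: float


@dataclass(frozen=True)
class Snapshot:
    elapsed_s: float
    affected_agents: int
    unhealthy_agents: int
    total_agents: int
    warden_alive: bool


def abort_reason(limits, snap, warden_is_target=False):
    """Return why the experiment must stop, or None if it may continue."""
    # Hypothetical flag: a scenario that deliberately kills the Warden
    # cannot treat the Warden's absence as an abort signal.
    if not snap.warden_alive and not warden_is_target:
        return "warden down"
    if snap.unhealthy_agents * 2 > snap.total_agents:
        return "more than half of agents unhealthy"
    if snap.affected_agents > limits.max_affected_agents:
        return "affected-agent limit exceeded"
    if snap.elapsed_s > limits.max_duration_s:
        return "duration limit exceeded"
    return None
```

A Level 1 run, for instance, could be given SafetyLimits(max_affected_agents=1, max_duration_s=30 * 60) to mirror the 30-minute cap, with abort_reason polled on every observation tick.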

Pre-Built Scenarios

Obsidian ships with pre-built scenarios for the failure modes you should be testing regularly:

Warden Failure and Recovery — kills the Warden via SIGKILL, then verifies that the Scout detects the failure, initiates revival, and restores monitoring. If your constitutional guardian cannot survive assassination, your Constitutional Consensus mechanism has a single point of failure. Level 3.

LLM API Complete Outage — simulates total LLM unavailability and validates that circuit breakers trip, degradation levels increase, and agents switch to fallback mode. When the outage resolves, the system must recover fully within five minutes. Level 4.

Split Brain Network Partition — creates a network partition splitting the agent cluster into isolated groups, then verifies both groups continue operating, task assignment adjusts, and no conflicting file reservations occur. When the partition heals, state must reconcile without data loss. Level 4.
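
The LLM outage scenario expects circuit breakers to trip and agents to fall back. For readers unfamiliar with the pattern, here is a generic three-state circuit breaker. It is a textbook sketch, not Obsidian's implementation, and the thresholds, names, and fallback mechanism are all assumptions.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, primary, fallback):
        """Try primary unless the breaker is open; use fallback on failure."""
        if self.state == "open":
            return fallback()       # fail fast without touching the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed half-open probe.
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A chaos run against such a breaker would check two things: that calls stop reaching the dead API once the threshold is crossed, and that the first successful probe after the cooldown restores normal operation.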

Running Chaos

# Run a pre-built scenario
obs chaos run --scenario warden-failure

# Run at a specific level
obs chaos run --level 2 --type agent.kill --target worker-alpha

# Schedule recurring chaos (cron "0 3 * * 1" = 03:00 every Monday)
obs chaos schedule --scenario llm-outage --cron "0 3 * * 1"

# View experiment results
obs chaos results --experiment exp-agent-revival-001

The obs interface for chaos engineering follows the same pattern as everything else in the system: explicit, auditable, and scriptable. No hidden state. No surprise side effects. JSON output for automation, human-readable output for operators.
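
Because the interface is scriptable, invocations can be assembled programmatically. The helper below is a hypothetical convenience wrapper: it builds an argument list using only the obs chaos run flags shown above and makes no claim about which flag combinations the real CLI accepts.

```python
import shlex


def chaos_run_argv(scenario=None, level=None, chaos_type=None, target=None):
    """Build argv for `obs chaos run` from the flags shown in this section."""
    argv = ["obs", "chaos", "run"]
    flags = (
        ("--scenario", scenario),
        ("--level", level),
        ("--type", chaos_type),
        ("--target", target),
    )
    for flag, value in flags:
        if value is not None:
            argv += [flag, str(value)]
    if len(argv) == 3:
        raise ValueError("nothing to run: pass a scenario or a chaos type")
    return argv


# Print the commands instead of executing them, so the sketch is safe to run.
print(shlex.join(chaos_run_argv(scenario="warden-failure")))
print(shlex.join(chaos_run_argv(level=2, chaos_type="agent.kill", target="worker-alpha")))
```

Returning a list rather than a string means the result can be handed directly to subprocess.run without shell quoting concerns.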

Why This Matters

The alternative to chaos engineering is hope — hoping your failover works, hoping your circuit breakers trip, hoping your Warden Chain survives a node failure. Hope is not a strategy. Hope is what you have before you test, and confidence is what you have after.

Every chaos experiment that passes is a proof of resilience. Every experiment that fails is a vulnerability you found before production found it for you. Both outcomes are valuable. Only the untested system is dangerous.

This is what Constitutional Compliance looks like in practice — not a checkbox exercise, but a continuous proof that the system actually behaves according to its own principles. You wrote a Constitution that says failures are survivable. Chaos engineering is how you hold the system to that promise.

Failures are inevitable; design for survival, not prevention.