Model Routing
Every LLM request in Obsidian passes through the routing layer. This is not a load balancer. It is an intelligent dispatch system that evaluates cost, latency, quality, and availability in real time — then sends each request to the provider most likely to produce the best result at the lowest cost within the required time.
This is Constitution Principle 6 made operational: no vendor lock-in; no single point of failure. The routing layer treats the model landscape as dynamic and heterogeneous. Providers come and go. Prices change. Quality drifts. The router adapts.
┌──────────────────────────────────────────────────────────┐
│                        LLM Router                        │
│                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────────┐   │
│  │  Request   │  │  Strategy   │  │     Provider     │   │
│  │ Classifier │──│  Selector   │──│  Health Monitor  │   │
│  └────────────┘  └─────────────┘  └──────────────────┘   │
│                         │                                │
│  ┌──────────────────────▼─────────────────────────────┐  │
│  │                   Provider Pool                    │  │
│  │                                                    │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │  │
│  │  │Anthropic │  │  OpenAI  │  │  Ollama  │   ...    │  │
│  │  │Claude 4  │  │ GPT-4.1  │  │  Local   │          │  │
│  │  │ $3/MTok  │  │ $2/MTok  │  │    $0    │          │  │
│  │  │ 1.2s p50 │  │ 0.8s p50 │  │ 2.1s p50 │          │  │
│  │  └──────────┘  └──────────┘  └──────────┘          │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
How Routing Works
Every LLM request carries metadata: the task type, the required capability (code generation, analysis, summarization), the priority level, and optional constraints (max latency, max cost, required provider). The router uses this metadata to select the optimal provider through a three-stage pipeline.
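In code terms, the metadata attached to each request might look roughly like this. This is a sketch; the field names are illustrative, not Obsidian's actual API:
// Illustrative shape of per-request routing metadata (field names are assumptions).
type Capability = "code_generation" | "analysis" | "summarization" | "security_audit";
type Priority = "low" | "normal" | "high" | "critical";

interface RoutingRequest {
  taskType: string;              // e.g. "summarization"
  capability: Capability;        // what the selected model must be good at
  priority: Priority;
  constraints?: {
    maxLatencyMs?: number;       // optional ceiling on response time
    maxCostUsd?: number;         // optional ceiling on spend for this request
    requiredProvider?: string;   // pin to a specific provider when set
  };
}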
Stage 1: Request Classification
The router classifies each request by capability requirements. A code generation task needs a model with strong coding benchmarks. A summarization task can use a cheaper, faster model. A security audit needs the highest-quality reasoning available regardless of cost.
# Classification rules in obsidian.yaml
llm:
  classification:
    code_generation:
      min_quality: high
      preferred_providers: [anthropic, openai]
    summarization:
      min_quality: medium
      optimize_for: cost
    security_audit:
      min_quality: highest
      optimize_for: quality
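Conceptually, classification is a lookup from task type to capability requirements. A minimal sketch of that lookup, assuming the rules above have been parsed into memory (types and names are illustrative):
// Illustrative lookup from task type to capability requirements (names are assumptions).
interface ClassificationRule {
  minQuality: "medium" | "high" | "highest";
  optimizeFor?: "cost" | "quality" | "latency";
  preferredProviders?: string[];
}

const classificationRules = new Map<string, ClassificationRule>([
  ["code_generation", { minQuality: "high", preferredProviders: ["anthropic", "openai"] }],
  ["summarization",   { minQuality: "medium", optimizeFor: "cost" }],
  ["security_audit",  { minQuality: "highest", optimizeFor: "quality" }],
]);

function classify(taskType: string): ClassificationRule {
  // Unknown task types fall back to a conservative default rather than failing the request.
  return classificationRules.get(taskType) ?? { minQuality: "high" };
}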
Stage 2: Strategy Selection
Obsidian ships with four routing strategies. Each balances the competing demands of cost, quality, and speed differently.
Cost-optimized — selects the cheapest provider that meets the minimum quality threshold. This is the usual choice for routine tasks such as summarization. Token costs are real money; the router tracks them per-provider, per-model, and per-task-type.
Quality-optimized — selects the highest-quality provider available, regardless of cost. Reserved for tasks where getting it wrong costs more than getting it expensive — security audits, architectural decisions, production deployments.
Latency-optimized — selects the fastest responding provider. Used when time matters more than depth — interactive sessions, real-time agent coordination, user-facing completions.
Balanced — the weighted default. Scores each provider on a composite of cost (30%), quality (40%), and latency (30%), then selects the highest-scoring available provider. The weights are configurable.
llm:
  routing:
    default_strategy: balanced
    weights:
      cost: 0.3
      quality: 0.4
      latency: 0.3
    overrides:
      - task_type: security_audit
        strategy: quality
      - task_type: summarization
        strategy: cost
      - priority: critical
        strategy: quality
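Under the hood, the balanced strategy amounts to a weighted sum over normalized per-provider metrics. A sketch of that scoring, with assumed names and a simple max-based normalization:
// Illustrative scoring for the balanced strategy (names and normalization are assumptions).
interface ProviderMetrics {
  name: string;
  costPerMTokUsd: number;   // blended input/output price
  qualityScore: number;     // 0..1, maintained by benchmarking
  p50LatencyMs: number;
}
interface StrategyWeights { cost: number; quality: number; latency: number; }

function balancedScore(p: ProviderMetrics, pool: ProviderMetrics[], w: StrategyWeights): number {
  // Normalize within the current pool so that cheaper and faster both push toward 1.0.
  const maxCost = Math.max(...pool.map(x => x.costPerMTokUsd), 1e-9);
  const maxLatency = Math.max(...pool.map(x => x.p50LatencyMs), 1e-9);
  const costScore = 1 - p.costPerMTokUsd / maxCost;
  const latencyScore = 1 - p.p50LatencyMs / maxLatency;
  return w.cost * costScore + w.quality * p.qualityScore + w.latency * latencyScore;
}
// With the default weights ({ cost: 0.3, quality: 0.4, latency: 0.3 }), quality
// contributes 40% of the composite and the highest-scoring healthy provider wins.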
Stage 3: Provider Health Check
Before dispatching, the router checks provider health. A provider that is down, degraded, or circuit-broken is excluded from selection — no matter how cheap or high-quality it would otherwise be.
Health is determined by three signals:
- Availability — is the provider responding to health checks?
- Error rate — what percentage of recent requests failed?
- Latency trend — is response time increasing beyond acceptable bounds?
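A sketch of how those three signals might fold into a single routable / not-routable decision; the thresholds and field names here are assumptions, not Obsidian's configured values:
// Illustrative health evaluation over the three signals (thresholds are assumptions).
interface ProviderHealthSignals {
  respondingToHealthChecks: boolean;   // availability
  recentErrorRatePercent: number;      // error rate over a sliding window
  latencyTrendRatio: number;           // current p50 divided by baseline p50
}

function isRoutable(h: ProviderHealthSignals): boolean {
  if (!h.respondingToHealthChecks) return false;     // down
  if (h.recentErrorRatePercent >= 50) return false;  // degraded
  if (h.latencyTrendRatio > 3) return false;         // latency drifting beyond acceptable bounds
  return true;
}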
Failover
When a provider fails mid-request, the router does not surface the error to the calling agent. It retries with the next-best provider automatically. This is transparent failover — the agent sees a slightly slower response, not an error.
Agent Request
     │
     ▼
┌──────────┐
│  Router  │──── Select: Anthropic (best score)
└────┬─────┘
     │
     ▼
┌──────────┐
│Anthropic │──── TIMEOUT (circuit breaker trips)
└────┬─────┘
     │ failover
     ▼
┌──────────┐
│  OpenAI  │──── SUCCESS (response returned)
└────┬─────┘
     │
     ▼
Agent receives response
(unaware of failover)
The failover chain respects the original routing constraints. If the request required min_quality: high, the failover will not route to a model that does not meet that threshold — it will exhaust all qualifying providers before returning an error. Degraded service is acceptable. Wrong answers are not.
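A minimal sketch of that failover loop, assuming the providers have already been ranked by the selected strategy and filtered to those meeting the original constraints (the types below are stand-ins for the provider interface shown later):
// Illustrative transparent failover: try ranked, qualifying providers in order and
// only surface an error once every one of them has failed (names are assumptions).
interface CompletionRequestLike { prompt: string; }
interface CompletionResponseLike { text: string; }
interface ProviderLike {
  name: string;
  complete(request: CompletionRequestLike): Promise<CompletionResponseLike>;
}

async function completeWithFailover(
  ranked: ProviderLike[],                       // already filtered to min_quality etc.
  request: CompletionRequestLike,
): Promise<CompletionResponseLike> {
  let lastError: unknown;
  for (const provider of ranked) {
    try {
      return await provider.complete(request);  // success: the agent never sees earlier failures
    } catch (err) {
      lastError = err;                          // record and fall through to the next-best provider
    }
  }
  throw new Error(`all qualifying providers failed: ${String(lastError)}`);
}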
Circuit Breakers
Each provider has an independent circuit breaker. When a provider’s error rate exceeds the configured threshold (default: 50% over 60 seconds), the circuit opens and the provider is excluded from routing for a cooldown period.
llm:
  circuit_breaker:
    error_threshold_percent: 50
    window_seconds: 60
    cooldown_seconds: 300
    half_open_requests: 3
After cooldown, the circuit enters half-open state — a small number of probe requests are sent to test recovery. If they succeed, the circuit closes and the provider rejoins the pool. If they fail, the cooldown resets. This is Constitution Principle 1: failures are inevitable; design for survival, not prevention.
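The closed / open / half-open cycle can be sketched as a small state machine, with defaults taken from the config above (the class and method names are assumptions):
// Illustrative per-provider circuit breaker; names are assumptions, defaults mirror the config.
type CircuitState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private openedAtMs = 0;
  private probesRemaining = 0;

  constructor(
    private errorThresholdPercent = 50,
    private cooldownSeconds = 300,
    private halfOpenRequests = 3,
  ) {}

  // Called before dispatch: may a request be routed to this provider right now?
  allowRequest(nowMs: number): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && nowMs - this.openedAtMs >= this.cooldownSeconds * 1000) {
      this.state = "half_open";                     // cooldown elapsed: start probing
      this.probesRemaining = this.halfOpenRequests;
    }
    return this.state === "half_open" && this.probesRemaining-- > 0;
  }

  // Called after each response or on each health-window tick.
  record(errorRatePercent: number, probeSucceeded: boolean, nowMs: number): void {
    if (this.state === "half_open") {
      // A fuller version would require all probes to succeed before closing.
      if (probeSucceeded) this.state = "closed";              // provider rejoins the pool
      else { this.state = "open"; this.openedAtMs = nowMs; }  // cooldown resets
    } else if (this.state === "closed" && errorRatePercent >= this.errorThresholdPercent) {
      this.state = "open";                                    // trip: exclude from routing
      this.openedAtMs = nowMs;
    }
  }
}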
Cost Tracking
The router tracks every token. Input tokens, output tokens, per-model pricing, per-task cost attribution. This is not an afterthought — it is an operational requirement. LLM costs are the dominant variable cost in any agent system, and you cannot manage what you cannot measure.
# Real-time cost visibility
obs llm cost --today
obs llm cost --by-provider --since 7d
obs llm cost --by-task-type --since 30d
# Budget enforcement
obs llm cost --budget
Cost data feeds into routing decisions. If a provider raises prices, the cost-optimized strategy automatically shifts traffic. If monthly spending approaches budget limits, the router can downgrade non-critical requests to cheaper models. This is Principle 7 in action: learn continuously from production reality.
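The underlying arithmetic is simple: a request's cost is its token counts multiplied by the model's per-million-token prices, attributed to the task and provider that incurred it. A sketch with assumed names:
// Illustrative per-request cost attribution (names are assumptions).
interface ModelPricing { costPerMTokInput: number; costPerMTokOutput: number; }
interface TokenUsage { inputTokens: number; outputTokens: number; }

function requestCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  return (usage.inputTokens / 1_000_000) * pricing.costPerMTokInput
       + (usage.outputTokens / 1_000_000) * pricing.costPerMTokOutput;
}
// Example: 12,000 input + 2,000 output tokens on a $3 / $15 per-MTok model
// comes to 0.012 * 3 + 0.002 * 15 = $0.066, attributed to that task type and provider.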
Budget Limits
llm:
  budget:
    daily_limit_usd: 100
    monthly_limit_usd: 2000
    alert_threshold_percent: 80
    enforcement: warn  # warn | soft_limit | hard_limit
When enforcement is hard_limit, the router will reject non-critical requests once the budget is exhausted. Critical requests (security, escalation) are always allowed. The system degrades gracefully rather than going dark.
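A sketch of that admission decision, assuming a running spend counter; the names are illustrative, and the warn and soft_limit modes are assumed never to reject outright in this sketch:
// Illustrative budget admission check (names are assumptions).
type RequestPriority = "low" | "normal" | "high" | "critical";
type Enforcement = "warn" | "soft_limit" | "hard_limit";
interface BudgetState { spentTodayUsd: number; dailyLimitUsd: number; }

function admitRequest(priority: RequestPriority, enforcement: Enforcement, budget: BudgetState): boolean {
  const exhausted = budget.spentTodayUsd >= budget.dailyLimitUsd;
  if (!exhausted || enforcement !== "hard_limit") return true;  // only hard_limit ever blocks
  return priority === "critical";                               // security/escalation always pass
}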
Provider Abstraction
Every LLM provider implements the same interface. Anthropic, OpenAI, Google, Mistral, local Ollama instances — the router does not care which provider handles a request, only that the interface contract is honored.
interface LLMProvider {
  name: string;
  models: ModelConfig[];
  complete(request: CompletionRequest): Promise<CompletionResponse>;
  stream(request: CompletionRequest): AsyncIterable<StreamChunk>;
  healthCheck(): Promise<HealthStatus>;
  tokenCost(model: string, tokens: TokenCount): CostEstimate;
}
This abstraction is what makes vendor independence real rather than aspirational. Adding a new provider means implementing one interface — no changes to core routing logic, no changes to agent code, no changes to configuration beyond adding the provider to the pool. Constitution Principle 11: extend without modifying core.
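As a sketch of what implementing that one interface looks like, here is a hypothetical adapter for a local Ollama endpoint. The placeholder types stand in for the real router types, and the payload shapes are assumptions, not the actual Obsidian adapter:
// Placeholder shapes standing in for the real router types (all assumptions).
interface ModelConfig { name: string; quality: string; maxTokens: number; }
interface CompletionRequest { model: string; prompt: string; }
interface CompletionResponse { text: string; }
interface StreamChunk { delta: string; }
interface HealthStatus { healthy: boolean; }
interface TokenCount { input: number; output: number; }
interface CostEstimate { usd: number; }

// Hypothetical local-Ollama adapter shaped to the LLMProvider contract above.
class OllamaProvider {
  name = "ollama";
  models: ModelConfig[] = [{ name: "llama3", quality: "medium", maxTokens: 8192 }];

  async complete(request: CompletionRequest): Promise<CompletionResponse> {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({ model: request.model, prompt: request.prompt, stream: false }),
    });
    const body = await res.json();
    return { text: body.response };   // Ollama returns the completion in `response`
  }

  async *stream(request: CompletionRequest): AsyncIterable<StreamChunk> {
    // Streaming elided in this sketch; a real adapter would read chunked responses here.
    yield { delta: (await this.complete(request)).text };
  }

  async healthCheck(): Promise<HealthStatus> {
    const res = await fetch("http://localhost:11434/api/tags");   // lists local models
    return { healthy: res.ok };
  }

  tokenCost(_model: string, _tokens: TokenCount): CostEstimate {
    return { usd: 0 };   // local models have no per-token cost
  }
}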
Provider Configuration
llm:
  providers:
    anthropic:
      api_key: "${secret:llm/anthropic-key}"
      models:
        - name: claude-sonnet-4-20250514
          quality: highest
          max_tokens: 200000
          cost_per_mtok_input: 3.00
          cost_per_mtok_output: 15.00
    openai:
      api_key: "${secret:llm/openai-key}"
      models:
        - name: gpt-4.1
          quality: high
          max_tokens: 128000
          cost_per_mtok_input: 2.00
          cost_per_mtok_output: 8.00
    ollama:
      endpoint: "http://localhost:11434"
      models:
        - name: llama3
          quality: medium
          max_tokens: 8192
          cost_per_mtok_input: 0
          cost_per_mtok_output: 0
Secret references (${secret:...}) are resolved at runtime through the Secrets Management layer. API keys never appear in configuration files. This is Principle 8: safety through boundaries, not trust.
Benchmarking
The router’s quality scores are not static. They are calibrated through continuous benchmarking — automated test prompts run against each provider on a configurable schedule, measuring response quality, latency, and consistency.
obs llm benchmark --provider anthropic
obs llm benchmark --all --suite coding
obs llm benchmark --report --since 30d
Benchmark results feed back into routing weights. If a provider’s quality degrades, the router adjusts automatically. If a new model appears that benchmarks better, it rises in the rankings. This is the feedback loop that makes Principle 7 (learn continuously from production reality) concrete.
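One way to picture that feedback loop: fold each benchmark run into the provider's stored quality score with an exponential moving average, so recent behavior dominates without letting a single bad run swing routing. The smoothing approach below is an assumption, not the documented algorithm:
// Illustrative quality-score update from benchmark runs (the EMA is an assumption).
function updateQualityScore(previous: number, benchmarkResult: number, alpha = 0.2): number {
  // benchmarkResult in 0..1; alpha controls how quickly routing reacts to drift.
  return alpha * benchmarkResult + (1 - alpha) * previous;
}
// A provider holding 0.9 that starts benchmarking at 0.6 drifts downward run by run,
// and the balanced strategy shifts traffic away as its composite score falls.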
Request Sanitization
Before any request reaches a provider, the sanitization layer strips sensitive content. Secret patterns, internal URLs, customer data markers — anything that should not leave the system boundary is redacted before transmission and restored after response.
This operates at the routing layer, not the agent layer, because agents should not need to think about data leakage. Security is structural, applied uniformly, and invisible to the components it protects.
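A sketch of redact-then-restore at the routing boundary, using placeholder tokens; the patterns and placeholder format are assumptions:
// Illustrative redact-and-restore around the provider call (patterns are assumptions).
const SENSITIVE_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9]{20,}/g,               // secret-key-shaped strings
  /https?:\/\/[\w.-]+\.internal\S*/g,   // internal URLs
];

function redact(text: string): { clean: string; restoreMap: Map<string, string> } {
  const restoreMap = new Map<string, string>();
  let clean = text;
  let counter = 0;
  for (const pattern of SENSITIVE_PATTERNS) {
    clean = clean.replace(pattern, (match) => {
      const placeholder = `[REDACTED_${counter++}]`;
      restoreMap.set(placeholder, match);   // originals never leave the system boundary
      return placeholder;
    });
  }
  return { clean, restoreMap };
}

function restore(text: string, restoreMap: Map<string, string>): string {
  let out = text;
  for (const [placeholder, original] of restoreMap) out = out.split(placeholder).join(original);
  return out;
}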
Why This Matters
The model landscape changes faster than any other part of the stack. New providers appear monthly. Pricing shifts weekly. Quality varies by task type, prompt structure, and model version. A system that hard-codes its LLM dependency is a system that will be expensive, fragile, or obsolete — usually all three simultaneously.
The routing layer is Obsidian’s answer to this reality. It treats providers as fungible resources differentiated by measurable properties, routes intelligently based on actual production data, fails over transparently, and adapts to changes without human intervention. This is not sophisticated load balancing. This is the architectural foundation that makes multi-agent orchestration economically viable at scale.