On March 14, 2025, Anthropic's API went down for several hours. On January 23, 2025, OpenAI had a major outage that lasted most of the day. AWS Bedrock has had multiple incidents affecting model availability. Every major LLM provider has outages. The question is not if your provider goes down, but when.
If your agent platform is built on a single LLM backend, an outage means your agents stop working. PureClaw was designed so that never happens.
The Backend Protocol
At the core of PureClaw's multi-model architecture is a Python Protocol that every backend must implement. The interface is minimal: accept a conversation history, return a streaming response, support tool calling. That is it.
Every backend -- whether it is a local GPU running vLLM, a cloud API like Anthropic or Gemini, or a CLI agent like Claude Code -- implements this same Protocol. The rest of PureClaw does not know or care which backend is active. Tool execution, conversation management, streaming, audit logging: all of it works identically regardless of the model underneath.
This is not an abstraction for its own sake. It is the foundation that makes failover, model switching, and hybrid routing possible.
The Eight Backends
NVIDIA Nemotron Super (local GPU). The default backend. A 120B-parameter Mixture-of-Experts model (12B active) served via vLLM on NVIDIA Blackwell GPUs. Sub-second first-token latency, zero API costs, 262K context window, and continuous batching for concurrent requests. When you have the hardware, nothing beats local inference.
Ollama (local, any model). For deployments without high-end GPUs. Ollama runs open models on consumer hardware. Smaller context, slower inference, but zero cost and zero network dependency. A good starting point before investing in GPU infrastructure.
Anthropic API (cloud). Direct access to Claude via the Anthropic Messages API with prompt caching. Full tool use, streaming, and the full suite of Claude models. Pay-per-token pricing.
AWS Bedrock (cloud). Claude via the AWS Bedrock Converse API. Same model capabilities, but billed through your AWS account. Useful for organizations that consolidate cloud spend through AWS.
Claude Code (CLI agent). Anthropic's agentic CLI. Full sandboxed tool execution with Claude's own tool suite. PureClaw delegates entire tasks to Claude Code and streams the output back to the conversation.
Codex (CLI agent). OpenAI's agentic CLI. Same delegation pattern as Claude Code, but powered by OpenAI's models.
Gemini CLI (CLI agent). Google's agentic CLI. Brings Gemini's grounding capabilities and long-context window to the PureClaw tool execution layer.
Hybrid. The most sophisticated backend. Routes requests based on complexity: simple queries go to a fast API backend (Gemini Flash), complex multi-step tasks go to an agentic CLI backend (Claude Code). One conversation, two models, automatic routing.
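The hybrid backend's complexity-based routing can be illustrated with a toy heuristic. The backend names and the cues used here are assumptions for illustration, not PureClaw's actual router:

```python
def route(message: str, tool_call_requested: bool) -> str:
    """Pick a backend name from rough complexity cues (illustrative only)."""
    # Multi-step or tool-using requests go to the agentic CLI backend.
    if tool_call_requested or len(message.split()) > 200:
        return "claude_code"   # hypothetical backend name
    # Short, self-contained queries go to the fast, cheap API backend.
    return "gemini_flash"      # hypothetical backend name
```

A production router would likely use a classifier or the model itself to score complexity, but the shape is the same: one decision function in front of two backends.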
Automatic Failover
PureClaw supports configurable failover chains. You define the order of preference:
FAILOVER_CHAIN=vllm,bedrock_api,gemini_api
If the primary backend (vLLM) fails a health check or returns an error, PureClaw automatically promotes the next backend in the chain (Bedrock). If Bedrock also fails, traffic falls back to Gemini. The switch happens mid-conversation without the user noticing. When the primary backend recovers, PureClaw switches back.
This is not a theoretical feature. We run PureClaw in production with failover chains that activate multiple times per week. GPU maintenance, API rate limits, transient cloud errors: the failover chain handles all of them transparently.
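A minimal sketch of how such a chain might be consumed, reading the same FAILOVER_CHAIN variable shown above. The function name and the shape of the health-check results are illustrative, not PureClaw's code:

```python
import os


def pick_backend(healthy: dict[str, bool]) -> str:
    """Return the first healthy backend in the configured failover chain.

    `healthy` maps backend names to the result of their latest health check.
    Names and defaults here are illustrative.
    """
    chain = os.environ.get(
        "FAILOVER_CHAIN", "vllm,bedrock_api,gemini_api"
    ).split(",")
    for name in chain:
        if healthy.get(name, False):
            return name
    raise RuntimeError("all backends in the failover chain are down")
```

Calling this before each request (or on each health-check failure) is enough to get the promote-and-recover behaviour described: when the primary's health check starts passing again, it is first in the chain and wins automatically.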
Why This Matters
Resilience. No single point of failure. An outage at any one provider does not affect your agents. The failover chain ensures continuous operation across local and cloud backends.
Cost optimization. Use local GPU inference for high-volume, latency-sensitive workloads (zero marginal cost). Fall back to cloud APIs for burst capacity or when the GPU fleet is busy. The hybrid backend automates this: fast queries go to the cheap backend, complex queries go to the capable one.
Vendor independence. LLM pricing changes quarterly. Content policies shift. New models launch. With eight backends behind a common Protocol, switching from one provider to another is a configuration change, not a rewrite. You are never locked into a single vendor's pricing, terms, or availability.
Model-appropriate routing. Different tasks need different models. Gemini Flash excels at classification and summarization. Claude excels at complex reasoning and tool use. Nemotron Super excels at high-throughput local inference. PureClaw lets you use each model where it is strongest.
Adding a New Backend
Implementing the Backend Protocol requires three methods: initialize the client, send a message with streaming, and handle tool calls. A new backend can be added in under 200 lines of Python. The Protocol enforces the interface contract; PureClaw handles everything else.
We started with two backends. We now have eight. As new models and providers emerge, adding them is straightforward. The architecture was designed for a world where the best model today is not the best model tomorrow.
PureClaw does not bet on one model. It bets on optionality.