PureClaw was built on bare metal. The first version ran on NVIDIA Blackwell GPUs in our own data centre, serving autonomous AI agents through Telegram, Discord, and email with local inference via vLLM. No cloud dependency, no token costs, sub-second response times.
That architecture works. But it does not scale to customer-facing deployments. When enterprises want to run PureClaw agents in their own environments, they need managed infrastructure, global availability, and integration with the cloud platforms they already use.
Google Cloud Platform is the natural fit. Here is why, and how we are building it.
Why Google Cloud
Google's AI infrastructure strategy aligns with PureClaw's architecture more closely than any other cloud provider. Three capabilities stand out:
Vertex AI Agent Engine provides managed agent runtime with session persistence, memory banks, and automatic scaling. PureClaw's agent definitions deploy via the Agent Development Kit (ADK), mapping our observer hierarchy directly to ADK agent types: SequentialAgent for multi-step document pipelines, ParallelAgent for concurrent data gathering, and LoopAgent for retry-based operations.
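The observer-to-ADK mapping above can be sketched as a simple dispatch table. This is an illustrative sketch, not PureClaw's actual code or the real ADK API surface — the observer kinds and function names are assumptions; only the three ADK agent class names come from the text.

```python
from enum import Enum

# Hypothetical observer categories standing in for PureClaw's
# observer hierarchy; the names are illustrative.
class ObserverKind(Enum):
    PIPELINE = "pipeline"   # multi-step document pipelines
    FANOUT = "fanout"       # concurrent data gathering
    RETRY = "retry"         # retry-based operations

# The mapping described in the text: each observer kind lands on
# one ADK agent type.
ADK_AGENT_FOR_KIND = {
    ObserverKind.PIPELINE: "SequentialAgent",
    ObserverKind.FANOUT: "ParallelAgent",
    ObserverKind.RETRY: "LoopAgent",
}

def adk_agent_type(kind: ObserverKind) -> str:
    """Return the ADK agent class name for a given observer kind."""
    return ADK_AGENT_FOR_KIND[kind]
```

In a real deployment the strings would be replaced by the ADK classes themselves, but the shape of the mapping is the point: each observer kind has exactly one managed-runtime equivalent.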
Gemini as a native backend. PureClaw already supports eight interchangeable LLM backends. Adding Gemini 2.5 Flash for high-throughput tasks (email classification, dispatcher cards, context compression) and Gemini Pro for deep reasoning (intelligence analysis, document compilation) consolidates three separate billing relationships into a single GCP account. The 2M token context window eliminates conversation compression entirely for cloud-hosted agents.
The A2A protocol. Google's Agent-to-Agent protocol solves the inter-agent communication problem we have been addressing with our custom mesh layer. PureClaw already runs a distributed mesh of agent instances with authority levels (ALERT_ONLY, AUTO_FIX, ESCALATE, FULL) and HTTP-based message passing. A2A provides a standardised protocol for the same pattern, enabling PureClaw agents to coordinate with third-party agent systems without custom integration work.
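The authority model in the mesh layer can be sketched as an ordered enum plus a gate check. The message envelope and field names below are assumptions for illustration — neither PureClaw's actual wire format nor the A2A spec — but the four authority levels come straight from the text.

```python
from dataclasses import dataclass
from enum import IntEnum

# The four authority levels named in the text, ordered so that a
# numeric comparison expresses "at least this much authority".
class Authority(IntEnum):
    ALERT_ONLY = 0   # may notify, may not act
    AUTO_FIX = 1     # may apply safe remediations
    ESCALATE = 2     # may hand off to a higher-authority agent
    FULL = 3         # unrestricted within its scope

# Hypothetical message envelope; the real mesh (or A2A) format differs.
@dataclass
class MeshMessage:
    sender: str
    recipient: str
    action: str
    required_authority: Authority

def can_execute(agent_authority: Authority, msg: MeshMessage) -> bool:
    """An agent may execute an action only at or above the required level."""
    return agent_authority >= msg.required_authority
```

A standardised protocol like A2A replaces the custom envelope, but the authority gate stays the same: the receiving agent checks the requested action against its own level before acting.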
The Hybrid Architecture
We are not migrating to the cloud. We are extending to it. The architecture is hybrid by design:
On-prem handles latency-sensitive inference. Interactive conversations flow through local NVIDIA GPUs via vLLM. Nemotron Super delivers sub-second first-token latency with zero network round-trips. Voice transcription stays local via faster-whisper. This is where the hardware investment pays off.
GCP handles elastic scale and managed services. The core Nexus service runs on Cloud Run with always-on instances for bot long-polling. Background observers migrate to GKE Autopilot CronJobs with automatic resource provisioning. Pub/Sub replaces direct observer scheduling with a fully event-driven architecture, decoupling producers from consumers.
The failover chain becomes: vllm(local) -> gemini_flash(vertex) -> gemini_pro(vertex) -> claude(vertex). If the local GPU fleet is busy or offline, inference seamlessly shifts to Vertex AI. If one cloud model has an outage, the next in the chain activates. No manual intervention.
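The failover chain above reduces to a short loop: try each backend in order, fall through on any error. This is a minimal sketch — the backend callables stand in for whatever SDK each backend actually uses, and the error handling in production would distinguish retryable from fatal failures.

```python
# Chain order from the text: local vLLM first, then the Vertex-hosted
# cloud backends.
FAILOVER_CHAIN = ["vllm_local", "gemini_flash", "gemini_pro", "claude_vertex"]

class AllBackendsFailed(RuntimeError):
    pass

def complete(prompt: str, backends: dict) -> tuple[str, str]:
    """Try each backend in chain order; return (backend_name, response).

    `backends` maps backend names to callables taking a prompt string.
    """
    last_error = None
    for name in FAILOVER_CHAIN:
        try:
            return name, backends[name](prompt)
        except Exception as exc:  # busy GPU, outage, quota, timeout, ...
            last_error = exc
    raise AllBackendsFailed(f"all backends failed: {last_error}")
```

If the local GPU raises (fleet busy or offline), the call transparently lands on Gemini Flash, and so on down the chain — no manual intervention, exactly the behaviour described above.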
GCP Service Mapping
Every GCP service we use maps to a specific capability in PureClaw's existing architecture:
Cloud Run hosts the core Nexus service. It is a single-replica stateful service that coordinates Telegram/Discord bots, the mesh server, and the dispatcher. Cloud Run's always-on CPU allocation keeps the event loop live for long-polling. Zero cluster management overhead compared to running our own K3s.
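A deploy along those lines looks roughly like the following — service name, project, region, and image are placeholders, not our actual configuration:

```shell
# Illustrative Cloud Run deploy for a single-replica long-polling service.
# --no-cpu-throttling keeps the CPU allocated between requests so the
# event loop stays live; min/max instances pin it to one always-on replica.
gcloud run deploy nexus \
  --image gcr.io/PROJECT_ID/nexus:latest \
  --region europe-west2 \
  --no-cpu-throttling \
  --min-instances 1 \
  --max-instances 1
```

The always-allocated CPU setting is the key flag: default Cloud Run throttles the CPU outside request handling, which would stall a long-polling bot.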
GKE Autopilot runs the 20+ background observers as Kubernetes CronJobs. Google provisions exactly the resources each job needs: 256Mi for lightweight heartbeats, 1Gi with LLM access for deep analysis tasks. The manifests are identical to our current K3s CronJobs.
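A lightweight observer's manifest looks roughly like this — name, schedule, and image are placeholders, but the 256Mi request matches the heartbeat tier described above, and the same manifest shape runs unchanged on K3s and on GKE Autopilot:

```yaml
# Illustrative CronJob for a lightweight heartbeat observer.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: heartbeat-observer
spec:
  schedule: "*/5 * * * *"          # every five minutes (placeholder)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: observer
              image: pureclaw/observer:latest   # placeholder image
              args: ["--observer", "heartbeat"]
              resources:
                requests:
                  memory: "256Mi"   # Autopilot provisions exactly this
                  cpu: "250m"
                limits:
                  memory: "256Mi"
```

A deep-analysis observer would differ only in the `resources` block (1Gi memory) and its container arguments.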
Pub/Sub and Eventarc form the event bus. Observers publish results to topic channels. Eventarc routes events to the right consumers: Telegram notifications, BigQuery audit logging, or downstream processing pipelines. An observer does not need to know where its output goes.
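The decoupling that Pub/Sub provides can be shown with an in-process stand-in: the observer publishes to a topic and never learns who consumes. This is a toy sketch of the pattern, not the Pub/Sub client API; topic names and handlers are illustrative.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """In-process stand-in for a Pub/Sub topic + Eventarc routing."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The publisher has no knowledge of its consumers.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
audit_log: list[dict] = []                             # BigQuery stand-in
bus.subscribe("observer.results", audit_log.append)
bus.subscribe("observer.results", lambda e: None)      # notifier stand-in

# An observer publishes its result and is done; routing is the bus's job.
bus.publish("observer.results", {"observer": "heartbeat", "status": "ok"})
```

Swapping `EventBus` for real Pub/Sub topics and Eventarc triggers changes the transport, not the shape: producers stay ignorant of consumers either way.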
Firestore replaces SQLite for operational state: sessions, observer state, email tracking, draft queues. Real-time listeners enable cross-service state coordination that file-based storage cannot provide at scale.
BigQuery provides the audit layer. Every tool call, LLM invocation, and agent decision gets logged via the ADK Agent Analytics plugin. Cost tracking per backend, per observer, per user. This is the compliance layer that enterprise customers require.
Cloud Storage handles media: voice memo uploads, generated documents, image processing, and daily memory backups with lifecycle policies.
Secret Manager centralises API keys, OAuth tokens, and mesh certificates with IAM-bound access. Each Cloud Run service gets only the secrets it needs.
Vertex AI Model Strategy
PureClaw's multi-backend architecture maps directly to Vertex AI Model Garden:
Gemini 2.5 Flash handles high-volume, low-latency tasks: dispatcher card generation, email classification, context compression, and observer triage. At $0.30/M input tokens, it is cost-effective for the hundreds of daily operations that do not need deep reasoning.
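A back-of-envelope check on that price point — the volumes below are illustrative, not measured PureClaw numbers; only the $0.30/M input-token rate comes from the text:

```python
PRICE_PER_M_INPUT = 0.30   # USD per million input tokens (from the text)

def daily_input_cost(ops_per_day: int, avg_input_tokens: int) -> float:
    """Daily input-token cost in USD for a given operation volume."""
    total_tokens = ops_per_day * avg_input_tokens
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT

# e.g. 500 classification calls/day at ~2,000 input tokens each
# is 1M tokens -> $0.30/day on input.
cost = daily_input_cost(500, 2000)
```

At these volumes the input side of triage-class workloads stays in the cents-per-day range, which is why the deep-reasoning models are reserved for the tasks that need them.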
Gemini Pro replaces AWS Bedrock Claude as the primary cloud reasoning backend. Native grounding with Google Search powers the intelligence analysis and threat feed observers, and the 2M token context window lets long investigations run without an intermediate compression step.
Claude on Vertex AI (MaaS) handles tool-use-heavy tasks that rely on Claude's instruction-following strength. It is available as Claude Opus and Sonnet on Vertex AI, with the same API format and consolidated GCP billing.
Local vLLM remains the primary backend for interactive conversations. NVIDIA Nemotron Super on Blackwell GPUs, with the cloud backends as automatic failover.
What This Enables
The hybrid architecture enables three capabilities that a pure on-prem deployment cannot provide:
Customer-facing agent deployments. Enterprises can run PureClaw agents on GCP infrastructure with managed scaling, per-second billing, and Vertex AI Agent Engine handling session persistence and memory. No need to provision their own GPU fleet.
Global availability. Cloud Run and GKE Autopilot provide multi-region deployment. An agent that serves users in London and Tokyo does not need GPU infrastructure in both locations when cloud inference handles the remote traffic.
Unified observability. Cloud Logging, Cloud Trace, and BigQuery Agent Analytics provide a single pane of glass across all agent operations. Combined with PureClaw's existing audit trail and credential redaction, this meets the compliance requirements of regulated industries.
PureClaw was designed from day one to be model-agnostic and deployment-flexible. Google Cloud does not replace our infrastructure. It extends it to the scale that enterprise customers need.