Gradient Labs

Gradient Labs builds an autonomous AI agent for customer operations in financial services — a “suite of specialist agents for lending, disputes, and KYC with a platform that runs the operations in between” (home). It handles support, collections, onboarding, KYB, disputes and claims over email, text and voice, “safely and effectively” in a regulated setting. The engineering bet — laid out in an unusually candid team blog — is reliability as a first-class problem: each customer conversation runs as a long-running Temporal workflow, a blend of OpenAI, Anthropic and Google models sits behind two layers of failover, and the agent’s behaviour is authored by ops experts as plain-English SOPs rather than deterministic dialog trees.

Vitals: founded 2023 · Series A extended to $26M (Octopus Ventures + CommerzVentures, Jun 2026; originally $13M led by Redpoint, Jul 2025) · 40+ people · London (HQ) + New York.

Business context — founders, funding, customers, traction

Founders (all ex-Monzo): Dimitri Masin (CEO; Monzo’s 20th employee, led a data team of 100+), Danai Antoniou (Chief Scientist; built an “industry-first fraud detection system”), Neal Lathia (CTO; built Monzo’s ML infrastructure) (About). They “started and scaled the Data Science and Machine Learning disciplines” at Monzo, then spent 14 months in stealth before launching the agent in 2024 (About).
Funding: £2.8M seed led by LocalGlobe (Aug 2024); $13M Series A led by Redpoint Ventures (Jul 2025, with Exceptional Capital, Liquid 2, LocalGlobe, Puzzle); Series A later increased to $26M, led by Octopus Ventures and CommerzVentures (Jun 2026) (blog, About).
Customers: Plum, Zego, SteadyPay, Pockit, LHV Bank (home, About).
Reported outcomes: 80–90% peak resolution, 98% CSAT, 32M customers served; Plum hit a 98.6% QA score and 80% CSAT with a “30 minutes and no engineering effort” setup; Zego saw 16% higher CSAT than human agents; SteadyPay’s voice agent hit a 60% success rate among engaged customers (home).
Endorsement: Tom Blomfield (former Monzo CEO) is a named ambassador (About). Team drawn from Monzo, Pleo, Google, Wise, Mastercard, Revolut (Careers).

The heavy lifting

Each conversation is a durable Temporal workflow. A single agent reply is a chain of LLM calls that can run for a long time. Rather than retry the whole chain when one call fails, “each conversation … is a long-running Temporal workflow which manages the conversation’s state, timers, and runs child workflows to generate responses” (incident, resilient). Progress is checkpointed out of the box, so a mid-chain failure resumes from the last completed step instead of restarting.
Two-layer LLM failover (provider, then model). Every completion request carries an “ordered list of API provider preferences”: GPT via OpenAI→Azure, Claude via Anthropic→AWS→GCP, Gemini via GCP regions. A request fails over on 5XX errors, rate limits, invalid outputs, or p99+ latency. For critical components the team keeps “tailored prompts for both the primary and backup models”, so when a whole provider group goes down the agent drops to a backup model rather than going silent (resilient).
Behaviour authored as plain-English SOPs, not workflows. They “do away with … box & arrow workflows altogether” in favour of “an engine for AI agents to safely follow SOPs that are written in plain English”. A moderately complex login-troubleshooting flow would need 60–80 workflow elements, but as an SOP it reads as a paragraph an ops expert can edit (sop).
Per-building-block model choice behind one interface. AI engineers “pick the ideal model for the building block they are working on” and swap models with a one-line edit; provider routing, failover and completion logging live in the internal abstraction, invisible to them (blend) — so quality/latency/cost is tuned per component, not committed across the whole agent.
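
A minimal sketch of why checkpointing pays off for a chain of LLM calls (speculative; this is not Temporal and not Gradient Labs’ code). The `history` dict is a hypothetical stand-in for the durable event log a workflow engine keeps, and `run_chain` plus the three step functions are invented for illustration:

```python
from typing import Callable

Step = tuple[str, Callable[[dict], str]]


def run_chain(steps: list[Step], history: dict[str, str]) -> dict[str, str]:
    """Run steps in order, skipping any whose result is already recorded.

    `history` plays the role of the durable event log: it outlives a crash,
    so a retry resumes after the last completed step.
    """
    for name, fn in steps:
        if name in history:
            continue  # completed before the crash; do not call the model again
        history[name] = fn(history)  # record the result as soon as it exists
    return history


calls: list[str] = []  # which steps actually executed, across both attempts
fail_once = {"draft": True}  # simulate one transient failure mid-chain


def classify(_: dict) -> str:
    calls.append("classify")
    return "procedural"


def draft(h: dict) -> str:
    calls.append("draft")
    if fail_once.pop("draft", False):
        raise TimeoutError("simulated provider timeout")
    return f"reply for a {h['classify']} query"


def review(h: dict) -> str:
    calls.append("review")
    return "approved: " + h["draft"]


steps: list[Step] = [("classify", classify), ("draft", draft), ("review", review)]
history: dict[str, str] = {}

try:
    run_chain(steps, history)  # first attempt dies inside "draft"
except TimeoutError:
    pass

result = run_chain(steps, history)  # retry resumes at "draft", not "classify"
```

The retry calls `draft` a second time but never re-runs `classify`. In Temporal the recording is done by the engine rather than by application code, which is the “out of the box” part.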

Stack

A Go backend on Encore.dev + Temporal, deployed on Google Cloud Run, fronting a multi-vendor LLM layer. Rows are from the engineering blog and job board; LLM routing/eval internals aren’t fully named — see Likely internals.

Layer	Choice	Evidence
Backend language	Go	Backend Eng, Backend JD
Backend engine	Encore.dev — Go services + Postgres + Pub/Sub to their own cloud account	Backend Eng
Durable execution	Temporal (Temporal Cloud) — long-running, fault-tolerant workflows	Backend Eng, resilient, incident
Datastore	Postgres, with pgvector for similarity search	Backend Eng
Deploy / cloud	Google Cloud Run, GCP, Kubernetes	incident, Backend JD
Analytics	Google BigQuery	Backend Eng
Frontend	Vercel (+ Product Engineering)	Backend Eng, Product JD
Incident mgmt	Incident.io	Backend Eng
Conversation core	a finite-state machine that triggers the agent, dispatches actions, handles failures	Backend Eng
LLM providers	OpenAI (OpenAI/Azure) · Anthropic (Anthropic/AWS/GCP) · Google (GCP) — a blend of GPT, Claude, Gemini	resilient, blend, home
Agent methods	tool calling, multi-step reasoning, customer-API integration, eval suites	AI Eng JD, blend
Retrieval	pgvector RAG + procedure execution (beyond standard RAG)	Backend Eng, rag
Compliance	SOC 2 Type 2; SSO, RBAC, audit logs; GDPR	home

Hard problems

The parts an engineer at this company loses sleep over. The public-signal column cites verified sources; the likely-approach column is speculation: hedged, best-practice fill-in.

Problem	Why it’s hard	Public signal	Likely approach (speculative)
Reliability of long agent chains	A reply is many LLM calls over long durations; in a bank, “there’s no excuse for [the] AI agent not to be able to reply” — but retrying the whole chain on one failure is wasteful and slow	conversations are “long-running Temporal workflow[s]” with checkpointed state, timers, child workflows (incident, resilient)	Workflow = unit of durability; idempotent activities per LLM call so partial progress survives crashes, autoscaler kills, and rate limits
LLM provider/model outages & limits	Frontier model APIs throttle, return 5XX errors, and slow down unpredictably; a single-vendor dependency takes the agent fully offline when that vendor fails	provider failover (4 trigger classes) + model failover with “tailored prompts for both the primary and backup models” (resilient)	Per-request ordered provider list with a short “unavailable” cache on rate-limit; auto-failover on latency-distribution shifts (an open idea they floated)
Choosing & safely executing the right approach	Finance queries split into info / personal / procedural; standard RAG can disclose internal-only info or miss that a customer is vulnerable	“the meta-capabilities of knowing when to use which approach”; vulnerability → “redirect … not answer” (rag)	A classifier/router picks RAG vs SOP vs tool-call; guardrails gate each turn; abstain/escalate to a human on low confidence or risk
Provably-compliant behaviour per turn	UK/US/EU rules (FCA Consumer Duty, CONC, Reg E/Z, PSD2, EU AI Act) must hold on every turn, not on average	“20+ guardrails” that “run on every turn of conversation” (home)	A guardrail layer wrapping each turn (deterministic policy checks + LLM critics), with full audit logging for regulators
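
A speculative sketch of the guardrail layer in the last row: every guardrail runs on every turn and every verdict is logged. Nothing here is from Gradient Labs; the two checks, `check_turn`, and `TurnResult` are hypothetical, and the keyword match stands in for what would more plausibly be a classifier or an LLM critic:

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

# A guardrail inspects (customer_message, draft_reply) and returns a reason
# string when it objects, or None when it passes.
Guardrail = Callable[[str, str], Optional[str]]


def no_card_numbers(_msg: str, draft: str) -> Optional[str]:
    # Deterministic policy check: never echo 16 digits that look like a card number.
    if re.search(r"(?:\d[ -]?){16}", draft):
        return "draft contains a card number"
    return None


def vulnerable_customer(msg: str, _draft: str) -> Optional[str]:
    # Keyword matching is a placeholder for a real vulnerability classifier.
    if any(w in msg.lower() for w in ("bereavement", "can't afford", "lost my job")):
        return "possible vulnerability: hand off to a human"
    return None


@dataclass
class TurnResult:
    action: str  # "send" or "escalate"
    audit: list[tuple[str, str]] = field(default_factory=list)


def check_turn(msg: str, draft: str, guardrails: list[Guardrail]) -> TurnResult:
    """Run every guardrail on this turn and record each verdict."""
    result = TurnResult(action="send")
    for g in guardrails:
        reason = g(msg, draft)
        result.audit.append((g.__name__, reason or "pass"))
        if reason is not None:
            result.action = "escalate"  # keep checking so the audit log is complete
    return result


rails = [no_card_numbers, vulnerable_customer]
ok = check_turn("What are your opening hours?", "We are open 9 to 5.", rails)
blocked = check_turn("I lost my job and can't pay", "Your plan is due Friday.", rails)
```

Running all checks even after one fails is a design choice made here so the audit trail shows every verdict for the turn; a latency-sensitive system might short-circuit instead.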

Likely internals

The infrastructure Gradient Labs doesn’t fully name, inferred from the stack it does (Go/Encore/Temporal on GCP, a multi-vendor LLM layer):

Component	Likely choice	Basis
LLM router / gateway	in-house orchestrator over OpenAI/Anthropic/Google with provider-preference lists	named “orchestrator” / “internal abstraction” (owl, blend); routing + failover logic described, no third-party gateway named
Eval / simulation	in-house eval suites + conversation simulation + LLM-as-judge	“eval suites” (AI Eng JD); “simulations … customer conversation synthesis” (blend); exact tooling unstated
Guardrail engine	layered deterministic + LLM policy checks per turn	“20+ guardrails on every turn” (home); implementation unstated
Message bus	GCP Pub/Sub (via Encore)	“Pub/Sub” through Encore (owl); GCP-native given the stack
Auth / SSO	a vendor (e.g. WorkOS) for SAML/OIDC + RBAC + audit	“SSO … audit logs … role-based permissions” (home); vendor unnamed
Frontend framework	Next.js on Vercel	Vercel verified (owl); conventional pairing; Product Engineering role
Voice stack	telephony + STT/TTS vendor for the voice agent	voice product live (home); vendor unnamed
Deployment topology	single-tenant / deploy-into-customer-cloud for some enterprise	Founding Platform JD: “across our and others’ cloud environments” (ashby); scope unstated
Observability	Google Cloud Profiler + custom metrics/alerts; Incident.io	Profiler + latency alerts used in a real incident (incident); Incident.io adopted (owl)

Architecture

A conversation is a durable workflow

Inbound messages (email, text, voice, via help desks or API) hit a finite-state machine that “models conversations and is responsible for triggering our first AI agent, dispatching actions, and handling failures” (owl). That FSM runs inside a per-conversation Temporal workflow, so state, timers and the LLM-call child workflows all checkpoint. The agent classifies the query (general info / personal info / procedural), picks an approach — RAG over pgvector, a plain-English SOP, or tool calls to the customer’s APIs — and routes the result through 20+ guardrails on every turn before replying, escalating to a human when a customer looks vulnerable or the action is out of policy (rag, sop, home).

Mermaid source

flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef ai fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef data fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;
  classDef human fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;

  In(["Inbound<br/>email · text · voice<br/>(help desk / API)"]):::io

  subgraph WF["Conversation = one durable Temporal workflow · state · timers · child workflows"]
    direction TB
    FSM("Conversation FSM<br/>triggers agent · dispatches actions · handles failures"):::data
    Classify("Classify the query<br/>general info · personal info · procedural"):::ai
    Route{"Pick the approach<br/>(meta-capability)"}:::ai
    RAG("Answer<br/>retrieve over pgvector"):::ai
    Proc("Run procedure<br/>plain-English SOP"):::ai
    Tool("Take action<br/>tool calls to customer APIs"):::ai
    Guard{"20+ guardrails per turn<br/>FCA Consumer Duty · CONC · Reg E/Z · PSD2 · EU AI Act"}:::data
    FSM --> Classify --> Route
    Route -->|info| RAG
    Route -->|account| Tool
    Route -->|"can you…?"| Proc
    RAG --> Guard
    Proc --> Guard
    Tool --> Guard
  end

  Reply(["Reply to customer<br/>(observable · auditable)"]):::io
  Human("Escalate / sign-off<br/>vulnerability · high-stakes"):::human

  In --> FSM
  Guard -->|pass| Reply
  Guard -->|"vulnerable / out of policy"| Human --> Reply
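
The conversation FSM and the routing step in the diagram can be sketched as a transition table (speculative; the blog names an FSM but not its states). The state names, event names, and the `escalate` fallback are all assumptions made for illustration:

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()     # waiting for the customer's next message
    GENERATING = auto()  # the agent is drafting a reply
    GUARDING = auto()    # the draft is being checked by guardrails
    HANDED_OFF = auto()  # a human owns the conversation from here


# (current state, event) -> next state. Any pair not listed is rejected.
TRANSITIONS = {
    (State.WAITING, "message_received"): State.GENERATING,
    (State.GENERATING, "draft_ready"): State.GUARDING,
    (State.GENERATING, "agent_failed"): State.HANDED_OFF,
    (State.GUARDING, "guardrails_passed"): State.WAITING,
    (State.GUARDING, "guardrails_failed"): State.HANDED_OFF,
}

# Query class -> approach, following the three edge labels in the diagram.
APPROACH = {"info": "rag", "account": "tool_call", "procedural": "sop"}


def pick_approach(query_class: str) -> str:
    # Assumption: an unrecognised class goes to a human, not to a guessed approach.
    return APPROACH.get(query_class, "escalate")


def step(state: State, event: str) -> State:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.name}") from None


def run(events: list[str], state: State = State.WAITING) -> State:
    for event in events:
        state = step(state, event)
    return state


happy = run(["message_received", "draft_ready", "guardrails_passed"])
failed = run(["message_received", "agent_failed"])
```

Keeping the legal transitions in one table means an out-of-order event (say, a draft arriving while the conversation is waiting) raises immediately instead of silently corrupting state.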

Two-layer LLM failover

The model layer is where reliability is won. Each agent building block chooses its model (a one-line setting), and every completion request it makes starts with an ordered provider-preference list, configurable globally and per company, with proportional traffic splitting. On 5XX errors, rate limits, invalid outputs, or p99+ latency the request fails over to the next provider for the same model; if all of a model group’s providers are down, it fails over to a backup model that has its own tailored prompt (resilient, blend).

Mermaid source

flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef ai fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef data fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;
  classDef sys fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;

  Req(["Completion request<br/>per agent building block<br/>(one-line model choice)"]):::io
  Pref("Ordered provider preferences<br/>global + per-company · proportional split"):::data

  subgraph Primary["Primary model · provider failover"]
    direction TB
    P1("OpenAI GPT<br/>OpenAI → Azure"):::ai
    P2("Anthropic Claude<br/>Anthropic → AWS → GCP"):::ai
    P3("Google Gemini<br/>GCP regions"):::ai
  end

  Trig{"Fail over when:<br/>5XX · rate-limited · invalid output · p99+ latency"}:::data
  Backup("Model failover<br/>backup prompt-model pair<br/>(whole provider group down)"):::ai
  Out(["Validated completion"]):::io

  Req --> Pref --> Primary
  Primary --> Trig
  Trig -->|"next provider"| Primary
  Trig -->|"group unavailable"| Backup
  Trig -->|ok| Out
  Backup --> Out
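
A minimal sketch of the two failover layers, plus the short “unavailable” cache that the Hard problems table lists as speculation. This is not Gradient Labs’ orchestrator: `Router`, the exception classes, the 30-second cooldown, and the `gpt` / `claude` placeholder model names are all assumptions, and traffic splitting is left out:

```python
import time


class ProviderError(Exception):
    """Stand-in for a 5XX, an invalid output, or a call over the latency budget."""


class RateLimited(ProviderError):
    """Stand-in for a rate-limit response."""


class Router:
    def __init__(self, call, cooldown_s=30.0, clock=time.monotonic):
        self.call = call              # (provider, model, prompt) -> str
        self.cooldown_s = cooldown_s  # how long to avoid a rate-limited pair
        self.clock = clock            # injectable so the demo needs no sleeping
        self.unavailable_until = {}   # (provider, model) -> clock time

    def complete(self, prompts, preferences):
        """prompts: {model: tailored prompt}; preferences: [(model, [providers])]."""
        attempts = []
        for model, providers in preferences:  # layer 2: model failover
            for provider in providers:        # layer 1: provider failover
                key = (provider, model)
                if self.unavailable_until.get(key, float("-inf")) > self.clock():
                    continue  # rate-limited recently: skip without calling
                attempts.append(key)
                try:
                    return self.call(provider, model, prompts[model]), attempts
                except RateLimited:
                    self.unavailable_until[key] = self.clock() + self.cooldown_s
                except ProviderError:
                    pass  # fall through to the next provider
        raise RuntimeError(f"no provider could serve the request: {attempts}")


def fake_call(provider, model, prompt):
    # Simulated outage: OpenAI is rate-limiting and Azure is returning errors.
    if provider == "openai":
        raise RateLimited()
    if provider == "azure":
        raise ProviderError()
    return f"{provider}/{model}: {prompt}"


now = [0.0]  # fake clock, advanced by hand
router = Router(fake_call, cooldown_s=30.0, clock=lambda: now[0])
prompts = {"gpt": "primary prompt", "claude": "backup prompt"}
preferences = [("gpt", ["openai", "azure"]), ("claude", ["anthropic", "aws", "gcp"])]

first, first_attempts = router.complete(prompts, preferences)
second, second_attempts = router.complete(prompts, preferences)  # openai cooling down
now[0] = 31.0
third, third_attempts = router.complete(prompts, preferences)    # cooldown expired
```

With both GPT providers failing, the request lands on Claude via Anthropic and is sent the backup model’s own prompt. The second call skips OpenAI entirely while the cooldown is active, then tries it again once 30 seconds have passed.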

Team & process

A small (40+ people), London-headquartered team of ex-Monzo / Pleo / Google builders, working hybrid 2–3 days/week from the Liverpool Street office, with a New York presence and an ex-finance AI Delivery team that takes customers live (About, Careers).

Role	Person	Source
Co-founder / CEO	Dimitri Masin	About
Co-founder / Chief Scientist	Danai Antoniou	About
Co-founder / CTO	Neal Lathia	About

Engineering splits into a few sharply-scoped tracks: Backend Engineers (senior/staff+, “own systems that matter — from the first architectural decision through to production, scale, and everything that breaks”), AI Engineers (a build-and-ship role turning “ambiguous customer support problems into reliable, observable AI agents” and owning eval suites), a Founding Platform & Security Engineer reporting to the CTO to “deploy our agent globally across multiple clouds,” and Product Engineers (Careers). The blog itself is the process tell: the team writes openly about durable-execution design, a memory-leak incident root-caused to the Temporal workflow cache (and the Cloud Run autoscaling pitfall that followed the fix), and why they blend models — the engineering culture is to “finely tune every single layer … the prompts, the LLM providers, the databases, and all the way through to the containers” (incident).

Sources

Reconstructed from public sources only — no insider information. Crawled 2026-06-10 via Chrome MCP (logged-out) + the Ashby posting API. First-party (gradient-labs.ai, the engineering blog at blog.gradient-labs.ai, Gradient Labs’ Ashby board) prioritized; press labeled third-party. Claim tiers: verified (stated on a public page, linked) · inferred (reasoned from a cited signal, confidence flagged) · speculative (best-practice fill-in, labeled). Links are live; pages change, so the supporting quote for each claim is kept in this repo’s evidence map (evidence/gradient-labs-evidence-map.md).

#	Source	Link
S1	Homepage	https://gradient-labs.ai/
S2	About us	https://gradient-labs.ai/about
S3	Marketing blog index	https://gradient-labs.ai/blog
S4	Engineering blog — archive	https://blog.gradient-labs.ai/archive
S5	Drawing the Rest of the Owl (Backend Engineering)	https://blog.gradient-labs.ai/p/drawing-the-rest-of-the-owl
S6	Building resilient agentic systems	https://blog.gradient-labs.ai/p/building-resilient-agentic-systems
S7	Anatomy of an AI agent incident	https://blog.gradient-labs.ai/p/anatomy-of-an-ai-agent-incident
S8	LLMs at Gradient Labs: the perfect blend	https://blog.gradient-labs.ai/p/llms-at-gradient-labs-the-perfect
S9	Are AI agents just RAG in disguise?	https://blog.gradient-labs.ai/p/are-ai-agents-just-rag-in-disguise
S10	Making customer support automation as simple as writing a document	https://blog.gradient-labs.ai/p/making-customer-support-automation
S11	Job board (Ashby)	https://jobs.ashbyhq.com/gradient-labs