Harvey

Harvey is AI software for legal and professional services: Assistant, Vault, Workflows, and Word/Outlook add-ins used by “1500+ customers in 60+ countries” who now run “more than 25,000 custom agents” on it (careers, growth). The technically interesting part is the shift it describes openly: “we’ve moved Harvey from a chat product to cloud agents — from answering a lawyer’s question to completing a lawyer’s task end to end,” such as reviewing a data room across hundreds of thousands of documents (runtime). To do that for regulated law firms, Harvey built and runs its own multi-model cloud agent runtime, because zero data retention, model neutrality, and cost control are requirements that no managed agent platform meets yet.

Vitals: founded 2022 · $200M growth round at $11B (Mar 2026, co-led by GIC + Sequoia) · several hundred employees · San Francisco HQ (+ NY, Singapore) (growth, careers).

Business context — founders, funding, customers, moat

Founders: Winston Weinberg (CEO, ex-O’Melveny & Myers securities/antitrust litigator) and Gabriel Pereyra (President & Chief Scientist, ex-DeepMind / Meta AI). Pereyra is the bylined author of the runtime and Spectre engineering posts (Wikipedia, runtime).
Funding: $200M growth round at an $11B valuation, co-led by GIC and Sequoia (Mar 25, 2026), with a16z, Coatue, Conviction, Elad Gil, Evantic, Kleiner Perkins (growth). Preceded by a $300M Series E at $5B co-led by Kleiner Perkins + Coatue (Jun 2025), with OpenAI Startup Fund and REV (the venture arm of RELX, LexisNexis’ parent company) among earlier backers (Series E).
Traction: processing “billions of prompt tokens and millions of daily requests” (careers); customers run “more than 25,000 custom agents” (growth). Named customers span Deutsche Telekom, Reed Smith, Syngenta, Repsol, Cuatrecasas, Adecco, CMS, Ashurst, Baker Donelson, GSK Stockmann (home, growth).
Moat (positioning): deep legal domain integration (former lawyers embedded in engineering), enterprise compliance (SOC 2 Type II, ISO 27001/27701/42001, GDPR, CCPA), and ownership of the agent runtime itself, which is what makes conflict-aware governance and sovereign/self-host deployments possible (security, runtime).

The heavy lifting

An abstraction layer that turns “which model” into a routing decision. Every provider exposes a different agent harness (“different tool-call formats, stop conditions, streaming behavior, and failure modes”) and a different sandbox, so the same task tuned for one model underperforms on another. Harvey “built an abstraction layer that normalizes the harness, the sandbox, and the behavioral differences beneath a single interface,” then routes across frontier labs (Anthropic, OpenAI), cloud runtimes (Azure Foundry, AWS, Google), and self-hosted open-source (runtime). The constraint this addresses is structural, not a matter of preference: a client that trains its own models “will not allow its outside counsel to send sensitive legal matters through a competitor’s model,” so multi-model support is a conflicts gate, not a feature.
Zero data retention designed into the runtime, not bolted on. The tempting shortcut — store during the run, call a delete endpoint after — “isn’t zero retention; it is retention followed by deletion.” Harvey designs so customer data is “not written into durable application storage by default”; the agent’s transient working disk is “lifecycle-bound to the sandbox and automatically cleaned up as part of teardown.” The hard part is that agents are stateful (working memory, checkpoints), and a managed runtime “earns its keep precisely by persisting all of that for you” — so “automatic state persistence and zero retention are mutually exclusive,” which is exactly why owning the runtime is non-negotiable for privileged work (runtime).
A LAB-benchmarked cost router that sends saturated tasks to cheap models. A single agent run can be “hundreds of model and tool calls over a large corpus,” so frontier-only routing doesn’t scale economically. Harvey’s legal agent benchmark (LAB) shows “open-source models match frontier quality at a fraction of the cost” on many task types, so it routes “to the most efficient model that meets the quality threshold, including open-source models we host ourselves,” and reports empirical “3-5x cost reductions versus a frontier-only approach” (runtime).
Embedding access enforced at the database layer, because embeddings are reversible. Recent work (Jha et al.) shows “an attacker can reverse any embedding model”, so Harvey treats embeddings “as an extension of your source data” and partitions the vector DB so each workspace has “separate collections and storage … segmented along tenant boundaries and tenant IDs rather than filtered after the fact.” It explicitly rejects post-filtering — “a bug or misconfiguration in the filter logic becomes a complete breach” and leaks membership — enforcing access “so that unauthorized vectors are never retrieved in the first place” (embeddings).
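
To make the difference between partitioning and post-filtering concrete, here is a minimal in-memory sketch. It is an illustration only: `TenantPartitionedStore`, its methods, and the workspace IDs are hypothetical names, and Harvey's actual vector DB and API are not public. The point it shows is that when each workspace has its own collection, a search never scans another tenant's vectors, so there is no filter step that could be misconfigured.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantPartitionedStore:
    """Toy store with one separate collection per workspace (hypothetical API)."""

    def __init__(self) -> None:
        self._collections: dict[str, list[tuple[str, list[float]]]] = {}

    def upsert(self, workspace_id: str, doc_id: str, vector: list[float]) -> None:
        self._collections.setdefault(workspace_id, []).append((doc_id, vector))

    def search(self, workspace_id: str, query: list[float], k: int = 3) -> list[str]:
        # Only the caller's own collection is scanned, so vectors from other
        # workspaces are never candidates and no post-filter exists to get wrong.
        candidates = self._collections.get(workspace_id, [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


store = TenantPartitionedStore()
store.upsert("firm-a", "a-memo", [1.0, 0.0])
store.upsert("firm-b", "b-memo", [1.0, 0.0])  # identical vector, different tenant
print(store.search("firm-a", [1.0, 0.0]))  # ['a-memo']
```

Even though both tenants hold an identical vector, the query for `firm-a` cannot surface `b-memo`, and it also reveals nothing about whether such a vector exists elsewhere.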

Stack

Public signals from engineering JDs and the technical blog. Infrastructure whose vendor is not named (the vector DB, the durable-run engine, OSS model serving) is covered under Likely internals.

Layer	Choice	Evidence
Languages	Python (AI services) + Go (infra/proxy)	core-infra JD (careers)
Frontend	React + TypeScript + TailwindCSS, PWA, internal design system	frontend JD (careers)
Office surfaces	Microsoft Word + Outlook add-ins + web app	frontend JD (careers)
Cloud	Multi-cloud: Azure (preferred) + GCP; multi-region for data residency	core-infra JD (careers)
Orchestration	Kubernetes + container management, networking	core-infra JD (careers)
IaC	Terraform + Pulumi; all vector-DB paths declared as IaC	core-infra JD (careers); embeddings post (embeddings)
Model access	Own model-proxy routing “millions of daily inference requests” across providers	core-infra JD (careers); runtime post (runtime)
Models (routed)	Anthropic + OpenAI + cloud runtimes + self-hosted open-source; newest integrated fast (Fable 5, Opus 4.8, GPT-5.5 preview)	runtime post (runtime); product posts (blog)
Rate limiting / quota	Redis-backed distributed limiting	core-infra JD (careers)
Observability / incident	Datadog, Sentry; PagerDuty, Incident.io	core-infra JD (careers)
Retrieval	vector DB with per-workspace isolation (separate collections/namespaces); semantic + agentic search	embeddings post (embeddings)
Internal eng tooling	GitHub, Linear, Slack, Datadog wired into Spectre	Spectre post (spectre)
Compliance	SOC 2 Type II, ISO 27001/27701/42001, GDPR, CCPA; SAML SSO, audit logs, IP allow-listing	security page (security)

Hard problems

The parts an engineer here works hardest on. Public signal is verified+cited; likely approach is hedged speculation.

Problem	Why it’s hard	Public signal	Likely approach (speculative)
Multi-model without per-model regression	Each provider has different tool-call formats, stop conditions, streaming, sandboxes; a task tuned for one underperforms on another	“an abstraction layer that normalizes the harness, the sandbox, and the behavioral differences beneath a single interface” (runtime)	Per-provider adapters translate native events to a stable internal shape; route by LAB quality/cost per task type; keep prompts model-portable
ZDR for stateful long-running agents	Agents accumulate working memory + checkpoints; managed runtimes persist that = customer data at rest off-prem	“Automatic state persistence and zero retention are mutually exclusive”; transient disk “lifecycle-bound to the sandbox” (runtime)	Own runtime; durable run record in control plane holds only refs, worker state scoped to session and purged on teardown
Embedding reversal on privileged matters	Embeddings preserve structure (reversible); post-filtering leaks membership and is a single point of failure	per-workspace “separate collections and storage”, tenant IDs; access enforced “at the database layer” (embeddings)	Tenant-namespaced vector store, short-lived programmatic creds, IaC-declared access, encrypted tenant-bound caches, anomaly monitoring
Citation at table scale	A 30-col × 1000-doc review table is “30,000 concurrent cells” and lawyers stake licenses on provenance	sentence-level citations “pointing to indices”; “answer and reasoning” fields; benched with “prompt caching and parallel request handling across different models” (review)	Index-anchored sentence citations; per-cell parallelism + caching to hold latency; reasoning surfaced for verifiability
Vision cost at billions of images	Image processing is “roughly 50x more expensive” than text and “90% of those images are not actually necessary”	on-demand tool, text-first gating; candidate pages “narrows a 500-page document down to 2-3 pages in milliseconds” (vision)	Agent-invoked vision tool gated behind text search; dedicated rendering service; tool-description tuning to balance recall vs over-trigger
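
The speculative "per-provider adapters" approach in the first row can be sketched as follows. Everything here is an assumption for illustration: the two event shapes belong to made-up providers (`provider_a`, `provider_b`), not to any real vendor API, and `ToolCall` is a hypothetical internal type. The sketch only shows the idea of translating differing native events into one stable shape that downstream code consumes.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    """Stable internal shape consumed by the rest of the runtime (hypothetical)."""
    name: str
    arguments: dict[str, Any]
    done: bool  # True when the provider signals that the turn has finished


def from_provider_a(event: dict[str, Any]) -> ToolCall:
    # Assumed shape: {"tool": {"name": ..., "input": {...}}, "stop_reason": str | None}
    tool = event["tool"]
    return ToolCall(tool["name"], tool["input"], event.get("stop_reason") == "end_turn")


def from_provider_b(event: dict[str, Any]) -> ToolCall:
    # Assumed shape: {"function": ..., "args_json": "<JSON string>", "finished": bool}
    return ToolCall(event["function"], json.loads(event["args_json"]), bool(event["finished"]))


ADAPTERS: dict[str, Callable[[dict[str, Any]], ToolCall]] = {
    "provider_a": from_provider_a,
    "provider_b": from_provider_b,
}


def normalize(provider: str, event: dict[str, Any]) -> ToolCall:
    """Translate a provider-native event into the internal ToolCall shape."""
    return ADAPTERS[provider](event)


a = normalize("provider_a", {"tool": {"name": "search", "input": {"q": "indemnity"}}, "stop_reason": None})
b = normalize("provider_b", {"function": "search", "args_json": '{"q": "indemnity"}', "finished": False})
print(a == b)  # True
```

Because both events normalize to the same value, code above the adapter layer does not need to know which provider produced the call.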

Likely internals

Harvey names its requirements precisely but not always its vendors. The choices below are inferred from the stack it does name; uncertainty is noted in the Basis column.

Component	Likely choice	Basis
Vector DB vendor	a security-first managed vector store (Turbopuffer/Pinecone-class) or self-managed pgvector/Qdrant with per-tenant namespaces	embeddings post specifies isolation + namespacing requirements but not the product (embeddings)
Durable-run control plane	a durable-execution / workflow engine (Temporal-style) backing run records, checkpoints, and session resume	Spectre describes “durable run”, checkpoints, “control plane appends … restores … session context” — engine unnamed (spectre)
OSS model serving	vLLM/TGI on GPU nodes in Azure/GCP Kubernetes	“open-source models we host ourselves” + K8s/AI-inference infra; serving stack not stated (runtime, careers)
Backend service framework	Python services (FastAPI-style) for AI; Go for the model proxy / infra plane	Python+Go named; web framework not (careers)
Control-plane DB / artifact store	Postgres for run records + object storage (Azure Blob / GCS) for artifacts	standard for the described run/artifact model; not stated
Enterprise auth	external IdP via SAML SSO (+ SCIM provisioning)	“SAML SSO” on security page; vendor not named (security)
Headcount	several hundred	third-party trackers; not stated first-party (Sacra)

Architecture

The cloud agent runtime

A request enters from the web app, a Word/Outlook surface, or a scheduled automation and becomes a durable run record in the control plane — the run, not the worker, is the thing that persists (ownership, history, artifacts, provider session refs). Conflict-aware governance gates which models a matter may even touch. Execution happens in an ephemeral worker inside an isolated sandbox: a harness/abstraction layer normalizes each provider’s harness and events, the model router picks the cheapest model clearing the LAB quality bar (frontier, cloud, or self-hosted OSS), and tools/MCP are injected with short-lived scoped credentials. The sandbox’s transient disk is the ZDR boundary — purged on teardown; durable state is appended back to the run record, never left in the worker. Out come reviewable artifacts and a complete audit trail (runtime, spectre).
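
The governance gate and the model router described above can be read as one selection rule: drop providers the matter may not touch, keep models that clear the quality bar for the task type, then take the cheapest. The sketch below assumes that reading. The model names, costs, and quality scores are invented placeholders (not LAB results), and `ModelOption` and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    provider: str
    cost_per_run: float        # illustrative relative cost, not real pricing
    quality: dict[str, float]  # task type -> benchmark score in [0, 1] (made up)


def route(task_type: str, options: list[ModelOption], min_quality: float,
          blocked_providers: frozenset[str] = frozenset()) -> ModelOption:
    """Return the cheapest permitted model that meets the quality threshold."""
    eligible = [
        m for m in options
        if m.provider not in blocked_providers            # conflicts gate
        and m.quality.get(task_type, 0.0) >= min_quality  # quality bar
    ]
    if not eligible:
        raise LookupError(f"no permitted model meets {min_quality} for {task_type!r}")
    return min(eligible, key=lambda m: m.cost_per_run)


OPTIONS = [
    ModelOption("frontier-x", "lab-1", 1.00, {"extraction": 0.95, "drafting": 0.93}),
    ModelOption("oss-y", "self-hosted", 0.25, {"extraction": 0.94, "drafting": 0.80}),
]

print(route("extraction", OPTIONS, 0.90).name)  # oss-y
print(route("drafting", OPTIONS, 0.90).name)    # frontier-x
```

In this toy data the saturated task (`extraction`) goes to the cheaper self-hosted model, while `drafting` still needs the frontier model. If a matter blocks `lab-1`, the `drafting` request raises `LookupError` rather than silently falling back to a model below the bar.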

Mermaid source

flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef ctrl fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;
  classDef sandbox fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef ai fill:#f3eefe,stroke:#7c3aed,stroke-width:1.5px,color:#0f172a;
  classDef ext fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;

  subgraph Surfaces["Entry surfaces"]
    direction TB
    Web(["Web app · Word / Outlook"]):::io
    Auto(["Automations · cron schedule"]):::io
  end

  subgraph Plane["Control plane (durable)"]
    direction TB
    Run[("Durable run record<br/>ownership · history · artifacts<br/>provider session refs")]:::ctrl
    Gov{"Conflict-aware governance<br/>which models a matter may touch"}:::ctrl
  end

  subgraph Box["Ephemeral worker · isolated sandbox (ZDR boundary)"]
    direction TB
    Harness("Harness / abstraction layer<br/>normalizes provider harness · sandbox · events"):::sandbox
    Disk[("Transient working disk<br/>lifecycle-bound · purged on teardown")]:::sandbox
    Harness --- Disk
  end

  Router{"Model router<br/>cheapest model meeting LAB quality bar"}:::ai

  subgraph Models["Model providers (routed)"]
    direction TB
    Frontier("Frontier labs<br/>Anthropic · OpenAI"):::ai
    Cloud("Cloud runtimes<br/>Azure Foundry · AWS · Google"):::ai
    OSS("Self-hosted open-source<br/>3–5x cheaper for saturated tasks"):::ai
  end

  Tools("Scoped tools · MCP<br/>short-lived creds, injected at run start"):::ext
  Artifacts(["Reviewable artifacts<br/>summaries · diffs · audit trail"]):::io

  Surfaces --> Run
  Run --> Gov
  Gov --> Harness
  Harness --> Router
  Router --> Frontier
  Router --> Cloud
  Router --> OSS
  Harness --> Tools
  Harness -->|"state appended back"| Run
  Run --> Artifacts
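
The split in the diagram between a durable run record and a transient working disk can be sketched with a scratch directory whose lifetime is bound to a context manager. This is a local stand-in under stated assumptions, not Harvey's sandbox: `ephemeral_sandbox`, `execute_run`, the run-record layout, and the `artifact://` reference format are hypothetical, and deleting a temporary directory is weaker than real sandbox teardown.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def ephemeral_sandbox() -> Iterator[Path]:
    """Yield a scratch directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="run-") as workdir:
        yield Path(workdir)
    # The directory and its contents are removed here, even if the body raised.


def execute_run(run_record: dict, document_text: str) -> Path:
    """Work on customer text inside the sandbox; persist only a reference."""
    with ephemeral_sandbox() as disk:
        scratch = disk / "working.txt"
        scratch.write_text(document_text, encoding="utf-8")  # agent work would happen here
        step = len(run_record["history"])
        # Only metadata and a reference go back to the durable record, never the text.
        run_record["history"].append(
            {"step": step, "status": "ok", "artifact_ref": f"artifact://{run_record['id']}/step-{step}"}
        )
    return disk  # returned only so the caller can confirm it no longer exists


record = {"id": "run-1", "history": []}
old_disk = execute_run(record, "privileged client text")
print(old_disk.exists())  # False
```

After the run, the record holds one history entry with a reference, and the working directory that held the document text is gone.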

Document intelligence: isolated RAG with query-time tools

Uploads to Assistant, Vault, or Knowledge are embedded and stored in a per-workspace-isolated vector DB (tenant namespaces, separate collections, encrypted tenant-bound caches). Semantic + agentic search enforces access at the database layer — unauthorized vectors are never retrieved, so there is no post-filter to misconfigure. On top sit query-time tools: review tables (answer + reasoning, sentence-level citations across tens of thousands of concurrent cells) and an on-demand vision tool that is gated text-first and renders only the 2–3 candidate pages it needs. The output is a cited answer a lawyer can verify (embeddings, review, vision).
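
Text-first gating for the vision tool can be illustrated with a small sketch: search the already-extracted page text, and render only the few pages that match. The term-count scoring, the function names, and the `render_page` callback are assumptions made for illustration; the source does not describe how candidate pages are actually ranked.

```python
from typing import Callable


def candidate_pages(page_texts: list[str], query: str, max_pages: int = 3) -> list[int]:
    """Rank pages by the number of words that match a query term; keep the top few."""
    terms = set(query.lower().split())
    scored = []
    for index, text in enumerate(page_texts):
        score = sum(1 for word in text.lower().split() if word in terms)
        if score > 0:
            scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then page order
    return sorted(index for _, index in scored[:max_pages])


def inspect_images(page_texts: list[str], query: str,
                   render_page: Callable[[int], bytes]) -> dict[int, bytes]:
    """Render only the candidate pages instead of the whole document."""
    return {index: render_page(index) for index in candidate_pages(page_texts, query)}


pages = ["boilerplate text"] * 500
pages[41] = "signature block of the seller"
pages[42] = "signature block of the buyer"

rendered: list[int] = []


def fake_render(index: int) -> bytes:
    rendered.append(index)  # stand-in for the expensive rendering call
    return b"image-bytes"


images = inspect_images(pages, "signature block", fake_render)
print(rendered)  # [41, 42]
```

Only 2 of the 500 pages reach the expensive rendering step in this example; the cheap text pass does the narrowing.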

Mermaid source

flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef data fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;
  classDef ai fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef ext fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;

  Upload(["Upload<br/>Assistant · Vault · Knowledge"]):::io

  Embed("Embedding<br/>treated as source-sensitive data"):::ai

  subgraph VDB["Vector DB — per-workspace isolation"]
    direction TB
    NS[("Tenant namespaces<br/>separate collections + storage")]:::data
    Cache[("Tenant-bound caches<br/>encrypted · short-lived keys")]:::data
  end

  Search{"Semantic + agentic search<br/>access enforced at DB layer<br/>(no post-filter)"}:::ai

  subgraph Tools["Query-time tools"]
    direction TB
    Review("Review tables<br/>answer + reasoning<br/>sentence-level citations · 30k cells"):::ai
    Vision("On-demand vision tool<br/>text-first gating · 500p → 2–3p<br/>dedicated rendering service"):::ai
  end

  Answer(["Cited answer<br/>provenance + reasoning, lawyer-verifiable"]):::io

  Upload --> Embed --> NS
  NS --- Cache
  NS --> Search
  Search --> Review
  Search --> Vision
  Review --> Answer
  Vision --> Answer
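
The review-table node above implies one model call per cell, so a 30-column table over 1,000 documents is 30 × 1,000 = 30,000 cells. A minimal sketch of bounded per-cell parallelism with index-anchored citations follows. The `Cell` type, the substring matching that stands in for a model call, and the concurrency limit are all assumptions; the source states only that citations point to indices and that cells carry an answer and reasoning.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Cell:
    doc_id: str
    column: str
    answer: str
    reasoning: str
    citations: list[int]  # indices into that document's sentence list


async def fill_cell(doc_id: str, sentences: list[str], column: str,
                    limit: asyncio.Semaphore) -> Cell:
    async with limit:           # cap the number of in-flight cells
        await asyncio.sleep(0)  # stand-in for an awaited model request
        hits = [i for i, s in enumerate(sentences) if column.lower() in s.lower()]
        answer = sentences[hits[0]] if hits else "not found"
        reasoning = f"{len(hits)} sentence(s) mention {column!r}"
        return Cell(doc_id, column, answer, reasoning, hits)


async def fill_table(docs: dict[str, list[str]], columns: list[str],
                     max_concurrency: int = 8) -> list[Cell]:
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [fill_cell(doc_id, sentences, column, limit)
             for doc_id, sentences in docs.items() for column in columns]
    return await asyncio.gather(*tasks)  # results keep the task order


DOCS = {
    "nda.pdf": ["The term is two years.", "Governing law is New York."],
    "msa.pdf": ["Governing law is Delaware."],
}
cells = asyncio.run(fill_table(DOCS, ["term", "governing law"]))
print(len(cells), cells[1].citations)  # 4 [1]
```

Each cell carries the sentence indices it relied on, so a reviewer can jump from an answer to the exact supporting sentence. The semaphore is what keeps a 30,000-cell table from issuing every request at once.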

Team & process

Founder-led by Winston Weinberg (CEO, ex-litigator) and Gabriel Pereyra (President & Chief Scientist, ex-DeepMind/Meta AI), the engineering org is organized around the runtime: Core Infrastructure, Product Engineering, Frontend, Security, DevEx, and Applied Legal Research, across SF (HQ), New York, and Singapore (careers, runtime, spectre).

Role	Person / team	Source
CEO, co-founder	Winston Weinberg	Wikipedia
President & Chief Scientist, co-founder	Gabriel Pereyra	runtime
Core Infrastructure / Security / DevEx / Frontend	named eng teams	careers, spectre
Applied Legal Researchers (ALRs)	former practicing lawyers embedded in eng	review

Two process traits stand out. First, eval is gated by a privacy wall: “no one on our team sees real customer queries,” so former lawyers (ALRs) author evaluation datasets that mirror production, and changes ship on side-by-side preference + latency + reliability metrics rather than vibes (review). Second, the company dogfoods its own agent runtime — Spectre runs internal engineering work (incident investigation in Slack threads, scheduled cleanup/test-gen via cron, PRs) on the same durable-run/ephemeral-worker architecture it sells, which is how it pressure-tests the security and collaboration model before mapping it onto legal matters (spectre). Stated values: “Decisiveness, Simplicity, and Job’s Not Finished,” in-person/hybrid in SF (careers).

Sources

Reconstructed from public sources only — no insider information. Built primarily from Harvey’s own engineering “Technical Deep Dives” and careers JDs, plus the homepage/security page and the funding announcement; crawled 2026-06-10 via Chrome MCP (logged-out), with one web search/fetch for the funding round. Claim tiers: verified (stated on a public page, linked) · inferred (reasoned from a cited signal) · speculative (best-practice fill-in, labeled). Per-claim quotes are in this repo’s evidence map (evidence/harvey-evidence-map.md).

#	Source	Link
S1	Homepage	https://www.harvey.ai/
S2	Why we Built our own Cloud Agent Infrastructure	https://www.harvey.ai/blog/why-we-built-our-own-cloud-agent-infrastructure
S3	Building Spectre (internal cloud agent platform)	https://www.harvey.ai/blog/building-spectre-internal-collaborative-cloud-agent-platform
S4	How Harvey Secures Embeddings at Scale	https://www.harvey.ai/blog/how-harvey-secures-embeddings-at-scale
S5	Rebuilding the Review Algorithm	https://www.harvey.ai/blog/rebuilding-harveys-review-algorithm
S6	How we Built Image Understanding for Legal Documents	https://www.harvey.ai/blog/building-image-understanding-for-legal-documents
S7	Senior SWE, Core Infrastructure (JD)	https://www.harvey.ai/company/careers/748edfbe-f819-47fd-85bb-3c4974f8913f
S8	Senior SWE, Frontend (JD)	https://www.harvey.ai/company/careers/04e17f81-d0a7-4f83-8526-ec4c9532ddcc
S9	Security & compliance	https://www.harvey.ai/security
S10	Growth round at $11B (GIC + Sequoia)	https://www.harvey.ai/blog/harvey-raises-growth-round-at-dollar11-billion-valuation-co-led-by-gic-and-sequoia
S11	Series E ($300M, $5B)	https://www.harvey.ai/blog/harvey-raises-series-e
S12	Sacra (third-party — revenue/headcount)	https://sacra.com/c/harvey/