Momentic

Momentic is AI-native end-to-end testing: you “describe test behavior in natural language,” and “an AI agent turns your prompts into reliable steps, runs them against your app, and auto-heals brittle locators” (Docs). Tests live in your repo as YAML and run against web, iOS, and Android, pitched as the “modern alternative to Selenium, Cypress, and Playwright” (YC). The interesting part isn’t the natural-language front door; it’s the intent-based step cache underneath, which lets Momentic call an LLM on ~1 step in 20 and replay the other 19 deterministically (95%+ hit rate; ~300ms cached vs >5s uncached). The product is a cache wrapped in an agent (intent blog).
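A quick back-of-the-envelope check on those figures, as a sketch rather than anything Momentic publishes: assume a 95% hit rate, 300 ms per cached step, and 5 s per uncached step (the source says "over 5s", so 5000 ms is a floor). The function name and the simple hit/miss model are illustrative assumptions.

```python
def expected_step_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected latency of one step under a simple two-outcome hit/miss model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Figures quoted in the text; 5000 ms is a lower bound ("over 5s").
blended = expected_step_latency_ms(hit_rate=0.95, hit_ms=300.0, miss_ms=5000.0)
speedup = 5000.0 / blended
print(f"blended: {blended:.0f} ms per step, {speedup:.1f}x faster than an LLM call on every step")
```

Under these assumptions the blended figure is about 535 ms per step, and the 5% of steps that miss contribute 250 ms of it, which is why the hit rate matters at least as much as the speed of the cached path.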

Vitals: founded 2023 · YC W24 · $15M Series A (Standard Capital) + $3.7M seed · ~12 people · SF (on-site).

Business context — founders, funding, customers

Founders Wei-Wei Wu (CEO — ex-Assembled, founding engineer at Nashi → acq. Density 2021, staff engineer at Density) and Jeff An (ex-Splunk/Google; led testing at Robinhood and enterprise quality at Retool; U. Waterloo) (YC) — “two engineers who dreaded testing so much we founded a company to do it for us.”
Series A: $15M led by Standard Capital, with Dropbox Ventures and existing investors (Y Combinator, FCVC, Transpose Platform, Karman Ventures), on top of a $3.7M seed in March 2025 (TechCrunch).
2,600 users across “1000+ engineer organizations” — Notion, Xero, Bilt, Webflow, Retool, Quora, plus Pocus, Nuvo, Mutiny, CoverGo, Coframe, GPTZero (TechCrunch, intent blog, home).
Wu estimates Momentic “automated more than 200 million test steps” in the last month (TechCrunch).

The heavy lifting

A locator is a compiled multi-signal matcher. A step’s NL description resolves once into stored signals — on-screen position, appearance, text, accessibility + structural attributes — plus validity conditions; replay matches those against the live page with no LLM call, so inference runs on ~1 step in 20 (intent blog, step cache).
Invalidation keys on intent, not DOM identity. The cache busts when the element no longer satisfies the attributes / related-elements the user named — not when the DOM node changes — so randomized classnames and restructures don’t bust it, but a renamed semantic does (intent blog).
The cache is an OLTP lookup on ClickHouse. Keyed by test / step / version / branch / commit and served via a sparse primary index + materialized view at ~250ms over ~20B entry-touches/day: an OLAP engine repurposed for write-heavy key-value lookups after Postgres hit lock contention (ClickHouse blog).

Stack

A TypeScript-first CLI, tests as YAML in git, and a ClickHouse cache plane. Every row is named in a first-party doc, repo, or engineering post.

Layer	Choice	Evidence
Languages	TypeScript (primary), Python	GitHub org top languages
Distribution	npm CLI — `npx momentic`, CLI-first; cloud authoring deprecated	Docs, config
Test format	YAML in the repo (`.test.yaml`, `.module.yaml`)	How it works, config
Editor	CodeMirror + TypeScript (low-code local editor)	codemirror-ts fork
Cache store	ClickHouse (`ReplacingMergeTree`, sparse PK, materialized view) — migrated off Postgres + Redis	ClickHouse blog
Browser automation	Chromium driver (Playwright-class), local or managed runner	Docs, Playwright cmp
Mobile	iOS simulators · Android emulators, remote-hosted (regioned)	Docs, config
LLM layer	managed, multi-provider with cross-provider failover (models unnamed)	Playwright cmp, AI config
Coding-agent integration	Claude Agent SDK skill (`npx skills add momentic-ai/skills`) + MCP	GitHub skills, Docs
CI targets	GitHub Actions · CircleCI (orb) · Bitrise	Docs, orb repo
Execution	managed, multi-region runner	Playwright cmp

The in-product agents run on “latest 2025 models” but Momentic never names the provider — the model layer is “managed; cross-provider failover handled by the platform” (Playwright cmp, AI config). The one verified Anthropic touchpoint is the open-source skills repo, “Claude Agent SDK with a E2E testing tool” (GitHub).
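Since the providers are unnamed, the only thing that can be sketched here is the generic shape of "cross-provider failover": try providers in order and fall through on error. Everything below (`complete_with_failover`, the stub providers, the provider names) is hypothetical and stands in for real SDK clients; it is not Momentic's router.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errors out."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Return (provider_name, completion) from the first provider that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers (assumptions) standing in for real model clients.
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("upstream timed out")


def healthy_provider(prompt: str) -> str:
    return f"resolved: {prompt}"


used, answer = complete_with_failover(
    "Click the Sign in button",
    [("provider-a", flaky_provider), ("provider-b", healthy_provider)],
)
```

Here the first provider times out, so the call is served by `provider-b`. A production router would likely add retries, timeouts, and health tracking on top of this ordering.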

Hard problems

The parts an engineer at this company loses sleep over. The public-signal column is cited (verified); the likely-approach column is labeled speculation: hedged, best-practice fill-in.

Problem	Why it’s hard	Public signal	Likely approach (speculative)
Flaky tests / cache correctness	NL intent is ambiguous; a cache too strict busts on cosmetic change, too loose grabs the wrong element; branches and CLI versions pollute a shared cache	Four documented failure modes; “1M potential flakes across 200M resolutions” (Feb 2026); 95%+ hit rate (intent blog)	Intent conditions (attributes + related elements) from the locator agent; per-branch/version isolation with merge-base seeding — already shipped, now tuning SVG/icon and relativity checks
Inference cost + latency	An LLM call on every step costs ~5s each and gets expensive across 2M+ resolves/day	“300ms cached vs over 5s uncached”; LLM fires only on cache miss (intent blog, how it works)	Aggressive caching as the default path; small specialized agents per task; cap agentic plan depth; only the heal path pays for inference
Cache storage at scale	~20B entry-touches/day, high concurrent read+write, query cost must not grow with data	Postgres lock contention at ~1B entries → ClickHouse; ~250ms avg (ClickHouse blog)	ClickHouse `ReplacingMergeTree` + sparse PK + materialized-view of commit timestamps; insert-only TTL; async dedupe
Testing non-deterministic apps	Gen-AI products don’t return the same output twice, so string-match assertions fail	Poe/Quora case: validate “AI chatbot responses, even when they weren’t deterministic” (home); `assert`/`assertVisually` are agent-scored (Playwright cmp)	Assertion + visual-assertion agents reason over intent (“chart is visible and not cut off”) rather than literal text; never-cache AI-evaluated steps

Likely internals

The infrastructure Momentic doesn’t name publicly, inferred from the stack it does:

Component	Likely choice	Basis
LLM providers	OpenAI + Anthropic + Google, routed	“cross-provider failover” (Playwright cmp); Anthropic confirmed for the skill (GitHub skills); failover implies ≥2 frontier vendors
App-graph embeddings	a hosted embedding API (OpenAI/Cohere-class) over per-state semantic summaries	states are fingerprinted via a minhashed DOM view, and a semantic summary of each is “embedded” and clustered (app graph); no in-house model signal on a ~12-person team
Mobile runner hosting	a managed device cloud or self-run emulators on a cloud	emulators are “remote-hosted” and regioned (config); provider not named
Run-artifact store	S3-class object storage for videos/traces	dashboard serves “run videos, traces, network” (Playwright cmp); object storage is the default for this
Control-plane DB	Postgres (retained for app/org/auth data after the cache moved to ClickHouse)	they “eliminate[d] the Redis layer” but only moved cache off Postgres (ClickHouse blog); relational data likely stays
Hosting	a major cloud (AWS or GCP) with managed ClickHouse	multi-region runner + ClickHouse at this scale (Playwright cmp, ClickHouse blog); managed ClickHouse Cloud is the low-ops path for ~12 people
Auth	enterprise SSO (SAML/OIDC), API keys	“custom SSO” offered (YC); `MOMENTIC_API_KEY` for CLI auth (config)

Architecture

The agent loop: cache first, LLM on miss

A step’s life is prompt → context → action → verify → cache → replay → heal. The agent “reads the page (DOM, accessibility tree, screenshot),” picks an element, acts, waits for “the network and DOM to settle,” then writes the resolved locator to cache. “On the next run, Momentic replays from cache, no LLM call, until something changes” — and only when “the cached locator misses, auto-heal uses the AI agent to find the element again and updates the cache” (How it works). This is the inversion that controls both cost and latency: the LLM is invoked “only when it’s actually needed.”
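The cache-first control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Momentic's implementation: `Locator`, `find`, `resolve_step`, and `fake_llm` are hypothetical names, the page is modeled as a list of dicts, and the act and settle phases are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Locator:
    text: str
    role: str


Page = list[dict]  # each dict is one element, e.g. {"text": "Sign in", "role": "button"}


def find(page: Page, loc: Locator) -> Optional[dict]:
    """Deterministic replay: match stored signals against the live page, no LLM."""
    for el in page:
        if el.get("text") == loc.text and el.get("role") == loc.role:
            return el
    return None


def resolve_step(step: str, page: Page, cache: dict[str, Locator],
                 llm_locate: Callable[[str, Page], Locator]) -> tuple[dict, bool]:
    """Return (element, used_llm). The LLM runs only when the cache misses."""
    cached = cache.get(step)
    if cached is not None:
        el = find(page, cached)
        if el is not None:
            return el, False            # cache hit: replay
    healed = llm_locate(step, page)     # miss: auto-heal through the agent
    el = find(page, healed)
    if el is None:
        raise LookupError(f"could not resolve step: {step!r}")
    cache[step] = healed                # write back so the next run replays
    return el, True


calls = []

def fake_llm(step: str, page: Page) -> Locator:
    """Stand-in for the locator agent: picks the first button on the page."""
    calls.append(step)
    el = next(e for e in page if e["role"] == "button")
    return Locator(text=el["text"], role="button")


cache: dict[str, Locator] = {}
step = "Click the Sign in button"
page_v1 = [{"text": "Sign in", "role": "button"}]
_, first_used_llm = resolve_step(step, page_v1, cache, fake_llm)    # cold cache
_, second_used_llm = resolve_step(step, page_v1, cache, fake_llm)   # replay
page_v2 = [{"text": "Log in", "role": "button"}]                    # label changed
_, third_used_llm = resolve_step(step, page_v2, cache, fake_llm)    # heal
```

Across the three runs the stub agent is called twice: once on the cold cache and once when the cached locator stops matching the page.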

Mermaid source

```mermaid
flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef agent fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef cache fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;
  classDef data fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;

  Step(["NL step<br/>'Click the Sign in button'"]):::io

  subgraph Resolve["Resolve a step"]
    direction TB
    Hit{"Step cache hit?<br/>signals match live page?"}:::cache
    Replay("Replay from cache<br/>~300ms · no LLM call"):::cache
    Heal("Auto-heal: locator agent<br/>re-resolves NL vs DOM + a11y + screenshot<br/>~5s · 1 LLM completion"):::agent
  end

  subgraph Act["Act + verify"]
    direction TB
    Do("Issue action<br/>click · type · scroll · check"):::agent
    Settle("Stability check<br/>wait for network + DOM to settle"):::data
  end

  Save[("Write resolved locator<br/>+ intent conditions to step cache")]:::cache
  Done(["Step done"]):::io

  Step --> Hit
  Hit -->|hit ~95%| Replay --> Do
  Hit -->|miss| Heal --> Do
  Do --> Settle --> Save --> Done
  Save -. "next run" .-> Hit
```

A cached step “stores more than one way to find its target: where the element sits on screen, what it looks like, what text it contains, and the accessibility and structural attributes around it” — a multi-modal locator. Which signals matter “is inferred from the step’s natural-language description”: “the red Cancel button below the Order Summary header” leans visual+positional; “the Sign in button” leans accessibility+text (step cache, Playwright cmp). Step-based tests are “deterministic and fast”; the act primitive runs agentic flows where “you give Momentic a goal, and an AI agent figures out the steps on the fly” — and the V3 act agent is “planner-style … drafts the full flow up front, caches the resolved steps … and self-heals” (agentic).
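One way to picture a multi-signal locator is as a weighted match over stored signals, with the weights chosen from the step description. The sketch below is a toy under loud assumptions: the keyword heuristic stands in for whatever inference Momentic actually runs, and `Signals`, `signal_weights`, `score`, `best_match`, the weights, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    text: str
    role: str
    color: str
    y: int  # vertical position in pixels


VISUAL_WORDS = {"red", "blue", "green", "below", "above", "left", "right"}


def signal_weights(description: str) -> dict[str, float]:
    """Toy stand-in for inference: visual or positional words shift weight off text and role."""
    if set(description.lower().split()) & VISUAL_WORDS:
        return {"text": 0.2, "role": 0.1, "color": 0.4, "position": 0.3}
    return {"text": 0.5, "role": 0.4, "color": 0.05, "position": 0.05}


def score(stored: Signals, live: Signals, weights: dict[str, float], y_tol: int = 50) -> float:
    """Sum the weights of the signals that still agree between cache and live page."""
    matches = {
        "text": stored.text == live.text,
        "role": stored.role == live.role,
        "color": stored.color == live.color,
        "position": abs(stored.y - live.y) <= y_tol,
    }
    return sum(weights[name] for name, ok in matches.items() if ok)


def best_match(stored: Signals, candidates: list[Signals], description: str,
               threshold: float = 0.7) -> Optional[Signals]:
    """Pick the highest-scoring live element, or None (a cache miss) below the threshold."""
    if not candidates:
        return None
    weights = signal_weights(description)
    top_score, top = max(((score(stored, c, weights), c) for c in candidates),
                         key=lambda pair: pair[0])
    return top if top_score >= threshold else None


stored = Signals("Cancel", "button", "red", 400)
header_cancel = Signals("Cancel", "button", "grey", 120)
summary_cancel = Signals("Cancel", "button", "red", 410)
description = "the red Cancel button below the Order Summary header"
picked = best_match(stored, [header_cancel, summary_cancel], description)
```

With two Cancel buttons on the page, text and role alone cannot separate them; the visual and positional weights select `summary_cancel`, and a page containing only `header_cancel` falls below the threshold and would be treated as a miss.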

The cache plane: an OLAP database doing OLTP work

The hard engineering is in the cache store. Adding signals to the key took Momentic from “around 80k active cache entries to now approximately 1B”, and the original “single table in Postgres … started to show cracks”: “lock contention from queries trying to read and write to the cache concurrently” (ClickHouse blog). They moved the store to ClickHouse, exploiting its sparse primary index: the cache is keyed by “test ID, step ID, Momentic version, git branch, and commit timestamp,” so a known-key lookup “narrow[s] down the search space to just a few granules” instead of a B-tree scan that grows with data.
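To see why a known-key lookup touches only a few granules, here is a small model of a sparse primary index: rows are sorted by the composite key, only the first key of each granule is indexed, and a lookup binary-searches those marks before scanning one granule. This is an illustrative sketch, not ClickHouse code; `build_sparse_index` and `lookup` are hypothetical names, keys are assumed unique, and the granule size is tiny for readability (ClickHouse's default `index_granularity` is 8192 rows).

```python
from bisect import bisect_right

GRANULE_SIZE = 4  # tiny for readability; ClickHouse defaults to 8192 rows per granule


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the first key of each granule (the index 'marks')."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]


def lookup(sorted_rows, marks, key, granule_size=GRANULE_SIZE):
    """Binary-search the marks, then scan a single granule. Returns (value, rows_scanned)."""
    g = bisect_right(marks, key) - 1
    if g < 0:
        return None, 0  # key sorts before the first mark
    start = g * granule_size
    granule = sorted_rows[start:start + granule_size]
    for scanned, (row_key, value) in enumerate(granule, start=1):
        if row_key == key:
            return value, scanned
    return None, len(granule)


# Composite key: (test ID, step ID, Momentic version, git branch, commit timestamp).
rows = sorted(
    ((f"test-{t}", f"step-{s}", "v3", "main", 1700000000), f"locator-{t}-{s}")
    for t in range(1, 4) for s in range(1, 5)
)
marks = build_sparse_index([k for k, _ in rows])
value, scanned = lookup(rows, marks, ("test-2", "step-3", "v3", "main", 1700000000))
```

The index holds 3 marks for 12 rows, and the lookup scans 3 rows inside one granule regardless of how many other granules exist. In the real engine, duplicate keys or a partial key prefix can widen the read to a few granules.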

Mermaid source

```mermaid
flowchart LR
  classDef io fill:#fdf4e8,stroke:#d97706,stroke-width:1.5px,color:#0f172a;
  classDef agent fill:#eafbf1,stroke:#16a34a,stroke-width:1.5px,color:#0f172a;
  classDef cache fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;
  classDef data fill:#e8f1fd,stroke:#2563eb,stroke-width:1.5px,color:#0f172a;
  classDef old fill:#fdecec,stroke:#e0564f,stroke-width:1.5px,color:#0f172a;

  CLI(["CLI run · local or CI"]):::io

  subgraph Key["Cache key (composite)"]
    direction TB
    K("test ID · step ID<br/>CLI version · git branch<br/>commit timestamp"):::data
  end

  subgraph Intent["Intent conditions (locator agent emits)"]
    direction TB
    Attr("Attributes<br/>text · color · role · arbitrary HTML"):::agent
    Rel("Related elements<br/>'login above sign-up'"):::agent
  end

  subgraph CH["ClickHouse · cache plane"]
    direction TB
    RMT[("ReplacingMergeTree<br/>sparse primary index · insert-only TTL")]:::cache
    MV[("Materialized view<br/>available commit timestamps per test")]:::cache
    RMT --- MV
  end

  Old["was: single Postgres table + Redis<br/>lock contention at ~1B entries"]:::old

  CLI -->|"resolve query"| Key
  Key --> CH
  Intent --> RMT
  CH -->|"~250ms avg · 95%+ hit"| CLI
  Old -. "migrated: double-write -> double-read check -> cutover" .-> CH
```

Two ClickHouse-native moves carry the design. First, main-branch scans still read “500k+ rows,” so they added “a materialized view to precompute all of the available commit timestamps for a given test ID,” narrowing reads back to “one or two parts.” Second, because “2/3 queries are updates, which aren’t very performant” in ClickHouse, they went insert-only: SELECT, re-INSERT used caches to extend TTL, INSERT new caches, “and let ClickHouse take care of deduplicating entries asynchronously” via ReplacingMergeTree. That was “such an improvement that we were able to fully eliminate the Redis layer.” The migration itself ran as double-write → double-read consistency check → gradual cutover (ClickHouse blog). Result: “over two million cache queries per day, processing almost 20 billion cache entries every day while maintaining ~250ms resolution latency on average.”
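The insert-only pattern can be modeled in memory to show why it avoids updates: a TTL extension is just a fresh insert of the row that was read, reads pick the newest live row per key, and a background merge collapses duplicates later. This is a simplified sketch of ReplacingMergeTree-style behavior, not Momentic's schema; `Row`, `InsertOnlyCache`, and the integer clock are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: tuple
    locator: str
    inserted_at: int  # plays the role of the version column
    expires_at: int


class InsertOnlyCache:
    def __init__(self, ttl: int):
        self.ttl = ttl
        self.rows: list[Row] = []

    def insert(self, key, locator, now):
        """Append only; never mutate an existing row."""
        self.rows.append(Row(key, locator, now, now + self.ttl))

    def select(self, key, now):
        """Newest unexpired row for the key; duplicates may exist until a merge runs."""
        live = [r for r in self.rows if r.key == key and r.expires_at > now]
        return max(live, key=lambda r: r.inserted_at, default=None)

    def touch(self, key, now):
        """Extend the TTL by re-inserting the row just read, instead of an UPDATE."""
        hit = self.select(key, now)
        if hit is not None:
            self.insert(key, hit.locator, now)
        return hit

    def merge(self, now):
        """Stand-in for the asynchronous merge: keep the newest row per key, drop expired."""
        newest = {}
        for r in self.rows:
            if r.expires_at <= now:
                continue
            if r.key not in newest or r.inserted_at >= newest[r.key].inserted_at:
                newest[r.key] = r
        self.rows = list(newest.values())


cache = InsertOnlyCache(ttl=100)
key = ("test-1", "step-1", "v3", "main", 1700000000)
cache.insert(key, "loc-a", now=0)
cache.touch(key, now=60)              # read + re-insert: TTL now runs to 160
rows_before_merge = len(cache.rows)   # two physical rows for one logical entry
still_live = cache.select(key, now=120)
cache.merge(now=120)
rows_after_merge = len(cache.rows)
```

The entry survives past its original expiry at 100 because the read at 60 re-inserted it, and the duplicate disappears only when the merge runs. Reads therefore have to tolerate duplicates in the meantime, which is the trade the insert-only design makes.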

Intent, not selectors

The reliability claim hinges on caching user intent rather than a DOM snapshot. The earlier “does this look like the element we saw before?” check failed four ways at scale: cross-branch pollution, cross-version pollution, false misses (randomized classnames bust the cache), and false hits (nth-child selectors grab the wrong row when order changes) (intent blog). The fix: the locator agent now “classif[ies] which attributes it used in its reasoning” and emits two condition types — attributes (“text, color, or any arbitrary HTML attribute”) and related elements (“the login button above the sign up button”). The question became “does this element still match what the user meant?” — so “the blue button” strictly enforces blue. Branch/version isolation was solved by git-aware cache seeding: new branches “seed from the cache at their merge base,” and merges fold the branch cache back into main (step cache, intent blog).
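The shift from "same DOM node?" to "still what the user meant?" can be illustrated with the two condition types named above. The following is a sketch under assumptions: elements are plain dicts, `IntentConditions` and `still_matches` are hypothetical names, and "above" is approximated by comparing a `y` coordinate where smaller means higher on the page.

```python
from dataclasses import dataclass, field


@dataclass
class IntentConditions:
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"color": "blue"}
    above: list[str] = field(default_factory=list)            # texts the target must sit above


def still_matches(element: dict, page: list[dict], cond: IntentConditions) -> bool:
    """True if the element satisfies every condition the user's description implied."""
    for name, expected in cond.attributes.items():
        if element.get(name) != expected:
            return False
    for other_text in cond.above:
        other = next((e for e in page if e.get("text") == other_text), None)
        if other is None or element["y"] >= other["y"]:  # smaller y = higher on the page
            return False
    return True


# "the blue Log in button above the Sign up button"
cond = IntentConditions(attributes={"text": "Log in", "color": "blue"}, above=["Sign up"])
signup = {"text": "Sign up", "y": 160}
original = {"text": "Log in", "color": "blue", "class": "btn-x9f2", "y": 100}

reclassed = {**original, "class": "btn-a71c"}  # randomized classname: not part of the intent
recolored = {**original, "color": "green"}     # violates the named attribute
moved = {**original, "y": 220}                 # now below Sign up

results = {
    "reclassed": still_matches(reclassed, [reclassed, signup], cond),
    "recolored": still_matches(recolored, [recolored, signup], cond),
    "moved": still_matches(moved, [moved, signup], cond),
}
```

Only the classname change keeps the cache entry valid. Because `class` never appears in the conditions, a build that reshuffles classnames cannot cause a false miss, while a recolor or reorder that contradicts the description does invalidate the entry.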

Healing as a code change

Healing comes in two tiers. In-run auto-heal re-resolves locators and waits for stability, persisting fixes only as cache entries when the run is eligible to save cache (auto-heal). The post-run triage agent (momentic ai triage / heal) “permanently rewrites the failing tests, and opens a pull request (or emits a patch),” respecting the repo’s PULL_REQUEST_TEMPLATE.md. A separate app graph models coverage from run traces: each UI state is “fingerprinted (canonical URL plus a normalized, minhashed view of the DOM),” a semantic summary is “embedded,” and states cluster into “product areas, features, journeys, variants” to show which flows are Covered / Partial / Missing (app graph).
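The state fingerprint can be sketched with a textbook MinHash over a normalized DOM. The docs do not say how Momentic normalizes or hashes, so the choices here are assumptions: normalization keeps only the sequence of opening tag names, shingles are 3-tag windows, and `dom_shingles`, `minhash`, `similarity`, and `fingerprint` are hypothetical names.

```python
import hashlib
import re


def dom_shingles(html: str, k: int = 3) -> set[str]:
    """Normalize to the sequence of opening tag names, then take k-tag shingles."""
    tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]
    if len(tags) < k:
        return {" ".join(tags)}
    return {" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def minhash(shingles: set[str], num_hashes: int = 64) -> tuple[int, ...]:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(2, "big")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingles
        ))
    return tuple(signature)


def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of agreeing slots, which estimates the Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fingerprint(canonical_url: str, html: str) -> tuple[str, tuple[int, ...]]:
    return canonical_url, minhash(dom_shingles(html))


orders_a = "<div class='a1'><h1>Orders</h1><ul><li>One</li><li>Two</li></ul><button>Pay</button></div>"
orders_b = "<div class='z9'><h1>Orders</h1><ul><li>Red</li><li>Tea</li></ul><button>Pay</button></div>"
login = "<main><form><input><input><button>Go</button></form></main>"

_, sig_a = fingerprint("/orders", orders_a)
_, sig_b = fingerprint("/orders", orders_b)
_, sig_login = fingerprint("/login", login)
same_state = similarity(sig_a, sig_b)
different_state = similarity(sig_a, sig_login)
```

The two orders pages differ in classnames and list text but share a tag structure, so their signatures agree in every slot, while the login page shares no shingles with them. A real pipeline would presumably normalize more carefully (for example, handling repeated list items) before hashing.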

Team & process

Two founders, ~12 people at the Series A; San Francisco, on-site (YC).

Role	Person	Source
Co-founder / CEO	Wei-Wei Wu (ex-Assembled; founding eng at Nashi → acq. Density 2021; staff eng at Density)	YC
Co-founder	Jeff An (ex-Splunk, Google; led testing at Robinhood, enterprise quality at Retool)	YC
Engineering	Henry Haefliger (author of the caching engineering posts)	ClickHouse blog, intent blog

The founder DNA is testing and reliability at scale — Jeff An “led testing at Robinhood and enterprise quality at Retool”; Wu led “product reliability” at Density (YC). The product philosophy is tests-as-code, engineer-owned: Momentic is “CLI-first … authoring and running tests in the cloud is deprecated” (Docs), tests are YAML in the repo, and the company markets “a migration … from outsourced QA to engineering-owned tests” (blog). Cache eligibility is git-aware (CI always saves; local saves only off main/protected branches), and healing is wired into the SCM workflow — a successful heal can open a PR, draft PR, direct commit, patch, or leave changes on disk (step cache, auto-heal). The stated creed: “truth-driven development … you cannot verify what you cannot reason,” keeping behavioral tests green “at Cursor speed” (blog). Open roles are GTM (founding AE/SDR) plus a “Founding Engineer (Frontend)” — a sales-led growth phase on a still-tiny eng team (Ashby, YC).
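The git-aware eligibility rule mentioned above (CI always saves; local runs save only when off main or protected branches) reduces to a small predicate. This is one reading of that parenthetical, written as a sketch: `may_save_cache` and its parameters are hypothetical and do not correspond to documented Momentic configuration keys.

```python
def may_save_cache(is_ci: bool, branch: str,
                   protected: frozenset[str] = frozenset({"main"})) -> bool:
    """CI runs always save; local runs save only when not on a protected branch."""
    if is_ci:
        return True
    return branch not in protected


decisions = {
    "ci on main": may_save_cache(True, "main"),
    "local on main": may_save_cache(False, "main"),
    "local on feature": may_save_cache(False, "feature/checkout"),
}
```

The effect is that a developer experimenting locally cannot overwrite the shared main-branch cache, while CI remains the source of truth for it.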

Sources

Reconstructed from public sources only — no insider information. Crawled 2026-06-08 via Chrome MCP (logged-out browsing) + the public docs, engineering blog, GitHub org, Ashby board, and YC profile. Claim tiers: verified (stated on a public page, linked) · inferred (reasoned from a cited signal, confidence flagged) · speculative (best-practice fill-in, labeled). Links are live; pages change, so the supporting quote for each claim is kept in this repo’s evidence map (evidence/momentic-evidence-map.md).

#	Source	Link
S1	Homepage	https://momentic.ai/
S2	Docs — Welcome	https://momentic.ai/docs
S3	Docs — How Momentic works	https://momentic.ai/docs/get-started/how-momentic-works
S4	Docs — Step caching	https://momentic.ai/docs/reliability/step-cache
S5	Docs — Auto-healing	https://momentic.ai/docs/reliability/auto-heal
S6	Docs — Agentic testing	https://momentic.ai/docs/core-concepts/agentic-testing
S7	Docs — Finding elements	https://momentic.ai/docs/core-concepts/finding-elements
S8	Docs — App graph	https://momentic.ai/docs/ai/app-graph
S9	Docs — Memory	https://momentic.ai/docs/ai/memory
S10	Docs — momentic.config.yaml	https://momentic.ai/docs/configuration/momentic-config
S11	Docs — AI configuration	https://momentic.ai/docs/configuration/ai
S12	Docs — vs Playwright	https://momentic.ai/docs/comparisons/playwright
S13	Blog — Postgres → ClickHouse	https://momentic.ai/blog/postgres-to-clickhouse-migration
S14	Blog — Intent-based caching	https://momentic.ai/blog/teaching-browser-agents-user-intent
S15	Blog index	https://momentic.ai/blog
S16	GitHub org (momentic-ai)	https://github.com/momentic-ai
S17	GitHub — skills (Claude Agent SDK)	https://github.com/momentic-ai/skills
S18	Ashby job board	https://jobs.ashbyhq.com/momentic
S19	Y Combinator profile	https://www.ycombinator.com/companies/momentic
S20	TechCrunch — $15M Series A	https://techcrunch.com/2025/11/24/momentic-raises-15m-to-automate-software-testing/