An agent shipped a payment-retry fix. Its one end-to-end test passed, the readiness gate went green, the intent moved to Shipped. Seven days later the retry was firing twice on slow networks — checkout conversion was down 4% and no test caught it, because the test only proved the happy path ran.
The spec had a Verification section. It just had a test list, not a loop. The gate had asked the only question it knows how to ask — does at least one check exist? — and never asked the one that mattered: how will we know, in production, that the intent actually held?
Key Takeaway: A test list proves the code runs. A feedback loop proves the intent held.
The green-tests trap
Pathmode's readiness gate marks verification "met" the moment a single check exists — one manual check, one unit test, or one e2e test is enough to flip it green. That's deliberate: the gate's job is to unblock shipping, not to grade your verification. But it means passing the gate is not running the loop.
The tradeoff is the whole point of this guide. The thin gate pushes the real work — deciding what's worth proving — onto you. A test list lets you off the hook the moment something is green. A loop doesn't close until the intent is confirmed in the world.
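To see how thin the gate is, the rule described above can be sketched in a few lines of Python. The three bucket names follow the spec's verification field; the function name, the dictionary shape, and the sample test name are assumptions made for this sketch, not Pathmode's actual implementation.

```python
# Hypothetical sketch of the readiness gate's verification rule.
# The bucket names come from the spec; everything else is assumed.
VERIFICATION_BUCKETS = ("manualChecks", "unitTests", "e2eTests")


def verification_met(verification: dict) -> bool:
    """Return True as soon as any bucket holds at least one check."""
    return any(verification.get(bucket) for bucket in VERIFICATION_BUCKETS)


# The payment-retry spec from the opening: one e2e test, nothing else.
# The test name is illustrative.
retry_spec = {"e2eTests": ["retry succeeds after a failed charge"]}

print(verification_met(retry_spec))  # True: the gate is green
print(verification_met({}))          # False: no check exists at all
```

Nothing in this rule looks at what the check covers, which is why a green gate says so little about whether the intent held.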
"Did it run" vs. "did the intent hold"
A test answers a narrow question: did this code path execute as written? Useful — and it's the How layer's job. The tools own the tests.
Verification-as-a-loop answers a different question: after the agent walked away, is the user problem you started from actually solved, and is everything around it still standing? That's the Why layer's job, and it's the part no passing test can answer for you.
This is the line the opening incident crossed. The retry test was green. The intent (users complete payment without being double-charged) had quietly broken. Context isn't judgment: a passing test is context; choosing what is worth proving is judgment.
The four legs of the loop
A complete loop answers four questions. Map them onto what the spec actually carries:
- The fastest check. The unit or e2e test that catches a regression in seconds. Lives in verification.unitTests/verification.e2eTests.
- The manual fallback. What a human runs when automation can't reach: a flow you click through, a query you eyeball. Lives in verification.manualChecks.
- The observable signal in production. The metric, log, or event that confirms it shipped correctly, not just that it shipped: conversion held, error rate flat, the p95 you promised. This one has no field. You state it in the conversation and run it after ship; it becomes the manual or monitoring check you actually watch.
- What must not regress. The adjacent flow this change could quietly break. Lives in Health Metrics, the spec's "what must not degrade" field.
Legs 1 and 2 fit the three buckets of the verification field. Leg 4 is Health Metrics. Leg 3 is the one with no home in the schema, which is exactly why teams skip it, and exactly why the payment fix shipped with a 4% conversion drop nobody was watching.
💡 Tip: If every leg of your loop is a unit test, you've written a test list. A real loop almost always has at least one check that runs after the merge.
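One way to make the mapping concrete is a small checker that reports which legs a spec leaves empty. This is a sketch under assumptions: the verification bucket names come from the spec, while the healthMetrics key, the function name, and the sample spec are hypothetical. Leg 3 has no schema field, so it is passed in separately here.

```python
from typing import List, Optional


def missing_legs(spec: dict, production_signal: Optional[str] = None) -> List[str]:
    """List the legs of the loop that the spec leaves unanswered."""
    verification = spec.get("verification", {})
    missing = []
    # Leg 1: the fastest check lives in unitTests or e2eTests.
    if not (verification.get("unitTests") or verification.get("e2eTests")):
        missing.append("fastest check")
    # Leg 2: the manual fallback lives in manualChecks.
    if not verification.get("manualChecks"):
        missing.append("manual fallback")
    # Leg 3: no schema field, so it is supplied from the conversation.
    if not production_signal:
        missing.append("production signal")
    # Leg 4: what must not regress lives in Health Metrics
    # ("healthMetrics" is an assumed key name).
    if not spec.get("healthMetrics"):
        missing.append("must not regress")
    return missing


# The opening incident: a single e2e test and nothing else.
retry_spec = {"verification": {"e2eTests": ["retry succeeds after a failed charge"]}}
print(missing_legs(retry_spec))
# ['manual fallback', 'production signal', 'must not regress']
```

The spec passes the gate with one leg filled, and the checker shows the three it never answered.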
Health Metrics carry the regression leg
Outcomes say what should change. Health Metrics say what must not. The agent optimizes hard for the outcomes you handed it — so name the things it could break getting there, or it won't know to protect them.
This isn't abstract: the in-app Verification Checkpoint builds its checklist from three sources — your outcomes, your Constitution rules, and your health metrics. A health metric you never wrote is a regression nobody is asked to confirm. (For what the Health Metrics and Verification fields hold, see The Anatomy of an Agent-Ready Spec — this guide is about operating them, not defining them.)
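To illustrate why an unwritten health metric never gets confirmed, here is a minimal sketch of a checklist assembled from the three sources named above. The key names (outcomes, constitutionRules, healthMetrics), the function name, and the sample entries are assumptions for illustration; the source does not describe how the Verification Checkpoint is implemented.

```python
from typing import Dict, List

# Assumed key names for the three sources the checkpoint draws on.
SOURCES = ("outcomes", "constitutionRules", "healthMetrics")


def build_checklist(intent: dict) -> List[Dict[str, object]]:
    """Flatten the three sources into unchecked checklist items."""
    return [
        {"source": source, "item": item, "held": False}
        for source in SOURCES
        for item in intent.get(source, [])
    ]


# Illustrative intent: no healthMetrics were written, so nothing
# about checkout conversion can appear on the checklist.
intent = {
    "outcomes": ["Failed payments are retried once"],
    "constitutionRules": ["Never charge a card twice for one order"],
}

for entry in build_checklist(intent):
    print(entry["source"], "->", entry["item"])
```

The checklist can only ask about what the intent carries, so the missing metric produces no item for anyone to confirm or leave unchecked.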
Close the loop inside the workspace
A loop that only opens is a wish. Pathmode gives you two ways to close it:
- The human Verification Checkpoint — walk the checklist (outcomes, constitution rules, health metrics) and record what actually held before you mark the intent Shipped or Verified. It doesn't force every box: unchecked items stay visible, so a gap is a decision you can see rather than one you skipped silently.
- AI verification — Pathmode grades the implementation against each outcome, constraint, constitution rule, and edge case, and returns a pass/fail with a 0–100 score and per-item reasoning.
Be honest about what the AI grade is. It reviews the implementation summary (and, optionally, the diff), so it is an AI QA reviewer, not a CI run, and Pathmode doesn't watch your production metrics for you. What both paths do is write the result back onto the intent and log it as an implementation note, so the loop's output becomes evidence on the next iteration. That is what makes it a loop and not a checkbox.
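The shape of that write-back can be sketched as a small record plus a function that attaches it to the intent. Everything named here is hypothetical: the class names, the dictionary keys, and the sample score are invented for illustration, and only the pass/fail flag, the 0-100 score, and the per-item reasoning come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ItemResult:
    item: str        # the outcome, constraint, rule, or edge case graded
    passed: bool
    reasoning: str


@dataclass
class VerificationResult:
    passed: bool
    score: int       # 0-100
    items: List[ItemResult] = field(default_factory=list)


def record_result(intent: dict, result: VerificationResult) -> None:
    """Write the result onto the intent and log it as an implementation note."""
    intent["verification_result"] = result
    failed = [i.item for i in result.items if not i.passed]
    note = f"Verification {'passed' if result.passed else 'failed'} ({result.score}/100)"
    if failed:
        note += "; did not hold: " + ", ".join(failed)
    # Notes accumulate, so the next iteration can read what held last time.
    intent.setdefault("implementation_notes", []).append(note)


# Made-up example values for the payment-retry intent.
intent = {"title": "Payment retry fix"}
result = VerificationResult(
    passed=False,
    score=60,
    items=[ItemResult("No double charge on slow networks", False, "Retry fires twice")],
)
record_result(intent, result)
print(intent["implementation_notes"][-1])
```

Appending rather than overwriting the note is the design choice that matters: each pass through the loop leaves evidence for the next one.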
Write the loop, not the list
Before you hand a spec to an agent, your verification shouldn't be N tests. It should answer four questions:
- What's the fastest check that catches a regression?
- What's the manual fallback when automation can't reach?
- What's the observable signal in production that proves the intent held?
- What must not regress — and is it a Health Metric?
If you can answer the first two and go blank on the last two, you've found the legs you were about to skip. That blank is where the 4% hides.
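The third question has no schema field, so it helps to write the answer down as something executable. A minimal sketch, assuming the signal is checkout conversion compared against a pre-ship baseline: the function name, the 2% tolerance, and the sample rates are made up for illustration, and the drop is treated as relative to the baseline.

```python
def signal_held(baseline: float, observed: float, max_relative_drop: float = 0.02) -> bool:
    """True if the observed metric has not fallen more than the tolerance below baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    relative_drop = (baseline - observed) / baseline
    return relative_drop <= max_relative_drop


# Illustrative numbers only: a 50% baseline conversion falling to 48%
# is a 4% relative drop, which exceeds the assumed 2% tolerance.
print(signal_held(0.50, 0.48))  # False: the intent did not hold
print(signal_held(0.50, 0.50))  # True
```

A check like this runs after the merge, against production data, which is what separates it from the tests in the first two legs.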
Key Takeaway: The gate asks whether a check exists. The loop asks whether the intent held. Only one of them notices when checkout quietly drops.
Open any shipped intent and run its Verification Checkpoint — the leg you can't check off is the one worth writing next.