verification gates analysis
threads that verify before declaring done (test runs, reviews, build checks) vs threads that don’t.
key finding
threads with verification gates show a success rate about 17 percentage points higher than threads without.
| metric | with verification | without verification | delta |
|---|---|---|---|
| success rate | 78.2% | 61.3% | +16.9pp |
| committed rate | 25.4% | 18.1% | +7.3pp |
| resolved rate | 52.7% | 43.2% | +9.5pp |
| frustrated rate | 2.0% | 0.6% | +1.4pp |
| avg messages | 119 | 24 | +95 |
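the delta column is in percentage points (absolute difference), not relative change. a minimal sketch of the arithmetic, using the rates from the table; the names `rates`, `delta_pp`, and `relative_lift` are illustrative and not part of the original analysis:

```python
# Rates copied from the table above, in percent: (with verification, without).
rates = {
    "success": (78.2, 61.3),
    "committed": (25.4, 18.1),
    "resolved": (52.7, 43.2),
    "frustrated": (2.0, 0.6),
}


def delta_pp(with_v: float, without_v: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(with_v - without_v, 1)


def relative_lift(with_v: float, without_v: float) -> float:
    """Relative change in percent, rounded to one decimal."""
    return round(100 * (with_v - without_v) / without_v, 1)


deltas = {name: delta_pp(w, wo) for name, (w, wo) in rates.items()}
success_lift = relative_lift(*rates["success"])
```

the two measures answer different questions: the same 16.9pp gap is a larger number when stated as relative lift, so it helps to say which one is being quoted.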
distribution
- total threads analyzed: 4,656
- with verification gates: 2,802 (60%)
- without verification gates: 1,854 (40%)
verification type frequency
| type | count | % of verified threads |
|---|---|---|
| explicit verify phrases | 2,369 | 85% |
| test runs | 1,585 | 57% |
| build checks | 1,533 | 55% |
| lint checks | 1,286 | 46% |
| verification confirm | 1,195 | 43% |
| review requests | 520 | 19% |
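the percentages in this table add up to well over 100% because one thread can match several verification types. a rough sketch of how the shares follow from the counts, assuming the denominator is the 2,802 verified threads; a recomputed value may differ from the table by a point depending on rounding:

```python
# Denominator: threads with at least one verification gate.
VERIFIED_THREADS = 2802

# Counts copied from the table above.
type_counts = {
    "explicit verify phrases": 2369,
    "test runs": 1585,
    "build checks": 1533,
    "lint checks": 1286,
    "verification confirm": 1195,
    "review requests": 520,
}

# Share of verified threads matching each type, rounded to whole percent.
shares = {name: round(100 * n / VERIFIED_THREADS) for name, n in type_counts.items()}

# Types overlap, so the shares are not expected to sum to 100.
total_share = sum(shares.values())

# Average number of distinct verification types per verified thread.
avg_types_per_thread = sum(type_counts.values()) / VERIFIED_THREADS
```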
interpretation
the verification gap is real
threads with verification gates (test runs, build checks, review requests, and similar) end in a committed or resolved state 78% of the time vs 61% for threads without. a 17-point gap is a meaningful signal, but on its own it does not rule out that thread length is driving part of it.
caveat: message count confound
verified threads average 119 messages vs 24 for unverified. longer threads naturally include more verification steps AND have more opportunity to reach resolution. the causality arrow could go both ways:
- verification → higher success (the optimistic read)
- longer threads → both more verification AND more success (the confound)
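one way to probe the confound is to compare success rates within message-count buckets, so verified and unverified threads of similar length are compared against each other. this is a sketch only: the `Thread` record, the bucket boundaries, and the function names are assumptions, not the structure used in the original analysis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Thread:
    # Hypothetical record; field names are illustrative.
    messages: int
    verified: bool
    success: bool


def bucket(messages: int) -> str:
    """Assign a thread to a length bucket (boundaries are arbitrary assumptions)."""
    if messages < 25:
        return "short (<25)"
    if messages < 100:
        return "medium (25-99)"
    return "long (100+)"


def success_by_bucket(threads: list[Thread]) -> dict[tuple[str, bool], float]:
    """Success rate for each (length bucket, verified) cell."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [successes, total]
    for t in threads:
        cell = tally[(bucket(t.messages), t.verified)]
        cell[0] += int(t.success)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in tally.items()}
```

if the gap between verified and unverified threads persists inside each bucket, the confound explains less of it; if the gap shrinks or vanishes, thread length is doing most of the work.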
frustration paradox
verified threads show HIGHER frustration rate (2.0% vs 0.6%). hunch: verification surfaces problems. unverified threads that would have failed just… stop without the user realizing. verification makes failures visible.
high-verification exemplars
threads with 3+ verification patterns tend to end COMMITTED or RESOLVED:
- T-00298580: 37 verification moments, ended COMMITTED
- T-019afee0-7141: 53 verification moments, ended COMMITTED
- T-0093d6c6: 32 verification moments, ended RESOLVED
common pattern: go test / pnpm test / vitest interspersed throughout, with “tests pass” confirmation before ship.
unverified success cases
some threads reach COMMITTED/RESOLVED without verification:
- short exploratory threads (avg 24 messages)
- quick lookups or config changes
- contexts where verification isn’t applicable
these aren’t failures of process—they’re appropriately scoped tasks.
recommendations
- for implementation tasks: always include at least one verification gate (test run, build check) before declaring done
- for exploratory tasks: verification not required—these are information-gathering
- for debugging tasks: verification is the whole point—run the failing test, confirm the fix
- “ship it” without verification: treat as a smell. 18% of unverified threads still ended committed, so some of those changes may have shipped with bugs that nobody checked for
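the first recommendation can be sketched as a small gate runner: run each check in order and only declare done if all of them exit cleanly. this is an illustrative sketch, not tooling from the analysis; `run_gates` and the `GATES` list are hypothetical, and the commands should be swapped for whatever the project actually uses.

```python
import subprocess
import sys


def run_gates(commands: list[list[str]]) -> bool:
    """Run each command in order; return False on the first failure."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # The tool is not installed, so the gate cannot pass.
            print(f"gate not found: {cmd[0]}", file=sys.stderr)
            return False
        if result.returncode != 0:
            print(f"gate failed: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True


# Example gates (assumed; pick the ones that match the project).
GATES = [["pnpm", "build"], ["pnpm", "test"]]
```

calling `run_gates(GATES)` before declaring an implementation task done turns “ship it” into a checked step rather than an assertion.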
methodology
patterns detected via regex:
- test_run: pnpm test, go test, vitest, pytest, etc.
- build_check: pnpm build, go build, tsc, cargo check, etc.
- lint_check: eslint, golint, cargo clippy, etc.
- review_request: “review the diff”, “do a deep review”, etc.
- verification_confirm: “tests pass”, “build succeeded”, etc.
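a minimal sketch of what regex detection over these categories might look like. the exact expressions used in the analysis are not given (each list ends in "etc."), so the patterns below cover only the listed examples and are assumptions:

```python
import re

# Illustrative patterns built from the examples listed above only.
PATTERNS = {
    "test_run": re.compile(r"\b(?:pnpm test|go test|vitest|pytest)\b"),
    "build_check": re.compile(r"\b(?:pnpm build|go build|tsc|cargo check)\b"),
    "lint_check": re.compile(r"\b(?:eslint|golint|cargo clippy)\b"),
    "review_request": re.compile(
        r"\b(?:review the diff|do a deep review)\b", re.IGNORECASE
    ),
    "verification_confirm": re.compile(
        r"\b(?:tests pass|build succeeded)\b", re.IGNORECASE
    ),
}


def detect(message: str) -> set[str]:
    """Return the names of every verification pattern found in one message."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(message)}
```

for example, `detect("ran go test ./... and tests pass")` matches both `test_run` and `verification_confirm`, which is why one thread can count toward several verification types.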
outcome determined from final 3 user/assistant messages using keyword matching.
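a rough sketch of the outcome step, assuming simple substring matching over the last three messages. the keyword lists, the priority order, and the `UNKNOWN` fallback are hypothetical; the analysis does not list the actual keywords.

```python
# Hypothetical keyword lists; checked in this order, first match wins.
OUTCOME_KEYWORDS = {
    "COMMITTED": ("committed", "pushed", "merged"),
    "RESOLVED": ("that works", "fixed", "thanks"),
    "FRUSTRATED": ("still broken", "this is wrong", "give up"),
}


def classify_outcome(messages: list[str], window: int = 3) -> str:
    """Label a thread from keywords in its final `window` messages."""
    tail = " ".join(messages[-window:]).lower()
    for label, keywords in OUTCOME_KEYWORDS.items():
        if any(keyword in tail for keyword in keywords):
            return label
    return "UNKNOWN"  # assumed fallback when no keyword matches
```

keyword matching on a short tail is cheap but blunt: a thread that says “committed” earlier and ends on an unrelated note would be missed, which is worth keeping in mind when reading the outcome rates.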