verification gates analysis

threads that verify before declaring done (test runs, reviews, build checks) vs threads that don’t.

key finding

verification gates correlate with a 16.9 percentage point higher success rate (78.2% vs 61.3%).

metric	with verification	without verification	delta
success rate	78.2%	61.3%	+16.9pp
committed rate	25.4%	18.1%	+7.3pp
resolved rate	52.7%	43.2%	+9.5pp
frustrated rate	2.0%	0.6%	+1.4pp
avg messages	119	24	+95
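the deltas in the table are absolute differences in percentage points, not relative lifts. a minimal sketch that recomputes them from the rates above; the names `RATES` and `delta_pp` are illustrative and not part of the original analysis:

```python
# Rates copied from the table above, in percent:
# (with verification, without verification).
RATES = {
    "success": (78.2, 61.3),
    "committed": (25.4, 18.1),
    "resolved": (52.7, 43.2),
    "frustrated": (2.0, 0.6),
}


def delta_pp(with_gate: float, without_gate: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(with_gate - without_gate, 1)


deltas = {name: delta_pp(w, wo) for name, (w, wo) in RATES.items()}

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}pp")
```

committed plus resolved adds up to the success rate in both columns, to within 0.1 of rounding.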

distribution

total threads analyzed: 4,656
with verification gates: 2,802 (60%)
without verification gates: 1,854 (40%)

verification type frequency

type	count	% of verified threads
explicit verify phrases	2,369	85%
test runs	1,585	57%
build checks	1,533	55%
lint checks	1,286	46%
verification confirm	1,195	43%
review requests	520	19%

interpretation

the verification gap is real

threads with explicit test runs, build checks, or review requests end in committed/resolved state 78% of the time vs 61% for threads without. the 17-point gap is a meaningful signal, but this comparison alone cannot separate the effect of verification from the effect of thread length (see the caveat below).
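one way to check that a gap this size is larger than sampling noise is a pooled two-proportion z-test. the sketch below is an illustration, not part of the original analysis: the success counts are reconstructed from the rounded percentages and the thread totals reported above, so they are approximate.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for successes x out of n."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Thread totals from the distribution section.
n_verified, n_unverified = 2802, 1854

# ASSUMPTION: counts rebuilt from rounded rates (78.2% and 61.3%).
x_verified = round(0.782 * n_verified)
x_unverified = round(0.613 * n_unverified)

z = two_proportion_z(x_verified, n_verified, x_unverified, n_unverified)
print(f"z = {z:.1f}")
```

a large z only says the difference is unlikely to be sampling noise. it says nothing about whether thread length, rather than verification, drives the difference.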

caveat: message count confound

verified threads average 119 messages vs 24 for unverified. longer threads naturally include more verification steps AND have more opportunity to reach resolution. the data fits two readings:

verification → higher success (the optimistic read)
longer threads → both more verification AND more success (the confound)
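a standard way to tell the two readings apart is to compare success rates within message-count buckets, so verified and unverified threads of similar length are compared with each other. a minimal sketch under stated assumptions: the `Thread` record, the bucket edges, and the function names are hypothetical, and no stratified numbers exist in the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    messages: int   # total message count
    verified: bool  # at least one verification gate detected
    success: bool   # ended COMMITTED or RESOLVED


# ASSUMPTION: bucket edges are placeholders; choose them from the real
# length distribution. Each pair is (inclusive low, exclusive high).
BUCKETS = [(0, 25), (25, 75), (75, 200), (200, None)]


def bucket_of(messages: int) -> str:
    """Return the label of the length bucket a thread falls into."""
    for low, high in BUCKETS:
        if high is None:
            return f"{low}+"
        if messages < high:
            return f"{low}-{high - 1}"
    raise ValueError("BUCKETS must end with an open-ended bucket")


def stratified_success(threads):
    """Success rate per bucket, split by verified / unverified."""
    # counts[bucket][verified] = [successes, total]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for t in threads:
        cell = counts[bucket_of(t.messages)][t.verified]
        cell[0] += int(t.success)
        cell[1] += 1
    return {
        bucket: {v: (s / n if n else None) for v, (s, n) in by_gate.items()}
        for bucket, by_gate in counts.items()
    }
```

if the gap persists inside each bucket, the optimistic read gains support. if it shrinks toward zero, thread length explains most of it.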

frustration paradox

verified threads show HIGHER frustration rate (2.0% vs 0.6%). hunch: verification surfaces problems. unverified threads that would have failed just… stop without the user realizing. verification makes failures visible.

high-verification exemplars

examples of threads with 3+ verification patterns and their outcomes (two COMMITTED, one RESOLVED):

T-00298580: 37 verification moments, ended COMMITTED
T-019afee0-7141: 53 verification moments, ended COMMITTED
T-0093d6c6: 32 verification moments, ended RESOLVED

common pattern: go test / pnpm test / vitest interspersed throughout, with a “tests pass” confirmation before shipping.

unverified success cases

some threads reach COMMITTED/RESOLVED without verification:

short exploratory threads (unverified threads average 24 messages)
quick lookups or config changes
contexts where verification isn’t applicable

these aren’t failures of process—they’re appropriately scoped tasks.

recommendations

for implementation tasks: always include at least one verification gate (test run, build check) before declaring done
for exploratory tasks: verification not required—these are information-gathering
for debugging tasks: verification is the whole point—run the failing test, confirm the fix
“ship it” without verification: treat as a smell. 18.1% of unverified threads still ended COMMITTED, so some changes shipped with no recorded check. whether they contained bugs is not measured here
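the recommendations above reduce to a small policy: which task types need a gate, and whether one was seen before declaring done. a sketch, assuming a hypothetical `TaskType` enum and a `gates_seen` count supplied by whatever detects verification steps:

```python
from enum import Enum


class TaskType(Enum):
    IMPLEMENTATION = "implementation"
    EXPLORATORY = "exploratory"
    DEBUGGING = "debugging"


# Mirrors the recommendations: implementation and debugging need a gate,
# exploratory (information-gathering) work does not.
REQUIRES_GATE = {
    TaskType.IMPLEMENTATION: True,
    TaskType.EXPLORATORY: False,
    TaskType.DEBUGGING: True,
}


def may_declare_done(task: TaskType, gates_seen: int) -> bool:
    """True when the task needs no gate or at least one gate was observed."""
    return (not REQUIRES_GATE[task]) or gates_seen >= 1
```

for example, `may_declare_done(TaskType.IMPLEMENTATION, 0)` is false, which is the “ship it” smell from the last recommendation.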

methodology

patterns detected via regex:

test_run: pnpm test, go test, vitest, pytest, etc.
build_check: pnpm build, go build, tsc, cargo check, etc.
lint_check: eslint, golint, cargo clippy, etc.
review_request: “review the diff”, “do a deep review”, etc.
verification_confirm: “tests pass”, “build succeeded”, etc.
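the pattern list above can be sketched as a small detector. the exact regexes used in the analysis are not given, so the ones below are assumptions built only from the listed examples (each list ends in “etc.”, so the real patterns are broader). the “explicit verify phrases” type from the frequency table has no listed pattern and is left out.

```python
import re

# ASSUMPTION: illustrative regexes covering only the examples named above.
PATTERNS = {
    "test_run": re.compile(r"\b(?:pnpm test|go test|vitest|pytest)\b"),
    "build_check": re.compile(r"\b(?:pnpm build|go build|tsc|cargo check)\b"),
    "lint_check": re.compile(r"\b(?:eslint|golint|cargo clippy)\b"),
    "review_request": re.compile(
        r"\b(?:review the diff|do a deep review)\b", re.IGNORECASE
    ),
    "verification_confirm": re.compile(
        r"\b(?:tests pass|build succeeded)\b", re.IGNORECASE
    ),
}


def detect_gates(messages: list[str]) -> dict[str, int]:
    """Count how many messages match each verification pattern."""
    counts = {name: 0 for name in PATTERNS}
    for text in messages:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


def has_verification_gate(messages: list[str]) -> bool:
    """A thread counts as verified if any pattern matched at least once."""
    return any(detect_gates(messages).values())
```

because a single thread can match several patterns, per-type percentages computed this way overlap and sum to more than 100%.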

outcome determined from final 3 user/assistant messages using keyword matching.
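a sketch of what keyword-based outcome detection over the final 3 messages could look like. the keyword lists, the FRUSTRATED and UNKNOWN labels, and the precedence order are all hypothetical; the source does not specify them. `messages` is assumed to hold only user/assistant text, oldest first.

```python
# ASSUMPTION: keywords and label order are placeholders, not the real lists.
# Earlier entries win when several labels match.
OUTCOME_KEYWORDS = [
    ("COMMITTED", ("committed", "pushed", "merged")),
    ("FRUSTRATED", ("still broken", "this is wrong")),
    ("RESOLVED", ("works now", "fixed", "thanks")),
]


def classify_outcome(messages: list[str], tail: int = 3) -> str:
    """Label a thread from its last `tail` messages using substring matches."""
    window = " ".join(messages[-tail:]).lower()
    for label, keywords in OUTCOME_KEYWORDS:
        if any(keyword in window for keyword in keywords):
            return label
    return "UNKNOWN"
```

plain substring matching like this is easy to audit but brittle: “not fixed” would still match “fixed”, so the outcome labels carry some noise.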