comparative benchmarks

performance thresholds derived from an analysis of 4,656 threads. use them to evaluate thread quality and user behavior.

thread outcome metrics

metric	🟢 excellent	🟡 good	🔴 poor	notes
resolution rate	>60%	45-60%	<45%	baseline: 44% resolved
committed rate	>12%	7-12%	<7%	indicates ship velocity
handoff rate	<10%	10-15%	>15%	lower = better ownership
frustration rate	0%	<1%	>1%	14 frustrated = 0.3% baseline
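a minimal sketch of how the outcome bands above could be applied to a set of measured rates. the function names are hypothetical, and two boundary choices are assumptions not stated in the table: values exactly on a band edge (e.g. 60% resolution) fall into "good", and a frustration rate of exactly 1% is treated as "poor". in the usage line, 44 and 0.3 are the baselines from the table; the committed and handoff values are made up for illustration.

```python
def band_high(value, good_min, good_max):
    """Higher is better: above good_max is excellent, below good_min is poor."""
    if value > good_max:
        return "excellent"
    if value < good_min:
        return "poor"
    return "good"


def band_low(value, good_min, good_max):
    """Lower is better: below good_min is excellent, above good_max is poor."""
    if value < good_min:
        return "excellent"
    if value > good_max:
        return "poor"
    return "good"


def frustration_band(rate_pct):
    """0% is excellent, under 1% is good; exactly 1% counts as poor (assumption)."""
    if rate_pct == 0:
        return "excellent"
    return "good" if rate_pct < 1 else "poor"


def classify_outcomes(resolution_pct, committed_pct, handoff_pct, frustration_pct):
    """Map the four outcome rates (in percent) to their bands."""
    return {
        "resolution rate": band_high(resolution_pct, 45, 60),
        "committed rate": band_high(committed_pct, 7, 12),
        "handoff rate": band_low(handoff_pct, 10, 15),
        "frustration rate": frustration_band(frustration_pct),
    }


# 44% resolution and 0.3% frustration are the table baselines;
# 8% committed and 9% handoff are illustrative values only.
baseline = classify_outcomes(44, 8, 9, 0.3)
```

under these thresholds the 44% baseline resolution rate lands in the "poor" band, which matches the note that thresholds describe observed distribution rather than targets.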

thread length & flow

metric	🟢 excellent	🟡 good	🔴 poor	notes
turn count	26-50	10-25 or 51-75	<10 or >75	sweet spot: 75% success at 26-50
collaboration intensity	<50 msg/hr	50-200 msg/hr	>500 msg/hr	84% vs 55% success
steering events	0	1-2	3+	resolution: 37% with no steering vs 61% with high steering (but steering signals upstream problems)

prompting quality

metric	🟢 excellent	🟡 good	🔴 poor	notes
prompt length	300-1500 chars	100-300 or 1500-3000	<100 or >3000	lowest steering rate
file references	included (@path)	partial context	none	+25pp success with refs
question density	<5%	5-15%	>15%	76% resolve at <5%
specificity	explicit task + context	task only	vague/exploratory	file refs = proxy
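the two mechanically checkable rows above (prompt length and file references) can be sketched as follows. this is an illustration, not the method used in the analysis: the regex for an `@path` reference is an assumption (an `@` at the start of a whitespace-delimited token, followed by path-like characters), and the function names are hypothetical. question density and specificity are left out because the table does not define how they are measured.

```python
import re

# Assumed shape of a file reference: "@" at the start of a token, then
# path-like characters. The lookbehind keeps email addresses from matching.
FILE_REF = re.compile(r"(?<!\S)@[\w./-]+")


def prompt_length_band(prompt: str) -> str:
    """Band a prompt by character count using the table's thresholds."""
    n = len(prompt)
    if 300 <= n <= 1500:
        return "excellent"
    if 100 <= n <= 3000:
        return "good"
    return "poor"


def has_file_reference(prompt: str) -> bool:
    """True if the prompt contains at least one @path-style reference."""
    return bool(FILE_REF.search(prompt))
```

for example, `has_file_reference("fix the retry loop in @src/client/retry.ts")` is true, while a prompt that only contains an email address is not counted as having a file reference.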

agent behavior

metric	🟢 excellent	🟡 good	🔴 poor	notes
task tool usage	2-6 tasks	1 or 7-10	0 or 11+	77-79% success at 2-6
error handling	fix root cause	workaround	suppress	71.6% suppress (bad baseline)
instruction compliance	>80%	50-80%	<50%	current: ~20% on prohibitions
oracle usage	proactive (planning)	reactive (recovery)	rescue-only	46% in FRUSTRATED = misuse

user behavior signals

metric	🟢 excellent	🟡 good	🔴 poor	notes
wtf rate	0%	<5%	>10%	3.5% in resolved, 33% in frustrated
approval rate	any approval	-	no approvals	94% vs 49% persistence
rejection rate	<20%	20-40%	>40%	REJECTION = 47% of steering

temporal patterns

metric	🟢 excellent	🟡 good	🔴 poor	notes
time of day	2-9am	10am-5pm	6-9pm	60% vs 27.5% resolution
day of week	weekend	weekday AM	weekday PM	+5.2pp weekend premium

anti-pattern thresholds

anti-pattern	🟢 absent	🟡 minor	🔴 severe	detection signal
read/grep thrashing	0 cycles	1-2 cycles	3+ cycles	0% success pattern
oracle rescue	oracle in first half	oracle in second half	oracle only after failure	timing matters
skill underuse	3+ skills/thread	1-2 skills	report-only	97% report = underuse
context loss	<5 re-reads	5-10 re-reads	>10 re-reads	re-reading same files
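the context loss row lends itself to a simple count. a minimal sketch, assuming a "re-read" means any read of a file that was already read earlier in the same thread (the table does not define it more precisely) and that the input is the ordered list of file paths the agent read. function names are hypothetical.

```python
from collections import Counter


def count_rereads(read_paths):
    """Count reads of files that had already been read earlier in the thread."""
    counts = Counter(read_paths)
    return sum(n - 1 for n in counts.values())


def context_loss_severity(read_paths):
    """Map the re-read count to the table's bands: <5, 5-10, >10."""
    rereads = count_rereads(read_paths)
    if rereads < 5:
        return "absent"
    if rereads <= 10:
        return "minor"
    return "severe"
```

a thread that reads `a.py`, `b.py`, `a.py`, `a.py` has two re-reads and is classified as "absent".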

composite scoring

thread health score (0-100)

each component is a fraction from 0 to 1; the comments give the resulting points (fraction × weight).

score = (
  resolution_component × 30 +     # resolved/committed = 30, handoff = 15, frustrated = 0
  length_component × 20 +          # 26-50 turns = 20, 10-25 or 51-75 = 15, else = 5
  steering_component × 15 +        # 0 steering events = 15, 1-2 = 10, 3+ = 5
  prompting_component × 20 +       # file refs + 300-1500 chars = 20, partial = 10
  tool_usage_component × 15        # 2-6 tasks + proactive oracle = 15
)

score	grade	interpretation
80-100	A	excellent execution, model for others
60-79	B	good thread, minor improvements possible
40-59	C	functional but inefficient
20-39	D	significant problems, review needed
0-19	F	failure mode, autopsy recommended
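the score formula and grade table above can be sketched in code. this is one reading of the formula, not a reference implementation: it adds up the point values listed in the formula's comments. several details are assumptions because the source leaves them open: any outcome other than resolved, committed, or handoff scores 0; "partial" prompting means exactly one of the two conditions holds, and neither scores 0; tool usage earns 15 only when both conditions hold and 0 otherwise. the `ThreadSummary` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadSummary:
    outcome: str            # e.g. "resolved", "committed", "handoff", "frustrated"
    turns: int
    steering_events: int
    has_file_refs: bool
    prompt_chars: int
    task_calls: int
    proactive_oracle: bool


def health_score(t: ThreadSummary) -> int:
    """Sum the per-component points listed in the formula's comments."""
    # Outcomes not named in the source score 0 (assumption).
    resolution = {"resolved": 30, "committed": 30, "handoff": 15}.get(t.outcome.lower(), 0)

    if 26 <= t.turns <= 50:
        length = 20
    elif 10 <= t.turns <= 75:
        length = 15
    else:
        length = 5

    if t.steering_events == 0:
        steering = 15
    elif t.steering_events <= 2:
        steering = 10
    else:
        steering = 5

    # "partial" is read as exactly one of the two conditions (assumption).
    good_length = 300 <= t.prompt_chars <= 1500
    if t.has_file_refs and good_length:
        prompting = 20
    elif t.has_file_refs or good_length:
        prompting = 10
    else:
        prompting = 0

    # The source gives only the full-credit case; 0 otherwise is an assumption.
    tools = 15 if (2 <= t.task_calls <= 6 and t.proactive_oracle) else 0

    return resolution + length + steering + prompting + tools


def grade(score: int) -> str:
    """Map a 0-100 score to the letter grades in the table."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    if score >= 20:
        return "D"
    return "F"
```

a resolved thread with 30 turns, no steering, a 800-character prompt with file refs, 3 task calls, and proactive oracle use scores the full 100 and grades A. a handoff thread with 80 turns, 3 steering events, a 50-character prompt without refs, and no task calls scores 15 + 5 + 5 + 0 + 0 = 25 and grades D.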

usage notes

thresholds derived from observed distribution, not idealized targets
“excellent” = top ~10-15% of observed behavior
“poor” = bottom ~20% or known failure correlates
some metrics are counterintuitive: high steering correlates with high resolution, yet the steering itself indicates an upstream problem
temporal metrics may reflect selection bias (who works at 3am?)