comparative benchmarks
performance thresholds derived from an analysis of 4,656 threads. use them to evaluate thread quality and user behavior.
thread outcome metrics
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| resolution rate | >60% | 45-60% | <45% | baseline: 44% resolved |
| committed rate | >12% | 7-12% | <7% | indicates ship velocity |
| handoff rate | <10% | 10-15% | >15% | lower = better ownership |
| frustration rate | 0% | <1% | >1% | 14 frustrated = 0.3% baseline |
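
the outcome thresholds above can be applied mechanically. a minimal sketch, assuming rates are given as percentages; the helper names `rate_band` and `frustration_band` are hypothetical, and treating a frustration rate of exactly 1% as poor is an assumption because the table does not cover that value:

```python
def rate_band(pct: float, excellent: float, poor: float,
              higher_is_better: bool = True) -> str:
    """Band a percentage against the two cut-offs from the table."""
    if not higher_is_better:
        # Flip signs so the same comparisons work for "lower is better" metrics.
        pct, excellent, poor = -pct, -excellent, -poor
    if pct > excellent:
        return "excellent"
    if pct < poor:
        return "poor"
    return "good"


def frustration_band(pct: float) -> str:
    """Frustration uses a different shape: 0% / <1% / >1%."""
    if pct == 0:
        return "excellent"
    if pct < 1:
        return "good"
    return "poor"  # assumption: exactly 1% is treated as poor


baseline_resolution = rate_band(44, excellent=60, poor=45)             # "poor"
handoff = rate_band(9, excellent=10, poor=15, higher_is_better=False)  # "excellent"
baseline_frustration = frustration_band(100 * 14 / 4656)               # "good"
```

by these cut-offs the 44% resolution baseline falls just inside the poor band, which is consistent with thresholds taken from the observed distribution.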
thread length & flow
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| turn count | 26-50 | 10-25 or 51-75 | <10 or >100 | sweet spot: 75% success at 26-50 |
| collaboration intensity | <50 msg/hr | 50-200 msg/hr | >500 msg/hr | 84% vs 55% success |
| steering events | 0 | 1-2 | 3+ | no_steering: 37% vs high: 61% (but indicates problems) |
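
a minimal sketch of the turn count and steering bands above. the function names are hypothetical. the table leaves 76-100 turns unassigned, so the sketch returns "unrated" for that range rather than guessing:

```python
def turn_count_band(turns: int) -> str:
    """Band a thread by its turn count, following the table's ranges."""
    if 26 <= turns <= 50:
        return "excellent"
    if 10 <= turns <= 25 or 51 <= turns <= 75:
        return "good"
    if turns < 10 or turns > 100:
        return "poor"
    return "unrated"  # 76-100 turns: not covered by the table


def steering_band(events: int) -> str:
    """Band a thread by its number of steering events."""
    if events == 0:
        return "excellent"
    if events <= 2:
        return "good"
    return "poor"
```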
prompting quality
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| prompt length | 300-1500 chars | 100-300 or 1500-3000 | <100 or >3000 | lowest steering rate |
| file references | included (@path) | partial context | none | +25pp success with refs |
| question density | <5% | 5-15% | >15% | 76% resolve at <5% |
| specificity | explicit task + context | task only | vague/exploratory | file refs = proxy |
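
prompt length and file references are the two prompting signals that can be checked directly from the text. a minimal sketch, with assumptions: the boundary values 300 and 1500 are counted as excellent, and a file reference is taken to be any whitespace-delimited token starting with `@` (the regex and the function names are hypothetical, not the detection rule used in the analysis):

```python
import re

# Assumption: a file reference looks like "@some/path" and starts a token,
# so e-mail addresses such as dev@example.com are not counted.
FILE_REF = re.compile(r"(?<!\S)@\S+")


def prompt_length_band(prompt: str) -> str:
    """Band a prompt by its length in characters."""
    n = len(prompt)
    if 300 <= n <= 1500:
        return "excellent"
    if 100 <= n <= 3000:
        return "good"
    return "poor"


def has_file_reference(prompt: str) -> bool:
    """Return True if the prompt contains at least one @path token."""
    return FILE_REF.search(prompt) is not None
```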
agent behavior
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| task tool usage | 2-6 tasks | 1 or 7-10 | 0 or 11+ | 77-79% success at 2-6 |
| error handling | fix root cause | workaround | suppress | 71.6% suppress (bad baseline) |
| instruction compliance | >80% | 50-80% | <50% | current: ~20% on prohibitions |
| oracle usage | proactive (planning) | reactive (recovery) | rescue-only | 46% in FRUSTRATED = misuse |
user behavior signals
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| wtf rate | 0% | <5% | >10% | 3.5% in resolved, 33% in frustrated |
| approval rate | any approval | - | no approvals | 94% vs 49% persistence |
| rejection rate | <20% | 20-40% | >40% | REJECTION = 47% of steering |
temporal patterns
| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| time of day | 2-9am | 10am-5pm | 6-9pm | 60% vs 27.5% resolution |
| day of week | weekend | weekday AM | weekday PM | +5.2pp weekend premium |
anti-pattern thresholds
| anti-pattern | 🟢 absent | 🟡 minor | 🔴 severe | detection signal |
|---|---|---|---|---|
| read/grep thrashing | 0 cycles | 1-2 cycles | 3+ cycles | 0% success pattern |
| oracle rescue | oracle in first half | oracle in second half | oracle only after failure | timing matters |
| skill underuse | 3+ skills/thread | 1-2 skills | report-only | 97% report = underuse |
| context loss | <5 re-reads | 5-10 re-reads | >10 re-reads | re-reading same files |
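
the context loss row can be approximated from a log of file reads. a rough sketch, assuming a re-read means any read of a file after its first read in the same thread; that definition and the function names are assumptions, since the table only gives "re-reading same files" as the signal:

```python
from collections import Counter
from typing import Iterable


def count_rereads(read_paths: Iterable[str]) -> int:
    """Count reads beyond the first for each file path."""
    counts = Counter(read_paths)
    return sum(n - 1 for n in counts.values())


def context_loss_band(rereads: int) -> str:
    """Map a re-read count onto the table's absent / minor / severe bands."""
    if rereads < 5:
        return "absent"
    if rereads <= 10:
        return "minor"
    return "severe"


reads = ["a.py", "b.py", "a.py", "a.py"]
band = context_loss_band(count_rereads(reads))  # 2 re-reads -> "absent"
```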
composite scoring
thread health score (0-100)
score = (
resolution_component × 30 + # resolved/committed = 30, handoff = 15, frustrated = 0
length_component × 20 + # 26-50 = 20, 10-75 = 15, else = 5
steering_component × 15 + # 0 steering = 15, 1-2 = 10, 3+ = 5
prompting_component × 20 + # file refs + 300-1500 chars = 20, partial = 10
tool_usage_component × 15 # 2-6 tasks + proactive oracle = 15
)
each component is a fraction in [0, 1]; the comments give the resulting weighted points, which sum to at most 100.
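
a minimal sketch of the health score. the point values in the formula's comments already add up to 100, so the sketch sums those points directly instead of multiplying fractions by weights. several cases are not specified in the formula and are filled with labeled assumptions: unlisted outcomes score 0, "partial" prompting means exactly one of the two prompting conditions holds, and tool usage scores 0 unless both of its conditions hold. `ThreadSignals` and `thread_health_score` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class ThreadSignals:
    outcome: str            # e.g. "resolved", "committed", "handoff", "frustrated"
    turns: int
    steering_events: int
    has_file_refs: bool
    prompt_chars: int
    task_calls: int
    proactive_oracle: bool


def thread_health_score(t: ThreadSignals) -> int:
    """Sum the weighted points of the five components (0-100)."""
    # Assumption: outcomes not listed in the formula contribute 0 points.
    resolution = {"resolved": 30, "committed": 30, "handoff": 15}.get(t.outcome, 0)

    if 26 <= t.turns <= 50:
        length = 20
    elif 10 <= t.turns <= 75:
        length = 15
    else:
        length = 5

    if t.steering_events == 0:
        steering = 15
    elif t.steering_events <= 2:
        steering = 10
    else:
        steering = 5

    # Assumption: "partial" = exactly one of the two conditions; neither = 0.
    met = int(t.has_file_refs) + int(300 <= t.prompt_chars <= 1500)
    prompting = {2: 20, 1: 10, 0: 0}[met]

    # Assumption: no partial credit is defined for tool usage.
    tools = 15 if (2 <= t.task_calls <= 6 and t.proactive_oracle) else 0

    return resolution + length + steering + prompting + tools


example = ThreadSignals("resolved", 32, 1, True, 800, 3, True)
score = thread_health_score(example)  # 30 + 20 + 10 + 20 + 15 = 95
```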
| score | grade | interpretation |
|---|---|---|
| 80-100 | A | excellent execution, model for others |
| 60-79 | B | good thread, minor improvements possible |
| 40-59 | C | functional but inefficient |
| 20-39 | D | significant problems, review needed |
| 0-19 | F | failure mode, autopsy recommended |
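
the grade table maps directly onto a lookup. a minimal sketch; the function name `grade` is hypothetical, and rejecting scores outside 0-100 is a design choice of the sketch rather than something the table states:

```python
def grade(score: float) -> str:
    """Convert a 0-100 thread health score into a letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    if score >= 20:
        return "D"
    return "F"
```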
usage notes
- thresholds derived from observed distribution, not idealized targets
- “excellent” = top ~10-15% of observed behavior
- “poor” = bottom ~20% or known failure correlates
- some metrics are inversely related: high steering correlates with higher resolution (61% vs 37% with no steering), yet it still signals an upstream problem
- temporal metrics may reflect selection bias (who works at 3am?)