pattern moderate impact

comparative benchmarks

@agent_comp

performance thresholds derived from an analysis of 4,656 threads. use them to evaluate thread quality and user behavior.


thread outcome metrics

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| resolution rate | >60% | 45-60% | <45% | baseline: 44% resolved |
| committed rate | >12% | 7-12% | <7% | indicates ship velocity |
| handoff rate | <10% | 10-15% | >15% | lower = better ownership |
| frustration rate | 0% | <1% | >1% | 14 frustrated = 0.3% baseline |
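
the outcome bands above can be applied mechanically. a minimal sketch, assuming rates are given as percentages; `THRESHOLDS`, `band_rate` and `band_frustration` are hypothetical names, and treating a frustration rate of exactly 1% as poor is an assumption, since the table leaves that boundary open.

```python
# Hypothetical helpers: map an outcome rate (in percent) to a band from the table.
# Each entry: (higher_is_better, excellent_cutoff, poor_cutoff).
THRESHOLDS = {
    "resolution": (True, 60.0, 45.0),   # >60 excellent, <45 poor
    "committed": (True, 12.0, 7.0),     # >12 excellent, <7 poor
    "handoff": (False, 10.0, 15.0),     # <10 excellent, >15 poor
}


def band_rate(metric: str, pct: float) -> str:
    """Return 'excellent', 'good' or 'poor' for a rate given in percent."""
    higher_is_better, excellent, poor = THRESHOLDS[metric]
    if higher_is_better:
        if pct > excellent:
            return "excellent"
        if pct < poor:
            return "poor"
    else:
        if pct < excellent:
            return "excellent"
        if pct > poor:
            return "poor"
    return "good"


def band_frustration(pct: float) -> str:
    """Frustration uses its own scale: 0% excellent, <1% good, otherwise poor."""
    if pct == 0:
        return "excellent"
    if pct < 1:
        return "good"
    return "poor"  # assumption: exactly 1% is counted as poor
```

with these bands, the 44% resolution baseline lands in poor and the 0.3% frustration baseline lands in good.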

thread length & flow

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| turn count | 26-50 | 10-25 or 51-75 | <10 or >100 | sweet spot: 75% success at 26-50 |
| collaboration intensity | <50 msg/hr | 50-200 msg/hr | >500 msg/hr | 84% vs 55% success |
| steering events | 0 | 1-2 | 3+ | no_steering: 37% vs high: 61% (but indicates problems) |
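
a minimal sketch of the turn-count and steering bands, assuming integer counts per thread. the function names are hypothetical. the table does not assign a band to 76-100 turns, so the sketch returns `None` there rather than guessing.

```python
from typing import Optional


def band_turn_count(turns: int) -> Optional[str]:
    """Band a thread's turn count; None means the table does not cover the value."""
    if 26 <= turns <= 50:
        return "excellent"
    if 10 <= turns <= 25 or 51 <= turns <= 75:
        return "good"
    if turns < 10 or turns > 100:
        return "poor"
    return None  # 76-100 turns: no band is given in the table


def band_steering(events: int) -> str:
    """Band the number of steering events: 0 excellent, 1-2 good, 3+ poor."""
    if events == 0:
        return "excellent"
    if events <= 2:
        return "good"
    return "poor"
```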

prompting quality

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| prompt length | 300-1500 chars | 100-300 or 1500-3000 | <100 or >3000 | lowest steering rate |
| file references | included (@path) | partial context | none | +25pp success with refs |
| question density | <5% | 5-15% | >15% | 76% resolve at <5% |
| specificity | explicit task + context | task only | vague/exploratory | file refs = proxy |
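
a minimal sketch of two of the checks above: prompt length banding and file-reference detection. the names `band_prompt_length`, `has_file_refs` and `FILE_REF` are hypothetical, and the regex is an assumption about what an `@path` reference looks like, since the exact syntax is not specified here.

```python
import re

# Assumption: a file reference is "@" followed by a path-like token,
# e.g. "@src/app.py". The lookbehind skips things like email addresses.
FILE_REF = re.compile(r"(?<!\w)@[\w./-]+")


def band_prompt_length(n_chars: int) -> str:
    """Band a prompt by character count, per the table."""
    if 300 <= n_chars <= 1500:
        return "excellent"
    if 100 <= n_chars <= 3000:
        return "good"
    return "poor"


def has_file_refs(prompt: str) -> bool:
    """True if the prompt contains at least one @path-style reference."""
    return FILE_REF.search(prompt) is not None
```

for example, `has_file_refs("fix the bug in @src/app.py")` is true, while a prompt that only contains an email address is not counted as having a file reference.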

agent behavior

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| task tool usage | 2-6 tasks | 1 or 7-10 | 0 or 11+ | 77-79% success at 2-6 |
| error handling | fix root cause | workaround | suppress | 71.6% suppress (bad baseline) |
| instruction compliance | >80% | 50-80% | <50% | current: ~20% on prohibitions |
| oracle usage | proactive (planning) | reactive (recovery) | rescue-only | 46% in FRUSTRATED = misuse |

user behavior signals

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| wtf rate | 0% | <5% | >10% | 3.5% in resolved, 33% in frustrated |
| approval rate | any approval | - | no approvals | 94% vs 49% persistence |
| rejection rate | <20% | 20-40% | >40% | REJECTION = 47% of steering |

temporal patterns

| metric | 🟢 excellent | 🟡 good | 🔴 poor | notes |
|---|---|---|---|---|
| time of day | 2-9am | 10am-5pm | 6-9pm | 60% vs 27.5% resolution |
| day of week | weekend | weekday AM | weekday PM | +5.2pp weekend premium |

anti-pattern thresholds

| anti-pattern | 🟢 absent | 🟡 minor | 🔴 severe | detection signal |
|---|---|---|---|---|
| read/grep thrashing | 0 cycles | 1-2 cycles | 3+ cycles | 0% success pattern |
| oracle rescue | oracle in first half | oracle in second half | oracle only after failure | timing matters |
| skill underuse | 3+ skills/thread | 1-2 skills | report-only | 97% report = underuse |
| context loss | <5 re-reads | 5-10 re-reads | >10 re-reads | re-reading same files |
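
a minimal sketch of the context-loss signal, assuming a thread's file reads are available as a flat list of paths in call order. `count_rereads` and `band_context_loss` are hypothetical names, and counting a re-read as any read of a file beyond its first is an assumption about how the metric is defined.

```python
from collections import Counter
from typing import Iterable


def count_rereads(read_paths: Iterable[str]) -> int:
    """Count reads beyond the first for each file (assumed definition of a re-read)."""
    counts = Counter(read_paths)
    return sum(n - 1 for n in counts.values())


def band_context_loss(rereads: int) -> str:
    """Band the re-read count: <5 absent, 5-10 minor, >10 severe."""
    if rereads < 5:
        return "absent"
    if rereads <= 10:
        return "minor"
    return "severe"
```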

composite scoring

thread health score (0-100)

score = (
  resolution_component × 30 +     # resolved/committed = 30, handoff = 15, frustrated = 0
  length_component × 20 +          # 26-50 = 20, 10-75 = 15, else = 5
  steering_component × 15 +        # 0 steering = 15, 1-2 = 10, 3+ = 5
  prompting_component × 20 +       # file refs + 300-1500 chars = 20, partial = 10
  tool_usage_component × 15        # 2-6 tasks + proactive oracle = 15
)

| score | grade | interpretation |
|---|---|---|
| 80-100 | A | excellent execution, model for others |
| 60-79 | B | good thread, minor improvements possible |
| 40-59 | C | functional but inefficient |
| 20-39 | D | significant problems, review needed |
| 0-19 | F | failure mode, autopsy recommended |
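
a minimal sketch of the composite score and grade lookup, reading each comment in the formula as the points that component contributes. `health_score` and `grade` are hypothetical names. several cases are not specified and are filled in as labeled assumptions: unknown outcomes score 0, "partial" prompting means file refs or a 300-1500 char prompt but not both, and tool usage scores 0 unless both conditions hold.

```python
def health_score(outcome: str, turns: int, steering_events: int,
                 has_refs: bool, prompt_chars: int,
                 tasks: int, proactive_oracle: bool) -> int:
    """Thread health score on a 0-100 scale, following the component comments."""
    # resolved/committed = 30, handoff = 15, frustrated = 0
    # (assumption: any other outcome also scores 0)
    resolution = {"resolved": 30, "committed": 30, "handoff": 15}.get(outcome, 0)

    # 26-50 = 20, 10-75 = 15, else = 5
    if 26 <= turns <= 50:
        length = 20
    elif 10 <= turns <= 75:
        length = 15
    else:
        length = 5

    # 0 steering = 15, 1-2 = 10, 3+ = 5
    if steering_events == 0:
        steering = 15
    elif steering_events <= 2:
        steering = 10
    else:
        steering = 5

    # file refs + 300-1500 chars = 20, partial = 10
    # (assumption: "partial" = exactly one of the two; neither = 0)
    good_length = 300 <= prompt_chars <= 1500
    if has_refs and good_length:
        prompting = 20
    elif has_refs or good_length:
        prompting = 10
    else:
        prompting = 0

    # 2-6 tasks + proactive oracle = 15 (assumption: otherwise 0)
    tool_usage = 15 if (2 <= tasks <= 6 and proactive_oracle) else 0

    return resolution + length + steering + prompting + tool_usage


def grade(score: int) -> str:
    """Map a 0-100 score to the letter grades in the table."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    if score >= 20:
        return "D"
    return "F"
```

under these assumptions, a resolved thread with 30 turns, no steering, a 500-char prompt with file refs, 3 tasks and proactive oracle use scores 30 + 20 + 15 + 20 + 15 = 100, grade A.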

usage notes