MEASUREMENT FRAMEWORK
operational KPIs for amp thread quality monitoring
OVERVIEW
this framework defines what to measure, how often, and the baseline targets, all derived from an analysis of 4,656 threads.
TIER 1: CRITICAL KPIS (daily tracking)
1.1 resolution rate
| metric | baseline | target | red line |
|---|---|---|---|
| RESOLVED+COMMITTED % | 51% | >60% | <40% |
| FRUSTRATED % | <1% | <0.5% | >2% |
how to measure: classify thread outcome at close. count by status daily.
data source: thread metadata, closing message classification
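the daily count can be sketched as below. this is a minimal illustration, not the production pipeline: the function names are hypothetical, and the input is assumed to be a plain list of outcome labels for threads closed that day.

```python
from collections import Counter

# Outcome labels come from the baseline table; counting RESOLVED and
# COMMITTED together follows the RESOLVED+COMMITTED metric above.
SUCCESS_STATUSES = {"RESOLVED", "COMMITTED"}


def outcome_rates(statuses):
    """Return (resolved_committed_pct, frustrated_pct) for one day's closed threads."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    success = sum(counts[s] for s in SUCCESS_STATUSES)
    return 100 * success / total, 100 * counts["FRUSTRATED"] / total


def resolution_status(pct):
    """Compare a RESOLVED+COMMITTED % against the target (>60%) and red line (<40%)."""
    if pct > 60:
        return "on target"
    if pct < 40:
        return "red line"
    return "below target"
```

usage: `outcome_rates(day_statuses)` returns both percentages; pass the first one to `resolution_status` to decide whether the day needs attention.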
1.2 approval:steering ratio
| metric | baseline | target | red line |
|---|---|---|---|
| ratio (team avg) | ~2.5:1 | >3:1 | <1.5:1 |
| steering density | ~5% | <5% | >8% |
how to measure: count user messages classified as APPROVAL vs STEERING per thread. aggregate weekly by user.
data source: user message classification (imperative detection, correction phrases)
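a rough sketch of the per-thread and per-user calculation follows. the source does not define steering density, so the code assumes it means steering messages divided by all user messages in the thread. function names are hypothetical.

```python
def approval_steering(labels):
    """labels: classification labels for the user messages of one thread."""
    approvals = labels.count("approval")
    steerings = labels.count("steering")
    # The ratio is undefined when there is no steering; return None, not infinity.
    ratio = approvals / steerings if steerings else None
    # Assumption: density = steering messages / all user messages.
    density = steerings / len(labels) if labels else 0.0
    return ratio, density


def user_ratio(threads):
    """Pool counts across one user's threads for the week, then take a single ratio."""
    approvals = sum(t.count("approval") for t in threads)
    steerings = sum(t.count("steering") for t in threads)
    return approvals / steerings if steerings else None
```

pooling counts before dividing keeps one short thread with no steering from dominating the weekly figure. averaging per-thread ratios instead is an equally valid choice and gives different numbers.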
1.3 thread length distribution
| zone | current % | target % | action if violated |
|---|---|---|---|
| <10 turns | ~15% | <10% | flag as abandoned |
| 26-50 turns (sweet spot) | ~20% | >30% | optimize toward |
| >100 turns | ~8% | <5% | mandatory handoff |
how to measure: count turns per thread at close. bucket into zones.
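the bucketing step could look like the sketch below. the table above names only three zones, so the intermediate 10-25 and 51-100 buckets are borrowed from the success thresholds table. function names are hypothetical.

```python
from collections import Counter


def turn_zone(turns):
    """Map a thread's turn count at close to a length zone."""
    if turns < 10:
        return "<10"
    if turns <= 25:
        return "10-25"
    if turns <= 50:
        return "26-50"
    if turns <= 100:
        return "51-100"
    return ">100"


def zone_shares(turn_counts):
    """Return the percentage of threads in each zone that actually occurs."""
    total = len(turn_counts)
    if total == 0:
        return {}
    counts = Counter(turn_zone(t) for t in turn_counts)
    return {zone: 100 * n / total for zone, n in counts.items()}
```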
TIER 2: QUALITY SIGNALS (weekly tracking)
2.1 prompt quality
| signal | baseline | target | measurement |
|---|---|---|---|
| opener 300-1500 chars | ~40% | >60% | first user message length |
| file refs in opener | ~25% | >40% | @ or file path in first msg |
| interrogative/descriptive style | ~50% | >65% | sentence structure classification |
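the first two signals are simple enough to sketch. the regex is an assumed heuristic for "@ or file path", not the detector actually used, and it will produce false positives on tokens such as "e.g." or domain names. sentence-style classification is left out because the source does not say how it is done.

```python
import re

# Assumption: a file ref is an @-mention, or a token that ends in a dot
# followed by a short alphabetic extension (for example src/app.py).
FILE_REF = re.compile(r"(?:^|\s)@\S+|\b[\w./-]+\.[A-Za-z]{1,5}\b")


def opener_signals(first_message):
    """Compute the length and file-reference signals for a thread's first user message."""
    length = len(first_message)
    return {
        "length_in_range": 300 <= length <= 1500,
        "has_file_ref": bool(FILE_REF.search(first_message)),
    }
```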
2.2 tool usage
| metric | baseline | target | red line / notes |
|---|---|---|---|
| Task tool usage (2-6/thread) | ~35% | >50% | <20% |
| oracle for planning (not rescue) | ~25% | >40% | track early vs late invocation |
| skill invocations | low | increase | especially dig skill |
2.3 verification gates
| metric | baseline | target |
|---|---|---|
| threads with verification | ~40% | >60% |
| build/test run before close | ~50% | >70% |
TIER 3: BEHAVIORAL PATTERNS (monthly tracking)
3.1 anti-pattern frequency
| pattern | current rate | target | detection method |
|---|---|---|---|
| SHORTCUT_TAKING | ~30% of frustrated | <10% | code review signals |
| TEST_WEAKENING | ~20% of frustrated | 0% | assertion removal detection |
| PREMATURE_COMPLETION | common | reduce 50% | “done” before verification |
| NO_DELEGATION | ~40% | <25% | threads with 0 Task calls |
3.2 user-level trends
track per-user monthly:
| metric | purpose |
|---|---|
| resolution rate | individual effectiveness |
| avg turns to resolution | efficiency |
| steering density | collaboration quality |
| handoff rate | task scoping issues |
3.3 temporal patterns
| metric | baseline | monitoring purpose |
|---|---|---|
| 6-9pm resolution rate | 27.5% | avoid critical work |
| weekend delta | +5.2pp | confirm pattern holds |
| msgs/hr distribution | varies | pace optimization |
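one way to compute the first two temporal metrics is sketched below. several things are assumptions here: resolution rate is taken as RESOLVED plus COMMITTED over all threads, a thread is assigned to a time window by its start timestamp in local time, and weekend delta is the weekend rate minus the weekday rate in percentage points. function names are hypothetical.

```python
from datetime import datetime


def resolution_rate(threads):
    """threads: list of (start_timestamp, outcome_status). Returns a % or None."""
    if not threads:
        return None
    resolved = sum(1 for _, status in threads if status in {"RESOLVED", "COMMITTED"})
    return 100 * resolved / len(threads)


def evening_rate(threads):
    """Resolution rate for threads started between 18:00 and 20:59."""
    return resolution_rate([(ts, s) for ts, s in threads if 18 <= ts.hour < 21])


def weekend_delta(threads):
    """Weekend rate minus weekday rate, in percentage points."""
    weekend = resolution_rate([(ts, s) for ts, s in threads if ts.weekday() >= 5])
    weekday = resolution_rate([(ts, s) for ts, s in threads if ts.weekday() < 5])
    if weekend is None or weekday is None:
        return None
    return weekend - weekday
```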
BASELINE VALUES (from 4,656 threads)
outcome distribution (current state)
| status | % | count |
|---|---|---|
| RESOLVED | 59% | 2,745 |
| UNKNOWN | 33% | 1,560 |
| HANDOFF | 1.6% | 75 |
| COMMITTED | 7% | 305 |
| EXPLORATORY | 3% | 124 |
| FRUSTRATED | <1% | 14 |
success thresholds (validated)
| metric | green | yellow | red |
|---|---|---|---|
| turns | 26-50 | 10-25 or 51-100 | <10 or >100 |
| approval:steering | >2:1 | 1-2:1 | <1:1 |
| steering density | <5% | 5-8% | >8% |
| prompt length | 300-1500 | 100-300 or 1500-3000 | <100 or >3000 |
| Task usage | 2-6 | 1 or 7-10 | 0 or 11+ |
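the integer-valued rows of this table translate directly into a lookup, sketched below under a few assumptions. the names are hypothetical, ranges are treated as inclusive, and green is checked first so the shared boundaries (300 and 1500 chars) count as green. the two ratio rows are omitted because their boundaries are open-ended.

```python
def band(value, green, yellow):
    """green and yellow are lists of inclusive (low, high) ranges; anything else is red."""
    if any(lo <= value <= hi for lo, hi in green):
        return "green"
    if any(lo <= value <= hi for lo, hi in yellow):
        return "yellow"
    return "red"


# Ranges copied from the success thresholds table.
THRESHOLDS = {
    "turns": ([(26, 50)], [(10, 25), (51, 100)]),
    "prompt_length": ([(300, 1500)], [(100, 300), (1500, 3000)]),
    "task_usage": ([(2, 6)], [(1, 1), (7, 10)]),
}


def thread_status(metrics):
    """Return a green/yellow/red label for each known metric present in `metrics`."""
    return {
        name: band(metrics[name], *THRESHOLDS[name])
        for name in THRESHOLDS
        if name in metrics
    }
```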
MEASUREMENT CADENCE
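daily: tier 1 critical KPIs (resolution rate, approval:steering ratio, thread length distribution)
weekly: tier 2 quality signals (prompt quality, tool usage, verification gates)
monthly: tier 3 behavioral patterns (anti-pattern frequency, user-level trends, temporal patterns)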
ALERTING THRESHOLDS
| condition | action |
|---|---|
| 2+ FRUSTRATED threads in 24h | root cause analysis |
| user approval:steering <1:1 for 3+ threads | intervention/coaching |
| >50% threads <10 turns for a user | check prompt quality |
| steering→steering transition >40% | systemic issue |
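two of these conditions can be checked mechanically, as in the sketch below. the source does not define the steering→steering transition rate, so the code assumes it is the share of steering messages whose next user message is also steering. function names are hypothetical.

```python
from datetime import datetime, timedelta


def steering_repeat_rate(labels):
    """labels: user-message classifications in chronological order."""
    followers = [nxt for cur, nxt in zip(labels, labels[1:]) if cur == "steering"]
    if not followers:
        return None  # no steering message had a successor
    return followers.count("steering") / len(followers)


def frustrated_alert(closed_at, now, limit=2):
    """True if `limit` or more FRUSTRATED threads closed in the 24h before `now`."""
    cutoff = now - timedelta(hours=24)
    return sum(1 for ts in closed_at if cutoff < ts <= now) >= limit
```

a rate above 0.40 from `steering_repeat_rate` corresponds to the last row of the table.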
weekly review triggers
| condition | review |
|---|---|
| resolution rate drops >10pp | investigate pattern shift |
| new anti-pattern cluster | update catalog |
| Task usage <20% | training opportunity |
DATA COLLECTION REQUIREMENTS
per thread (automatic)
thread_id
user_id
start_timestamp
end_timestamp
turn_count
outcome_status
first_msg_length
file_refs_in_opener
tools_used: { task_count, oracle_count, skill_invocations }
verification_present: bool
per message (automatic)
message_id
thread_id
role: user|assistant
timestamp
char_count
classification: approval|steering|neutral|question
derived (computed)
approval_steering_ratio
steering_density
msgs_per_hour
time_to_resolution
question_density
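the three derived fields not tied to approval or steering can be computed from the per-message records, roughly as below. the `Message` class mirrors a subset of the per-message fields above. the formulas are assumptions: time to resolution is last timestamp minus first, and question density is question-classified user messages over all user messages.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    role: str            # "user" or "assistant"
    timestamp: datetime
    classification: str  # approval | steering | neutral | question


def derived_metrics(messages):
    """messages: one thread's messages in chronological order."""
    if not messages:
        return {}
    user_labels = [m.classification for m in messages if m.role == "user"]
    elapsed = messages[-1].timestamp - messages[0].timestamp
    hours = elapsed.total_seconds() / 3600
    return {
        "time_to_resolution_hours": hours,
        # Undefined for a zero-length thread, so report None.
        "msgs_per_hour": len(messages) / hours if hours > 0 else None,
        "question_density": (
            user_labels.count("question") / len(user_labels) if user_labels else 0.0
        ),
    }
```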
SUCCESS CRITERIA FOR FRAMEWORK
this framework succeeds if:
- FRUSTRATED threads trend to 0 (currently 14/4656)
- resolution rate increases to >60% (currently 51%)
- sweet spot (26-50 turns) threads increase to >30%
- approval:steering ratio team avg >3:1
- anti-pattern recurrence decreases measurably
IMPLEMENTATION PRIORITY
- week 1: instrument basic outcome tracking (status, turns)
- week 2: add message classification (approval/steering)
- week 3: prompt quality signals
- week 4: tool usage tracking
- ongoing: anti-pattern detection refinement
framework derived from analysis of 4,656 amp threads