MEASUREMENT FRAMEWORK

operational KPIs for amp thread quality monitoring

OVERVIEW

this framework defines what to measure, how often to measure it, and the baseline targets, all derived from an analysis of 4,656 threads.

TIER 1: CRITICAL KPIS (daily tracking)

1.1 resolution rate

metric	baseline	target	red line
RESOLVED+COMMITTED %	51%	>60%	<40%
FRUSTRATED %	<1%	<0.5%	>2%

how to measure: classify thread outcome at close. count by status daily.

data source: thread metadata, closing message classification
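
a minimal sketch of the daily count described above, assuming each closed thread carries an `outcome_status` string matching the status names in this document. the function names and the "below target" label are illustrative, not part of any existing tooling.

```python
from collections import Counter

# Statuses counted as successful, per the "RESOLVED+COMMITTED %" metric.
SUCCESS_STATUSES = {"RESOLVED", "COMMITTED"}

def daily_outcome_rates(statuses):
    """Return (resolution_pct, frustrated_pct) for one day's closed threads."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0  # no closed threads that day
    resolved = sum(counts[s] for s in SUCCESS_STATUSES)
    return 100.0 * resolved / total, 100.0 * counts["FRUSTRATED"] / total

def resolution_zone(resolution_pct):
    """Compare a resolution rate with the target (>60%) and red line (<40%)."""
    if resolution_pct > 60:
        return "on target"
    if resolution_pct < 40:
        return "red line"
    return "below target"  # hypothetical label for the 40-60% band
```

for example, `daily_outcome_rates(["RESOLVED", "COMMITTED", "UNKNOWN", "FRUSTRATED"])` gives a 50% resolution rate and a 25% frustrated rate for that day.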

1.2 approval:steering ratio

metric	baseline	target	red line
ratio (team avg)	~2.5:1	>3:1	<1.5:1
steering density	~5%	<5%	>8%

how to measure: count user messages classified as APPROVAL vs STEERING per thread. aggregate weekly by user.

data source: user message classification (imperative detection, correction phrases)
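
a rough sketch of the per-thread computation, assuming user messages are already labeled with the lowercase classes from the data collection section. this document does not define steering density; the sketch assumes it means steering messages divided by all user messages in the thread.

```python
def approval_steering(classifications):
    """Return (ratio, density) for one thread's user-message labels.

    classifications: list of strings such as "approval", "steering",
    "neutral", "question".
    """
    approvals = classifications.count("approval")
    steerings = classifications.count("steering")
    # The ratio is undefined when there are no steering messages;
    # return None instead of infinity so callers must handle it.
    ratio = approvals / steerings if steerings else None
    # Assumption: density = steering messages / all user messages.
    density = steerings / len(classifications) if classifications else 0.0
    return ratio, density
```

returning `None` for threads with no steering keeps them out of averages by default; whether they should count as "perfect" threads is a policy choice left open here.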

1.3 thread length distribution

zone	current %	target %	action if violated
<10 turns	~15%	<10%	flag as abandoned
26-50 turns (sweet spot)	~20%	>30%	optimize toward
>100 turns	~8%	<5%	mandatory handoff

how to measure: count turns per thread at close. bucket into zones.
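
the bucketing step might look like the sketch below. the 10-25 and 51-100 bands are taken from the success thresholds table later in this document; the function names are illustrative.

```python
from collections import Counter

def turn_zone(turn_count):
    """Bucket a closed thread by its turn count."""
    if turn_count < 10:
        return "<10"
    if turn_count <= 25:
        return "10-25"
    if turn_count <= 50:
        return "26-50"  # sweet spot
    if turn_count <= 100:
        return "51-100"
    return ">100"

def zone_distribution(turn_counts):
    """Percentage of threads per zone; zones with no threads are omitted."""
    total = len(turn_counts)
    if total == 0:
        return {}
    counts = Counter(turn_zone(t) for t in turn_counts)
    return {zone: 100.0 * n / total for zone, n in counts.items()}
```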

TIER 2: QUALITY SIGNALS (weekly tracking)

2.1 prompt quality

signal	baseline	target	measurement
opener 300-1500 chars	~40%	>60%	first user message length
file refs in opener	~25%	>40%	`@` or file path in first msg
interrogative/descriptive style	~50%	>65%	sentence structure classification
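
the first two signals can be approximated from the opener text alone. the sketch below is a crude heuristic: the `FILE_REF` pattern is an assumption about what counts as a file reference (an `@` mention, or a token ending in a 2-5 letter extension), so it will miss one-letter extensions like `.c` and will also match things like domain names. sentence-style classification is not attempted here.

```python
import re

# Assumption: a file reference is an @-mention or a path-like token
# with a 2-5 letter extension. The real detector may differ.
FILE_REF = re.compile(r"@\S+|\b[\w./-]+\.[A-Za-z]{2,5}\b")

def opener_signals(first_message):
    """Compute length and file-reference signals for a thread's first user message."""
    length = len(first_message)
    return {
        "length_in_range": 300 <= length <= 1500,
        "has_file_ref": bool(FILE_REF.search(first_message)),
    }
```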

2.2 tool usage health

metric	baseline	target	red line / note
Task tool usage (2-6/thread)	~35%	>50%	<20%
oracle for planning (not rescue)	~25%	>40%	track early vs late invocation
skill invocations	low	increase	especially `dig` skill

2.3 verification gates

metric	baseline	target
threads with verification	~40%	>60%
build/test run before close	~50%	>70%

TIER 3: BEHAVIORAL PATTERNS (monthly tracking)

3.1 anti-pattern frequency

pattern	current rate	target	detection method
SHORTCUT_TAKING	~30% of frustrated threads	<10%	code review signals
TEST_WEAKENING	~20% of frustrated threads	0%	assertion removal detection
PREMATURE_COMPLETION	common	reduce 50%	“done” claimed before verification
NO_DELEGATION	~40%	<25%	threads with 0 Task calls

3.2 user-level trends

track per-user monthly:

metric	purpose
resolution rate	individual effectiveness
avg turns to resolution	efficiency
steering density	collaboration quality
handoff rate	task scoping issues

3.3 temporal patterns

metric	baseline	monitoring purpose
6-9pm resolution rate	27.5%	avoid scheduling critical work in this window
weekend delta	+5.2pp	confirm pattern holds
msgs/hr distribution	varies	pace optimization

BASELINE VALUES (from 4,656 threads)

outcome distribution (current state)

status	%	count
RESOLVED	59%	2,745
UNKNOWN	33%	1,560
COMMITTED	7%	305
EXPLORATORY	3%	124
HANDOFF	1.6%	75
FRUSTRATED	<1%	14

success thresholds (validated)

metric	green	yellow	red
turns	26-50	10-25 or 51-100	<10 or >100
approval:steering	>2:1	1-2:1	<1:1
steering density	<5%	5-8%	>8%
prompt length	300-1500	100-300 or 1500-3000	<100 or >3000
Task usage	2-6	1 or 7-10	0 or 11+
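
one way to apply the count-based rows of this table is sketched below. the table lists some boundary values in two bands (for example, a 300-char prompt is both green and yellow); the sketch checks green first, which is an assumption rather than a documented rule. the ratio and density rows use open-ended comparisons and are left out.

```python
def band(value, green, yellow_envelope):
    """Return 'green', 'yellow', or 'red' for inclusive (low, high) ranges.

    yellow_envelope spans both yellow ranges plus the green range between
    them; green is checked first, so yellow means "outside green, inside
    the envelope".
    """
    if green[0] <= value <= green[1]:
        return "green"
    if yellow_envelope[0] <= value <= yellow_envelope[1]:
        return "yellow"
    return "red"

# Ranges transcribed from the success thresholds table.
THRESHOLDS = {
    "turns": ((26, 50), (10, 100)),
    "prompt_length": ((300, 1500), (100, 3000)),
    "task_usage": ((2, 6), (1, 10)),
}

def classify_thread(turns, prompt_length, task_usage):
    """Classify one thread against each count-based threshold."""
    values = {"turns": turns, "prompt_length": prompt_length, "task_usage": task_usage}
    return {name: band(values[name], *THRESHOLDS[name]) for name in THRESHOLDS}
```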

MEASUREMENT CADENCE

daily

resolution rate (RESOLVED + COMMITTED)
frustrated thread count (immediate investigation if >0)
new thread count

weekly

approval:steering ratio by user
thread length distribution
prompt quality signals aggregate
tool usage patterns

monthly

anti-pattern audit (sample 10% of non-resolved threads)
user trend analysis
temporal pattern review
framework recalibration against new data
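
for the monthly anti-pattern audit, a reproducible 10% sample could be drawn as sketched below. this assumes threads are dicts with an `outcome_status` key and that "non-resolved" means any status other than RESOLVED; whether COMMITTED threads should also be excluded is not stated in this document.

```python
import random

def audit_sample(threads, percent=10, seed=0):
    """Draw a reproducible sample of non-resolved threads for manual audit."""
    # Assumption: "non-resolved" = any status other than RESOLVED.
    candidates = [t for t in threads if t["outcome_status"] != "RESOLVED"]
    if not candidates:
        return []
    # Integer ceiling of percent% of the candidates, with a minimum of one.
    k = max(1, (len(candidates) * percent + 99) // 100)
    # A seeded generator makes the same month's sample repeatable.
    return random.Random(seed).sample(candidates, k)
```

fixing the seed per audit period lets two reviewers pull the same sample independently.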

ALERTING THRESHOLDS

immediate action required

condition	action
2+ FRUSTRATED threads in 24h	root cause analysis
user approval:steering <1:1 for 3+ threads	intervention/coaching
>50% of a user's threads under 10 turns	check prompt quality
steering→steering transition rate >40%	systemic issue
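
two of these conditions can be checked directly from message labels, as sketched below. both definitions are assumptions: the transition rate is taken to be the share of steering messages that are immediately followed by another steering message, and "3+ threads" is read as any three threads in the review window rather than three in a row.

```python
def steering_transition_rate(labels):
    """Share of steering messages immediately followed by another steering message.

    labels: ordered user-message classifications for one thread.
    """
    followers = [nxt for cur, nxt in zip(labels, labels[1:]) if cur == "steering"]
    if not followers:
        return 0.0
    return followers.count("steering") / len(followers)

def needs_coaching(thread_ratios, min_threads=3):
    """True if at least min_threads threads have approval:steering below 1:1.

    Threads with an undefined ratio (None) are ignored.
    """
    low = [r for r in thread_ratios if r is not None and r < 1.0]
    return len(low) >= min_threads
```

a thread would trip the systemic-issue condition when `steering_transition_rate(labels) > 0.40`.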

weekly review triggers

condition	review
resolution rate drops >10pp	investigate pattern shift
new anti-pattern cluster	update catalog
Task usage <20%	training opportunity

DATA COLLECTION REQUIREMENTS

per thread (automatic)

thread_id
user_id
start_timestamp
end_timestamp
turn_count
outcome_status
first_msg_length
file_refs_in_opener
tools_used: { task_count, oracle_count, skill_invocations }
verification_present: bool

per message (automatic)

message_id
thread_id
role: user|assistant
timestamp
char_count
classification: approval|steering|neutral|question

derived (computed)

approval_steering_ratio
steering_density
msgs_per_hour
time_to_resolution
question_density
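
as an illustration, the thread record and a few of the derived fields could be expressed as below. only a subset of the per-thread fields is shown. the formulas are assumptions, since this document names the derived metrics without defining them: time to resolution is taken as wall-clock hours from start to end, and question density as the share of user messages classified as `question`.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ThreadRecord:
    thread_id: str
    user_id: str
    start_timestamp: datetime
    end_timestamp: datetime
    turn_count: int
    outcome_status: str

def time_to_resolution_hours(thread):
    """Wall-clock hours between thread start and end."""
    delta = thread.end_timestamp - thread.start_timestamp
    return delta.total_seconds() / 3600

def msgs_per_hour(thread, message_count):
    """Messages per wall-clock hour; None for zero-length threads."""
    hours = time_to_resolution_hours(thread)
    return message_count / hours if hours > 0 else None

def question_density(classifications):
    """Assumed definition: share of user messages classified as 'question'."""
    if not classifications:
        return 0.0
    return classifications.count("question") / len(classifications)
```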

SUCCESS CRITERIA FOR FRAMEWORK

this framework succeeds if:

FRUSTRATED threads trend to 0 (currently 14/4656)
resolution rate increases to >60% (currently 51%)
sweet-spot threads (26-50 turns) increase to >30% (currently ~20%)
team-average approval:steering ratio exceeds 3:1 (currently ~2.5:1)
anti-pattern recurrence decreases measurably

IMPLEMENTATION PRIORITY

week 1: instrument basic outcome tracking (status, turns)
week 2: add message classification (approval/steering)
week 3: prompt quality signals
week 4: tool usage tracking
ongoing: anti-pattern detection refinement

framework derived from analysis of 4,656 amp threads