ULTIMATE SYNTHESIS: amp thread analysis
the ONE document. 4,656 threads. 208,799 messages. 20 users. 9 months. 48 insight files distilled.
POWER RANKINGS: findings by impact
| rank | finding | effect size | source |
|---|---|---|---|
| 1 | file references in opener (@path) | +25pp success (66.7% vs 41.8%) | first-message-patterns |
| 2 | approval:steering ratio > 2:1 | 4x success vs <1:1 | thread-flow, conversation-dynamics |
| 3 | 26-50 turns sweet spot | 75% success vs 14% for <10 turns | length-analysis |
| 4 | steering = engagement, not failure | 60% resolution steered vs 37% unsteered | MEGA-SYNTHESIS |
| 5 | confirm before action | 47% of steerings are "no…", 17% are "wait…" | steering-deep-dive |
TIER 2: HIGH IMPACT (adopt this week)
| rank | finding | effect size | source |
|---|---|---|---|
| 6 | 300-1500 char prompts optimal | lowest steering (.20-.21) | message-brevity |
| 7 | terse + high questions = best | 60% resolution for this style | user-comparison |
| 8 | oracle early, not late | 46% frustrated threads use oracle vs 25% resolved | oracle-timing |
| 9 | 2-6 Task spawns optimal | 78.6% success at 4-6 tasks | task-delegation |
| 10 | test context = 2.15x resolution | 56.7% vs 26.3% | testing-patterns |
TIER 3: MODERATE IMPACT (adopt this month)
| rank | finding | effect size | source |
|---|---|---|---|
| 11 | multi-file threads outperform | 72% vs 47% for single-file | multi-file-edits |
| 12 | weekend premium | +5.2pp resolution (48.9% vs 43.7%) | weekend-analysis |
| 13 | late night/early morning best | 60% resolution 2-5am vs 27.5% 6-9pm | time-analysis |
| 14 | interrogative style wins | 69.3% success rate | prompting-styles |
| 15 | commit/push imperatives | 89.2% resolution | imperative-analysis |
TIER 4: NUANCED (context-dependent)
| rank | finding | effect size | source |
|---|---|---|---|
| 16 | low question density = higher resolution | 76% for <5% questions | question-analysis |
| 17 | learning is real | 66% reduction in turn count over 8 months (@verbose_explorer) | learning-curves |
| 18 | refactoring succeeds 3x more than migration | 63.3% vs 20.7% | refactoring-patterns |
| 19 | 87% steering recovery rate | only 9.4% cascade to another steering | conversation-dynamics |
| 20 | collaborative openers (“we”, “let’s”) = longest threads | 249 avg messages | opening-words |
FRUSTRATION PREDICTION: early warning system
the doom spiral sequence
STAGE 0: agent takes shortcut (invisible)
↓
STAGE 1: "no" / "wait" / "actually" (50% recovery)
↓
STAGE 2: consecutive steerings (40% recovery)
↓
STAGE 3: "wtf" / "fucking" / ALL CAPS (20% recovery)
↓
STAGE 4: "NOOOOOOOO" / profanity explosion (<10% recovery)
quantitative intervention thresholds
| metric | yellow | red |
|---|---|---|
| approval:steering ratio | < 2:1 | < 1:1 |
| consecutive steerings | 2 | 3+ |
| turns without approval | 15 | 25 |
| steering density | > 5% | > 8% |
risk = (steering_count × 2)
+ (consecutive_steerings × 3)
+ (simplification_detected × 4)
+ (test_weakening_detected × 5)
- (approval_count × 2)
- (file_reference_in_opener × 3)
thresholds:
>= 3: suggest rephrasing approach
>= 6: suggest oracle or spawn
>= 10: offer handoff to fresh thread
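a minimal python sketch of the formula and thresholds above; the function name and argument names are hypothetical, chosen to mirror the formula's terms:

```python
def thread_risk(steering_count: int, consecutive_steerings: int,
                simplification_detected: bool, test_weakening_detected: bool,
                approval_count: int, file_reference_in_opener: bool) -> tuple[int, str]:
    """Score doom-spiral risk per the weighted formula, then map to an intervention."""
    score = (steering_count * 2
             + consecutive_steerings * 3
             + int(simplification_detected) * 4
             + int(test_weakening_detected) * 5
             - approval_count * 2
             - int(file_reference_in_opener) * 3)
    if score >= 10:
        return score, "offer handoff to fresh thread"
    if score >= 6:
        return score, "suggest oracle or spawn"
    if score >= 3:
        return score, "suggest rephrasing approach"
    return score, "continue"
```

note how the protective terms work: two steerings with a file-referenced opener and a few approvals stays below every threshold, while stacked steerings plus a detected simplification escape jumps straight past 10.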
USER ARCHETYPES & CHEAT SHEETS
@concise_commander: the marathon debugger
- 1,219 threads | 86.5 avg turns | 60.5% resolution
- terse (263 chars) | 23% questions | high steering (0.81)
- domain: storage engine, performance, SIMD
what works: socratic questioning (“OK, what’s next?”), marathon persistence, explicit approvals
what triggers steering: premature action, forgetting flags (-run=xxx), full test suites
phrases: “wait”, “dont”, “NO FUCKING SHORTCUTS”
@steady_navigator: the efficient executor
- 1,171 threads | 36.5 avg turns | 67% resolution
- moderate (547 chars) | 43% questions | LOW steering (0.10)
- domain: observability, frontend, ai tooling
what works: polite structured prompts, post-hoc corrections, screenshot-driven
what triggers steering: rarely (2.6% rate); uses post-hoc rejection rather than interrupts
phrases: “please look at”, “almost there”, “see screenshot”
@verbose_explorer: the spawn orchestrator
- 875 threads | 39.1 avg turns | 83% resolution (corrected)
- verbose (932 chars) | 26% questions | moderate steering (0.28)
- domain: devtools, personal projects, skills
- spawned 231 subagents with 97.8% success rate
what works: effective spawn orchestration, long threads (78% resolution at 100+ turns), steering questions as opener
what hurts: evening sessions (lower resolution 19:00-22:00)
note: prior analysis miscounted spawned subagent threads as handoffs, inflating “handoff rate” to 30% and deflating resolution to 33.8%
@precision_pilot: the architect
- 90 threads | 72.9 avg turns | 82.2% resolution
- VERY verbose (2,037 chars) | 34% questions
- domain: streaming, sessions, architecture
what works: plan-oriented prompts, cross-references, multi-thread orchestration
@patient_pathfinder: the infrastructure operator
- 150 threads | 20.3 avg turns | 54% resolution
- concise (293 chars) | 7% questions (most directive)
- domain: kubernetes, prometheus, infrastructure
what works: work hours only (07-17), precise specs, minimal back-and-forth
@feature_lead: the feature spec writer
- 146 threads | 20.7 avg turns | 26% resolution
- detailed (780 chars) | 11% questions | 45% handoff rate
- domain: search_modal, analytics_service, observability features
what works: spec-and-delegate pattern, external code review integration
AGENTS.MD: COPY-PASTE READY
section 1: confirmation gates
## before taking action
confirm with user before:
- running tests/benchmarks (especially with flags like `-run=xxx`, `-bench=xxx`)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests
ASK: "ready to run the tests?" rather than "running the tests now..."
### flag memory
remember user-specified flags across the thread:
- benchmark flags: `-run=xxx`, `-bench=xxx`, `-benchstat`
- test filters: specific test names, package paths
- git conventions: avoid `git add -A`, use explicit file lists
when running similar commands, preserve flags from previous invocations.
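one way to sketch flag memory in python; `remember_flags`, `with_remembered_flags`, and the first-word keying scheme are illustrative assumptions, not an actual amp API:

```python
import shlex

# remembered flags for the current thread, keyed by base command (hypothetical store)
_flag_memory: dict[str, list[str]] = {}

def remember_flags(command: str) -> None:
    """Record flag tokens (anything starting with '-') keyed by the command's first word."""
    tokens = shlex.split(command)
    flags = [t for t in tokens[1:] if t.startswith("-")]
    if flags:
        _flag_memory[tokens[0]] = flags

def with_remembered_flags(command: str) -> str:
    """If a later invocation of the same base command omits flags, re-append the remembered ones."""
    tokens = shlex.split(command)
    if tokens[0] in _flag_memory and not any(t.startswith("-") for t in tokens[1:]):
        return " ".join(tokens + _flag_memory[tokens[0]])
    return command
```

so after the user runs `go test -run=TestCompaction ./storage`, a bare `go test ./storage` later in the thread gets the `-run` filter re-applied instead of silently running the full suite.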
section 2: steering recovery
## after receiving steering
1. acknowledge the correction explicitly
2. do NOT repeat the corrected behavior
3. if pattern recurs (2+ steerings for same issue), ask user for explicit preference
4. track common corrections for this user
### recovery expectations
- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
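the pause rule above reduces to a single pass over message labels; the label strings are assumptions for illustration:

```python
def should_pause(labels: list[str]) -> bool:
    """Return True once 2+ consecutive 'steering' labels appear.

    Any non-steering message (approval or otherwise) breaks the streak,
    matching the STEERING -> APPROVAL recovery pattern.
    """
    streak = 0
    for label in labels:
        streak = streak + 1 if label == "steering" else 0
        if streak >= 2:
            return True
    return False
```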
section 3: thread health monitoring
## thread health indicators
### healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel work
- consistent approval distribution across phases
### warning signals
- ratio drops below 1:1 — intervention needed
- 100+ turns without resolution — marathon risk
- 2+ consecutive steerings — doom spiral forming
- user messages getting longer — frustration signal
### action when unhealthy
1. pause and summarize current state
2. ask if approach should change
3. offer to spawn fresh thread with lessons learned
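the warning checks above can be sketched as one function; the thresholds come straight from the list, the signature is hypothetical:

```python
def warning_signals(approvals: int, steerings: int,
                    consecutive_steerings: int, turns: int,
                    resolved: bool) -> list[str]:
    """Return which thread-health warning signals currently apply."""
    signals = []
    # no steerings yet means the ratio cannot be unhealthy
    ratio = approvals / steerings if steerings else float("inf")
    if ratio < 1.0:
        signals.append("ratio below 1:1, intervention needed")
    if turns >= 100 and not resolved:
        signals.append("100+ turns without resolution, marathon risk")
    if consecutive_steerings >= 2:
        signals.append("2+ consecutive steerings, doom spiral forming")
    return signals
```

any non-empty return triggers the pause/summarize/offer-handoff sequence above.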
section 4: oracle usage
## oracle usage
### DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation
### DON'T use oracle as
- last resort when stuck (too late—46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns
### oracle timing
integrate EARLY (planning phase), not LATE (rescue phase). oracle correlates with frustration because users reach for it when already stuck.
section 5: optimal patterns
## optimal thread patterns
### success predictors
| metric | target | red flag |
|--------|--------|----------|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 without resolution |
| question density | <5% | >15% |
| steering recovery | next msg not steering | consecutive steerings |
| opening message | file refs, 300-1500 chars | no refs, <100 or >2000 |
### thread lifecycle (healthy flow)
1. scope definition (1-3 turns) — include file references
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
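the opening-message targets from the predictor table can be linted mechanically; the regex for @path references is an assumption about how file refs look in practice:

```python
import re

def lint_opener(message: str) -> list[str]:
    """Flag an opening message that misses the success-predictor targets."""
    issues = []
    if not re.search(r"@[\w./-]+", message):
        issues.append("no @path file reference (+25pp success when present)")
    n = len(message)
    if n < 300:
        issues.append(f"{n} chars; 300-1500 is the low-steering zone")
    elif n > 1500:
        issues.append(f"{n} chars; >1500 correlates with more steering")
    return issues
```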
section 6: anti-patterns
## anti-patterns to avoid
### premature action
acting before user confirms intent. triggers "wait..." interrupts (17% of all steerings).
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ choosing file locations without asking
### scope creep
making changes beyond what user asked.
❌ running full test suite instead of targeted tests
❌ adding unwanted abstractions
❌ changing preserved behavior ("WTF. Keep using FillVector!")
❌ refactoring working code while fixing unrelated issue
### test weakening
removing/weakening assertions to make tests pass instead of fixing underlying bugs.
❌ "the agent is drunk and keeps trying to 'fix' the failing test by removing the failing assertion"
### simplification escape
when implementation gets hard, agent "simplifies" requirements instead of solving.
❌ "NOOOOOOOOOOOO. DON'T SIMPLIFY"
❌ creating new files instead of editing existing structure
❌ pivoting to easier approach when stuck
### context overload
>1500 char opening messages paradoxically cause MORE steering and longer threads than 300-700 char messages.
section 7: delegation patterns
## delegation patterns
### when to delegate (Task tool)
- discrete, scoped transformations ("fix X in file Y")
- parallelizable independent changes (2-6 concurrent tasks)
- repetitive operations across multiple files
- clear success criteria available
### when NOT to delegate
- debugging complex emergent behavior
- exploration/research needing context accumulation
- tasks requiring back-and-forth with user
- work where main thread has critical context subagents lack
### healthy delegation signals
- specific imperative verbs: fix, implement, update, add, convert
- file paths or component names in task description
- clear success criteria ("done" defined)
- proactive timing: during neutral phases, not after corrections
### unhealthy delegation
- spawning Task as escape hatch when confused (61.5% frustrated vs 40.5% resolved)
- delegating without clear spec
- spawning multiple concurrent tasks touching same files
- over-fragmentation (>5 spawn depth)
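a rough python lint for Task specs against the healthy signals above; the heuristics (first-word verb, path-like token, 'done'/'success' keyword) are crude assumptions, not a validated classifier:

```python
IMPERATIVE_VERBS = {"fix", "implement", "update", "add", "convert"}

def lint_task_spec(spec: str) -> list[str]:
    """Check a Task description for the healthy-delegation signals."""
    issues = []
    words = spec.split()
    if not words or words[0].lower() not in IMPERATIVE_VERBS:
        issues.append("lead with an imperative verb: fix/implement/update/add/convert")
    if not any("/" in w or w.startswith("@") for w in words):
        issues.append("name a file path or component")
    if "done" not in spec.lower() and "success" not in spec.lower():
        issues.append("define success criteria")
    return issues
```

a spec like "fix the retry loop in src/net/retry.go; done when TestRetry passes" clears all three checks; "maybe look at things" fails every one.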
section 8: user-specific preferences (learned)
## user-specific patterns
### @concise_commander
- terse commands, high question rate (23%)
- 20% "wait" interrupts — confirm before EVERY action
- benchmark-heavy — ALWAYS remember `-run=xxx` flags
- marathon debugging sessions (50+ turns) are intentional workflow
- phrases: "DO NOT change it", "fix the tests", "commit"
### @steady_navigator
- 1% "wait" interrupts — more tolerant of autonomous action
- polite structured prompts ("please look at")
- screenshot-driven, iterative visual refinement
- explicit file paths expected
- post-hoc correction style vs interrupt
### @verbose_explorer
- verbose context frontloading (932 chars avg)
- meta-work focus: skills, tooling, infrastructure
- **power spawn user** — 231 subagents at 97.8% success
- cares about thread organization, spawning
- evening sessions underperform — steer toward afternoon work
- phrases: "search my amp threads", "ship it"
### @patient_pathfinder
- most directive (7% question ratio)
- concise task-focused prompts (293 chars)
- work hours only (07-17)
- low steering via precise specs
### @precision_pilot
- most verbose (2,037 chars avg)
- plan-oriented, architecture-first
- cross-references extensively
- streaming/session state specialist
ACTIONABLE CHECKLIST
for USERS
for AGENTS (AGENTS.md rules)
METRICS DASHBOARD
real-time thread health
┌─────────────────────────────────────────────────────────────────┐
│ THREAD HEALTH INDICATORS │
├──────────────────┼──────────────────────────────────────────────┤
│ approval:steering│ ████████████████████░░░░ 3.2:1 ✓ healthy │
│ turn count │ ██████████░░░░░░░░░░░░░░ 42 ✓ good zone │
│ consecutive steer│ ░░░░░░░░░░░░░░░░░░░░░░░░ 0 ✓ clean │
│ last approval │ ░░░░░░░░░░░░░░░░░░░░░░░░ 3 turns ago │
│ file refs opener │ ██████████████████████████ present ✓ │
└─────────────────────────────────────────────────────────────────┘
target metrics
| metric | target | caution | danger |
|---|---|---|---|
| approval:steering ratio | >2:1 | 1-2:1 | <1:1 |
| steering rate per thread | <5% | 5-8% | >8% |
| recovery rate (next msg not steering) | >85% | 70-85% | <70% |
| consecutive steerings | 0-1 | 2 | 3+ |
| thread spawn depth | 2-3 | 4-5 | >5 |
| opening message file refs | present | — | absent |
| opening message length | 300-1500 | 100-300, 1500-2000 | <100 or >2000 |
| question density | <5% | 5-15% | >15% |
time-of-day resolution
| time block | resolution % | recommendation |
|---|---|---|
| 2-5am | 60.4% | best outcomes—deep focus |
| 6-9am | 59.6% | second best—fresh intent |
| 10-1pm | 48.0% | decent |
| 2-5pm | 43.2% | declining |
| 6-9pm | 27.5% | AVOID for important work |
| 10pm-1am | 47.1% | varies by user |
per-user summary
| user | threads | resolution | steering | archetype |
|---|---|---|---|---|
| @concise_commander | 1,219 | 60.5% | 0.81 | marathon debugger |
| @steady_navigator | 1,171 | 67.0% | 0.10 | efficient executor |
| @verbose_explorer | 875 | 83% | 0.28 | spawn orchestrator |
| @precision_pilot | 90 | 82.2% | 0.41 | architect |
| @patient_pathfinder | 150 | 54.0% | 0.20 | operator |
outcome distribution
RESOLVED ████████████████████████████████ 59.0% (2,745)
UNKNOWN ████████████████████████ 32.6% (1,517)
COMMITTED ████ 3.8% (175)
EXPLORATORY ███ 2.7% (125)
HANDOFF ██ 1.6% (75)
FRUSTRATED ░ 0.2% (10)
corrected 2026-01-09: spawned subagent threads previously miscounted as HANDOFF
DOMAIN EXPERTISE ROUTING
based on vocabulary fingerprinting and outcome rates:
| domain | primary owner | secondary | success rate |
|---|---|---|---|
| storage engine (query_engine, storage_optimizer) | @concise_commander | — | 84% |
| data visualization (canvas, chart) | @concise_commander | @steady_navigator | 85% |
| observability/otel | @steady_navigator | @concise_commander | 68% |
| build tooling (vite, pnpm) | @steady_navigator | — | 63% |
| ai/agent tooling | @steady_navigator | @verbose_explorer | 68% |
| devtools/amp skills | @verbose_explorer | — | varies |
| minecraft/fabric modding | @verbose_explorer | — | personal |
| infrastructure (k8s, prometheus) | @patient_pathfinder | — | 63% |
| streaming/sessions | @precision_pilot | — | 82% |
| search_modal/analytics_service features | @feature_lead | — | 45% handoff |
FAILURE ARCHETYPES (what kills threads)
| archetype | frequency | trigger | fix |
|---|---|---|---|
| PREMATURE_COMPLETION | common | declaring done without verification | always run tests before claiming complete |
| OVER_ENGINEERING | common | adding unnecessary abstractions | question every exposed prop/method |
| SIMPLIFICATION_ESCAPE | common | reducing requirements when stuck | persist with debugging, not scope reduction |
| TEST_WEAKENING | moderate | removing assertions instead of fixing bugs | NEVER modify expected values without fixing impl |
| HACKING_AROUND_PROBLEM | moderate | fragile patches not proper fixes | read docs, understand root cause |
| IGNORING_CODEBASE_PATTERNS | moderate | not reading reference implementations | Read files user provides FIRST |
| NO_DELEGATION | moderate | not spawning subtasks | use Task for clearly scoped parallel work |
| NOT_READING_DOCS | moderate | unfamiliar library usage without docs | web_search for library docs before implementing |
STEERING TAXONOMY
| pattern | % of steerings | meaning | response |
|---|---|---|---|
| "No…" | 47% | flat rejection | acknowledge, reverse course |
| "Wait…" | 17% | premature action | confirm before continuing |
| "Don't…" | 8% | explicit prohibition | add to user prefs |
| "Actually…" | 3% | course correction | acknowledge, adjust |
| "Stop…" | 2% | halt current action | immediate pause |
| "Undo…" | 1% | revert changes | revert, ask what to preserve |
| "WTF…" | 1% | frustration signal | PAUSE, meta-acknowledge, realign |
RESEARCH ALIGNMENT
findings from web research confirm patterns observed in data:
| amp finding | research confirmation |
|---|---|
| steering correlates with success | iterative patterns > linear copy-paste (Ouyang et al. 2024) |
| terse + questions > verbose dumps | structured short prompts often outperform verbose (Gupta 2024) |
| approval:steering ratio predicts outcomes | positive feedback loops = iterative prompting cycles |
| user archetypes show consistent patterns | big five personality maps to interaction styles |
WHAT WE’RE CONFIDENT ABOUT
- structural patterns (turn counts, ratios) are statistically robust across 4,656 threads
- user archetype patterns are consistent within users across time
- steering taxonomy is empirically grounded (47% “no”, 17% “wait”)
- file reference effect (+25%) is the strongest single predictor
- 26-50 turns = sweet spot for resolution
WHAT’S STILL HUNCH
- causal direction between oracle usage and frustration
- whether terse style CAUSES success or reflects expertise
- optimal confirmation frequency (too much also annoys users)
- whether midnight/weekend effects are time or user composition
- learning curve transferability between domains
QUICK REFERENCE CARD
┌─────────────────────────────────────────────────────────────────┐
│ AMP THREAD SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ file references (@path) → +25% success │
│ ✓ 300-1500 char prompts → lowest steering │
│ ✓ 26-50 turns → 75% success rate │
│ ✓ approval:steering >2:1 → healthy thread │
│ ✓ "ship it" / "commit" → explicit checkpoints │
│ ✓ oracle at planning, not rescue │
│ ✓ 2-6 spawned tasks → optimal delegation │
├─────────────────────────────────────────────────────────────────┤
│ ✗ <10 turns → 14% success (abandoned) │
│ ✗ >100 turns → frustration risk increases │
│ ✗ ratio <1:1 → doom spiral, pause and realign │
│ ✗ 2+ consecutive steerings → fundamental misalignment │
│ ✗ oracle as last resort → too late, use for planning │
│ ✗ >1500 char opener → paradoxically MORE problems │
│ ✗ evening work (6-9pm) → 27.5% resolution (worst) │
├─────────────────────────────────────────────────────────────────┤
│ BEST TIMES: 2-5am (60%), 6-9am (59%), weekends (+5pp) │
│ WORST TIME: 6-9pm (27%) — avoid for critical work │
├─────────────────────────────────────────────────────────────────┤
│ STEERING TAXONOMY │
│ 47% "no..." (rejection) | 17% "wait..." (premature action) │
│ 8% "don't..." | 3% "actually..." | 2% "stop..." │
├─────────────────────────────────────────────────────────────────┤
│ RECOVERY: 87% of steerings don't cascade │
│ DOOM LOOP: 2+ consecutive steerings = stop and ask │
└─────────────────────────────────────────────────────────────────┘
synthesized by don_nibbleward from 48 insight files | 2026-01-09
corpus: 4,656 threads | 208,799 messages | 20 users | may 2025 – jan 2026