ULTIMATE SYNTHESIS: amp thread analysis
the ONE document. 4,656 threads. 208,799 messages. 20 users. 9 months. 48 insight files distilled.
POWER RANKINGS: findings by impact
| rank | finding | effect size | source |
|---|---|---|---|
| 1 | file references in opener (@path) | +25pp success (66.7% vs 41.8%) | first-message-patterns |
| 2 | approval:steering ratio > 2:1 | 4x success vs <1:1 | thread-flow, conversation-dynamics |
| 3 | 26-50 turns sweet spot | 75% success vs 14% for <10 turns | length-analysis |
| 4 | steering = engagement, not failure | 60% resolution steered vs 37% unsteered | MEGA-SYNTHESIS |
| 5 | confirm before action | 47% of steerings are "no…", 17% are "wait…" | steering-deep-dive |
TIER 2: HIGH IMPACT (adopt this week)
| rank | finding | effect size | source |
|---|---|---|---|
| 6 | 300-1500 char prompts optimal | lowest steering (.20-.21) | message-brevity |
| 7 | terse + high questions = best | 60% resolution for this style | user-comparison |
| 8 | oracle early, not late | 46% frustrated threads use oracle vs 25% resolved | oracle-timing |
| 9 | 2-6 Task spawns optimal | 78.6% success at 4-6 tasks | task-delegation |
| 10 | test context = 2.15x resolution | 56.7% vs 26.3% | testing-patterns |
TIER 3: MODERATE IMPACT (adopt this month)
| rank | finding | effect size | source |
|---|---|---|---|
| 11 | multi-file threads outperform | 72% vs 47% for single-file | multi-file-edits |
| 12 | weekend premium | +5.2pp resolution (48.9% vs 43.7%) | weekend-analysis |
| 13 | late night/early morning best | 60% resolution 2-5am vs 27.5% 6-9pm | time-analysis |
| 14 | interrogative style wins | 69.3% success rate | prompting-styles |
| 15 | commit/push imperatives | 89.2% resolution | imperative-analysis |
TIER 4: NUANCED (context-dependent)
| rank | finding | effect size | source |
|---|---|---|---|
| 16 | low question density = higher resolution | 76% for <5% questions | question-analysis |
| 17 | learning is real | 66% reduction in turn count over 8 months (@verbose_explorer) | learning-curves |
| 18 | refactoring succeeds 3x more than migration | 63.3% vs 20.7% | refactoring-patterns |
| 19 | 87% steering recovery rate | only 9.4% cascade to another steering | conversation-dynamics |
| 20 | collaborative openers (“we”, “let’s”) = longest threads | 249 avg messages | opening-words |
FRUSTRATION PREDICTION: early warning system
the doom spiral sequence
STAGE 0: agent takes shortcut (invisible)
↓
STAGE 1: "no" / "wait" / "actually" (50% recovery)
↓
STAGE 2: consecutive steerings (40% recovery)
↓
STAGE 3: "wtf" / "fucking" / ALL CAPS (20% recovery)
↓
STAGE 4: "NOOOOOOOO" / profanity explosion (<10% recovery)
quantitative intervention thresholds
| metric | yellow | red |
|---|---|---|
| approval:steering ratio | < 2:1 | < 1:1 |
| consecutive steerings | 2 | 3+ |
| turns without approval | 15 | 25 |
| steering density | > 5% | > 8% |
risk = (steering_count × 2)
+ (consecutive_steerings × 3)
+ (simplification_detected × 4)
+ (test_weakening_detected × 5)
- (approval_count × 2)
- (file_reference_in_opener × 3)
thresholds:
>= 3: suggest rephrasing approach
>= 6: suggest oracle or spawn
>= 10: offer handoff to fresh thread
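a minimal python sketch of the formula and thresholds above; the function name and argument names are hypothetical, chosen to mirror the formula's terms:

```python
def thread_risk(steering_count: int, consecutive_steerings: int,
                simplification_detected: bool, test_weakening_detected: bool,
                approval_count: int, file_reference_in_opener: bool) -> tuple[int, str]:
    """Score doom-spiral risk per the weighted formula, then map to an intervention."""
    score = (steering_count * 2
             + consecutive_steerings * 3
             + int(simplification_detected) * 4
             + int(test_weakening_detected) * 5
             - approval_count * 2
             - int(file_reference_in_opener) * 3)
    if score >= 10:
        return score, "offer handoff to fresh thread"
    if score >= 6:
        return score, "suggest oracle or spawn"
    if score >= 3:
        return score, "suggest rephrasing approach"
    return score, "continue"
```

note how the protective terms work: two steerings with a file-referenced opener and a few approvals stays below every threshold, while stacked steerings plus a detected simplification escape jumps straight past 10.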
USER ARCHETYPES & CHEAT SHEETS
@concise_commander: the marathon debugger
- 1,219 threads | 86.5 avg turns | 60.5% resolution
- terse (263 chars) | 23% questions | high steering (0.81)
- domain: storage engine, performance, SIMD
what works: socratic questioning (“OK, what’s next?”), marathon persistence, explicit approvals
what triggers steering: premature action, forgetting flags (-run=xxx), full test suites
phrases: “wait”, “dont”, “NO FUCKING SHORTCUTS”
@steady_navigator: the efficient executor
- 1,171 threads | 36.5 avg turns | 67% resolution
- moderate (547 chars) | 43% questions | LOW steering (0.10)
- domain: observability, frontend, ai tooling
what works: polite structured prompts, post-hoc corrections, screenshot-driven
what triggers steering: rarely (2.6% rate); uses post-hoc rejection rather than interrupts
phrases: “please look at”, “almost there”, “see screenshot”
@verbose_explorer: the spawn orchestrator
- 875 threads | 39.1 avg turns | 83% resolution (corrected)
- verbose (932 chars) | 26% questions | moderate steering (0.28)
- domain: devtools, personal projects, skills
- spawned 231 subagents with 97.8% success rate
what works: effective spawn orchestration, long threads (78% resolution at 100+ turns), steering questions as opener
what hurts: evening sessions (lower resolution 19:00-22:00)
note: prior analysis miscounted spawned subagent threads as handoffs, inflating “handoff rate” to 30% and deflating resolution to 33.8%
@precision_pilot: the architect
- 90 threads | 72.9 avg turns | 82.2% resolution
- VERY verbose (2,037 chars) | 34% questions
- domain: streaming, sessions, architecture
what works: plan-oriented prompts, cross-references, multi-thread orchestration
@patient_pathfinder: the infrastructure operator
- 150 threads | 20.3 avg turns | 54% resolution
- concise (293 chars) | 7% questions (most directive)
- domain: kubernetes, prometheus, infrastructure
what works: work hours only (07-17), precise specs, minimal back-and-forth
@feature_lead: the feature spec writer
- 146 threads | 20.7 avg turns | 26% resolution
- detailed (780 chars) | 11% questions | 45% handoff rate
- domain: search_modal, analytics_service, observability features
what works: spec-and-delegate pattern, external code review integration
AGENTS.MD: COPY-PASTE READY
section 1: confirmation gates
## before taking action
confirm with user before:
- running tests/benchmarks (especially with flags like `-run=xxx`, `-bench=xxx`)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests
ASK: "ready to run the tests?" rather than "running the tests now..."
### flag memory
remember user-specified flags across the thread:
- benchmark flags: `-run=xxx`, `-bench=xxx`, `-benchstat`
- test filters: specific test names, package paths
- git conventions: avoid `git add -A`, use explicit file lists
when running similar commands, preserve flags from previous invocations.
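one way to sketch flag memory in python; `remember_flags`, `with_remembered_flags`, and the first-word keying scheme are illustrative assumptions, not an actual amp API:

```python
import shlex

# remembered flags for the current thread, keyed by base command (hypothetical store)
_flag_memory: dict[str, list[str]] = {}

def remember_flags(command: str) -> None:
    """Record flag tokens (anything starting with '-') keyed by the command's first word."""
    tokens = shlex.split(command)
    flags = [t for t in tokens[1:] if t.startswith("-")]
    if flags:
        _flag_memory[tokens[0]] = flags

def with_remembered_flags(command: str) -> str:
    """If a later invocation of the same base command omits flags, re-append the remembered ones."""
    tokens = shlex.split(command)
    if tokens[0] in _flag_memory and not any(t.startswith("-") for t in tokens[1:]):
        return " ".join(tokens + _flag_memory[tokens[0]])
    return command
```

so after the user runs `go test -run=TestCompaction ./storage`, a bare `go test ./storage` later in the thread gets the `-run` filter re-applied instead of silently running the full suite.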
section 2: steering recovery
## after receiving steering
1. acknowledge the correction explicitly
2. do NOT repeat the corrected behavior
3. if pattern recurs (2+ steerings for same issue), ask user for explicit preference
4. track common corrections for this user
### recovery expectations
- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
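the pause rule above reduces to a single pass over message labels; the label strings are assumptions for illustration:

```python
def should_pause(labels: list[str]) -> bool:
    """Return True once 2+ consecutive 'steering' labels appear.

    Any non-steering message (approval or otherwise) breaks the streak,
    matching the STEERING -> APPROVAL recovery pattern.
    """
    streak = 0
    for label in labels:
        streak = streak + 1 if label == "steering" else 0
        if streak >= 2:
            return True
    return False
```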
section 3: thread health monitoring
## thread health indicators
### healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel work
- consistent approval distribution across phases
### warning signals
- ratio drops below 1:1 — intervention needed
- 100+ turns without resolution — marathon risk
- 2+ consecutive steerings — doom spiral forming
- user messages getting longer — frustration signal
### action when unhealthy
1. pause and summarize current state
2. ask if approach should change
3. offer to spawn fresh thread with lessons learned
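the warning checks above can be sketched as one function; the thresholds come straight from the list, the signature is hypothetical:

```python
def warning_signals(approvals: int, steerings: int,
                    consecutive_steerings: int, turns: int,
                    resolved: bool) -> list[str]:
    """Return which thread-health warning signals currently apply."""
    signals = []
    # no steerings yet means the ratio cannot be unhealthy
    ratio = approvals / steerings if steerings else float("inf")
    if ratio < 1.0:
        signals.append("ratio below 1:1, intervention needed")
    if turns >= 100 and not resolved:
        signals.append("100+ turns without resolution, marathon risk")
    if consecutive_steerings >= 2:
        signals.append("2+ consecutive steerings, doom spiral forming")
    return signals
```

any non-empty return triggers the pause/summarize/offer-handoff sequence above.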
section 4: oracle usage
## oracle usage
### DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation
### DON'T use oracle as
- last resort when stuck (too late—46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns
### oracle timing
integrate EARLY (planning phase), not LATE (rescue phase). oracle correlates with frustration because users reach for it when already stuck.
section 5: optimal patterns
## optimal thread patterns
### success predictors
| metric | target | red flag |
|--------|--------|----------|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 without resolution |
| question density | <5% | >15% |
| steering recovery | next msg not steering | consecutive steerings |
| opening message | file refs, 300-1500 chars | no refs, <100 or >2000 |
### thread lifecycle (healthy flow)
1. scope definition (1-3 turns) — include file references
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
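the opening-message targets from the predictor table can be linted mechanically; the regex for @path references is an assumption about how file refs look in practice:

```python
import re

def lint_opener(message: str) -> list[str]:
    """Flag an opening message that misses the success-predictor targets."""
    issues = []
    if not re.search(r"@[\w./-]+", message):
        issues.append("no @path file reference (+25pp success when present)")
    n = len(message)
    if n < 300:
        issues.append(f"{n} chars; 300-1500 is the low-steering zone")
    elif n > 1500:
        issues.append(f"{n} chars; >1500 correlates with more steering")
    return issues
```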
section 6: anti-patterns
## anti-patterns to avoid
### premature action
acting before user confirms intent. triggers "wait..." interrupts (17% of all steerings).
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ choosing file locations without asking
### scope creep
making changes beyond what user asked.
❌ running full test suite instead of targeted tests
❌ adding unwanted abstractions
❌ changing preserved behavior ("WTF. Keep using FillVector!")
❌ refactoring working code while fixing unrelated issue
### test weakening
removing/weakening assertions to make tests pass instead of fixing underlying bugs.
❌ "the agent is drunk and keeps trying to 'fix' the failing test by removing the failing assertion"
### simplification escape
when implementation gets hard, agent "simplifies" requirements instead of solving.
❌ "NOOOOOOOOOOOO. DON'T SIMPLIFY"
❌ creating new files instead of editing existing structure
❌ pivoting to easier approach when stuck
### context overload
>1500 char opening messages paradoxically cause MORE steering and longer threads than 300-700 char messages.
section 7: delegation patterns
## delegation patterns
### when to delegate (Task tool)
- discrete, scoped transformations ("fix X in file Y")
- parallelizable independent changes (2-6 concurrent tasks)
- repetitive operations across multiple files
- clear success criteria available
### when NOT to delegate
- debugging complex emergent behavior
- exploration/research needing context accumulation
- tasks requiring back-and-forth with user
- work where main thread has critical context subagents lack
### healthy delegation signals
- specific imperative verbs: fix, implement, update, add, convert
- file paths or component names in task description
- clear success criteria ("done" defined)
- proactive timing: during neutral phases, not after corrections
### unhealthy delegation
- spawning Task as escape hatch when confused (61.5% frustrated vs 40.5% resolved)
- delegating without clear spec
- spawning multiple concurrent tasks touching same files
- over-fragmentation (>5 spawn depth)
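a rough python lint for Task specs against the healthy signals above; the heuristics (first-word verb, path-like token, 'done'/'success' keyword) are crude assumptions, not a validated classifier:

```python
IMPERATIVE_VERBS = {"fix", "implement", "update", "add", "convert"}

def lint_task_spec(spec: str) -> list[str]:
    """Check a Task description for the healthy-delegation signals."""
    issues = []
    words = spec.split()
    if not words or words[0].lower() not in IMPERATIVE_VERBS:
        issues.append("lead with an imperative verb: fix/implement/update/add/convert")
    if not any("/" in w or w.startswith("@") for w in words):
        issues.append("name a file path or component")
    if "done" not in spec.lower() and "success" not in spec.lower():
        issues.append("define success criteria")
    return issues
```

a spec like "fix the retry loop in src/net/retry.go; done when TestRetry passes" clears all three checks; "maybe look at things" fails every one.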
section 8: user-specific preferences (learned)
## user-specific patterns
### @concise_commander
- terse commands, high question rate (23%)
- 20% "wait" interrupts — confirm before EVERY action
- benchmark-heavy — ALWAYS remember `-run=xxx` flags
- marathon debugging sessions (50+ turns) are intentional workflow
- phrases: "DO NOT change it", "fix the tests", "commit"
### @steady_navigator
- 1% "wait" interrupts — more tolerant of autonomous action
- polite structured prompts ("please look at")
- screenshot-driven, iterative visual refinement
- explicit file paths expected
- post-hoc correction style vs interrupt
### @verbose_explorer
- verbose context frontloading (932 chars avg)
- meta-work focus: skills, tooling, infrastructure
- **power spawn user** — 231 subagents at 97.8% success
- cares about thread organization, spawning
- evening sessions underperform — steer toward afternoon work
- phrases: "search my amp threads", "ship it"
### @patient_pathfinder
- most directive (7% question ratio)
- concise task-focused prompts (293 chars)
- work hours only (07-17)
- low steering via precise specs
### @precision_pilot
- most verbose (2,037 chars avg)
- plan-oriented, architecture-first
- cross-references extensively
- streaming/session state specialist
ACTIONABLE CHECKLIST
for USERS
for AGENTS (AGENTS.md rules)
METRICS DASHBOARD
real-time thread health
┌─────────────────────────────────────────────────────────────────┐
│ THREAD HEALTH INDICATORS │
├──────────────────┼──────────────────────────────────────────────┤
│ approval:steering│ ████████████████████░░░░ 3.2:1 ✓ healthy │
│ turn count │ ██████████░░░░░░░░░░░░░░ 42 ✓ good zone │
│ consecutive steer│ ░░░░░░░░░░░░░░░░░░░░░░░░ 0 ✓ clean │
│ last approval │ ░░░░░░░░░░░░░░░░░░░░░░░░ 3 turns ago │
│ file refs opener │ ██████████████████████████ present ✓ │
└─────────────────────────────────────────────────────────────────┘
target metrics
| metric | target | caution | danger |
|---|---|---|---|
| approval:steering ratio | >2:1 | 1-2:1 | <1:1 |
| steering rate per thread | <5% | 5-8% | >8% |
| recovery rate (next msg not steering) | >85% | 70-85% | <70% |
| consecutive steerings | 0-1 | 2 | 3+ |
| thread spawn depth | 2-3 | 4-5 | >5 |
| opening message file refs | present | — | absent |
| opening message length | 300-1500 | 100-300, 1500-2000 | <100 or >2000 |
| question density | <5% | 5-15% | >15% |
time-of-day resolution
| time block | resolution % | recommendation |
|---|---|---|
| 2-5am | 60.4% | best outcomes—deep focus |
| 6-9am | 59.6% | second best—fresh intent |
| 10-1pm | 48.0% | decent |
| 2-5pm | 43.2% | declining |
| 6-9pm | 27.5% | AVOID for important work |
| 10pm-1am | 47.1% | varies by user |
per-user summary
| user | threads | resolution | steering | archetype |
|---|---|---|---|---|
| @concise_commander | 1,219 | 60.5% | 0.81 | marathon debugger |
| @steady_navigator | 1,171 | 67.0% | 0.10 | efficient executor |
| @verbose_explorer | 875 | 83% | 0.28 | spawn orchestrator |
| @precision_pilot | 90 | 82.2% | 0.41 | architect |
| @patient_pathfinder | 150 | 54.0% | 0.20 | operator |
outcome distribution
RESOLVED ████████████████████████████████ 59.0% (2,745)
UNKNOWN ████████████████████████ 32.6% (1,517)
COMMITTED ████ 3.8% (175)
EXPLORATORY ███ 2.7% (125)
HANDOFF ██ 1.6% (75)
FRUSTRATED ░ 0.2% (10)
corrected 2026-01-09: spawned subagent threads previously miscounted as HANDOFF
DOMAIN EXPERTISE ROUTING
based on vocabulary fingerprinting and outcome rates:
| domain | primary owner | secondary | success rate |
|---|---|---|---|
| storage engine (query_engine, storage_optimizer) | @concise_commander | — | 84% |
| data visualization (canvas, chart) | @concise_commander | @steady_navigator | 85% |
| observability/otel | @steady_navigator | @concise_commander | 68% |
| build tooling (vite, pnpm) | @steady_navigator | — | 63% |
| ai/agent tooling | @steady_navigator | @verbose_explorer | 68% |
| devtools/amp skills | @verbose_explorer | — | varies |
| minecraft/fabric modding | @verbose_explorer | — | personal |
| infrastructure (k8s, prometheus) | @patient_pathfinder | — | 63% |
| streaming/sessions | @precision_pilot | — | 82% |
| search_modal/analytics_service features | @feature_lead | — | 45% handoff |
FAILURE ARCHETYPES (what kills threads)
| archetype | frequency | trigger | fix |
|---|---|---|---|
| PREMATURE_COMPLETION | common | declaring done without verification | always run tests before claiming complete |
| OVER_ENGINEERING | common | adding unnecessary abstractions | question every exposed prop/method |
| SIMPLIFICATION_ESCAPE | common | reducing requirements when stuck | persist with debugging, not scope reduction |
| TEST_WEAKENING | moderate | removing assertions instead of fixing bugs | NEVER modify expected values without fixing impl |
| HACKING_AROUND_PROBLEM | moderate | fragile patches not proper fixes | read docs, understand root cause |
| IGNORING_CODEBASE_PATTERNS | moderate | not reading reference implementations | Read files user provides FIRST |
| NO_DELEGATION | moderate | not spawning subtasks | use Task for clearly scoped parallel work |
| NOT_READING_DOCS | moderate | unfamiliar library usage without docs | web_search for library docs before implementing |
STEERING TAXONOMY
| pattern | % of steerings | meaning | response |
|---|---|---|---|
| "No…" | 47% | flat rejection | acknowledge, reverse course |
| "Wait…" | 17% | premature action | confirm before continuing |
| "Don't…" | 8% | explicit prohibition | add to user prefs |
| "Actually…" | 3% | course correction | acknowledge, adjust |
| "Stop…" | 2% | halt current action | immediate pause |
| "Undo…" | 1% | revert changes | revert, ask what to preserve |
| "WTF…" | 1% | frustration signal | PAUSE, meta-acknowledge, realign |
RESEARCH ALIGNMENT
findings from web research confirm patterns observed in data:
| amp finding | research confirmation |
|---|---|
| steering correlates with success | iterative patterns > linear copy-paste (Ouyang et al. 2024) |
| terse + questions > verbose dumps | structured short prompts often outperform verbose (Gupta 2024) |
| approval:steering ratio predicts outcomes | positive feedback loops = iterative prompting cycles |
| user archetypes show consistent patterns | big five personality maps to interaction styles |
WHAT WE’RE CONFIDENT ABOUT
- structural patterns (turn counts, ratios) are statistically robust across 4,656 threads
- user archetype patterns are consistent within users across time
- steering taxonomy is empirically grounded (47% “no”, 17% “wait”)
- file reference effect (+25%) is the strongest single predictor
- 26-50 turns = sweet spot for resolution
WHAT’S STILL HUNCH
- causal direction between oracle usage and frustration
- whether terse style CAUSES success or reflects expertise
- optimal confirmation frequency (too much also annoys users)
- whether midnight/weekend effects are time or user composition
- learning curve transferability between domains
QUICK REFERENCE CARD
┌─────────────────────────────────────────────────────────────────┐
│ AMP THREAD SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ file references (@path) → +25% success │
│ ✓ 300-1500 char prompts → lowest steering │
│ ✓ 26-50 turns → 75% success rate │
│ ✓ approval:steering >2:1 → healthy thread │
│ ✓ "ship it" / "commit" → explicit checkpoints │
│ ✓ oracle at planning, not rescue │
│ ✓ 2-6 spawned tasks → optimal delegation │
├─────────────────────────────────────────────────────────────────┤
│ ✗ <10 turns → 14% success (abandoned) │
│ ✗ >100 turns → frustration risk increases │
│ ✗ ratio <1:1 → doom spiral, pause and realign │
│ ✗ 2+ consecutive steerings → fundamental misalignment │
│ ✗ oracle as last resort → too late, use for planning │
│ ✗ >1500 char opener → paradoxically MORE problems │
│ ✗ evening work (6-9pm) → 27.5% resolution (worst) │
├─────────────────────────────────────────────────────────────────┤
│ BEST TIMES: 2-5am (60%), 6-9am (59%), weekends (+5pp) │
│ WORST TIME: 6-9pm (27%) — avoid for critical work │
├─────────────────────────────────────────────────────────────────┤
│ STEERING TAXONOMY │
│ 47% "no..." (rejection) | 17% "wait..." (premature action) │
│ 8% "don't..." | 3% "actually..." | 2% "stop..." │
├─────────────────────────────────────────────────────────────────┤
│ RECOVERY: 87% of steerings don't cascade │
│ DOOM LOOP: 2+ consecutive steerings = stop and ask │
└─────────────────────────────────────────────────────────────────┘
synthesized by don_nibbleward from 48 insight files | 2026-01-09
corpus: 4,656 threads | 208,799 messages | 20 users | may 2025 – jan 2026