MEGA-SYNTHESIS: amp thread analysis
consolidated findings from 23 analysis documents spanning 4,656 threads, 208,799 messages, 20 users, 9 months of data.
TOP 20 ACTIONABLE FINDINGS
structural patterns
- 26-50 turns is the sweet spot — 75% success rate. under 10 turns = 14% (abandoned). over 100 turns = frustration risk.
- approval:steering ratio predicts outcomes (see the sketch after this list):
  - >4:1 → COMMITTED (clean execution)
  - 2-4:1 → RESOLVED (healthy)
  - <1:1 → FRUSTRATED (doom spiral)
  - crossing below 1:1 = intervention signal
- file references = +25% success — threads starting with @path/to/file have 66.7% success vs 41.8% without. STRONGEST predictor.
- moderate length wins — U-shaped steering curve: 300-1500 char prompts hit the sweet spot, and very long (>1500) prompts paradoxically cause MORE steering.
- low question density = higher resolution — threads with <5% questions resolve at 76%. interrogative style ≠ productive.
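a minimal sketch of the ratio thresholds above, in Python — the function name and the MIXED label for the unnamed 1:1-2:1 band are illustrative, not from the analysis:

```python
def classify_thread(approvals: int, steerings: int) -> str:
    """Map an approval:steering ratio onto the outcome bands above.

    Thresholds come from the findings: >4:1 COMMITTED, 2-4:1 RESOLVED,
    <1:1 FRUSTRATED. The 1:1-2:1 band is unnamed in the findings;
    it is labeled MIXED here as an assumption.
    """
    if steerings == 0:
        # no corrections at all — treat as clean execution
        return "COMMITTED"
    ratio = approvals / steerings
    if ratio > 4:
        return "COMMITTED"
    if ratio >= 2:
        return "RESOLVED"
    if ratio < 1:
        return "FRUSTRATED"  # doom spiral — intervention signal
    return "MIXED"
```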
steering patterns
- 47% of steerings are flat rejections ("no...") — nearly half. 17% are "wait" interrupts (agent acted before confirmation). see the classifier sketch after this list.
- 87% recovery rate after steering — only 9.4% of steerings lead to another steering. most corrections work.
- steering triggers: premature action, scope creep, forgotten flags (-run=xxx), wrong file locations, unwanted abstractions.
- consecutive steerings = doom loop — 2+ in a row signals fundamental misalignment. only 15 cases of 3+ consecutive in the entire corpus.
- steering timing matters — early steering reflects misunderstood intent; late steering reflects scope drift and accumulated frustration.
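a rough sketch of that taxonomy as a classifier, plus the doom-loop check, assuming steering messages and per-message labels are already extracted (the prefix map mirrors the taxonomy in the quick reference card below):

```python
import re

# prefix → bucket, mirroring the steering taxonomy:
# "no..." 47%, "wait..." 17%, "don't..." 8%, "actually..." 3%, "stop..." 2%
STEERING_PREFIXES = {
    "no": "rejection",
    "wait": "premature_action",
    "don't": "negative_constraint",
    "dont": "negative_constraint",
    "actually": "revision",
    "stop": "halt",
}

def classify_steering(message: str) -> str:
    """Bucket a steering message by its opening word."""
    first = re.split(r"[\s,.!…]+", message.strip().lower(), maxsplit=1)[0]
    return STEERING_PREFIXES.get(first, "other")

def in_doom_loop(labels: list[str], threshold: int = 2) -> bool:
    """True once `threshold`+ consecutive user messages are steerings —
    the misalignment signal called out above."""
    run = 0
    for label in labels:
        run = run + 1 if label == "steering" else 0
        if run >= threshold:
            return True
    return False
```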
user patterns
- terse + high questions = best outcomes — @concise_commander: 263 chars avg, 23% questions, 60% resolution. verbose context-dumping underperforms.
- marathon runners succeed — 69% of @concise_commander's threads exceed 50 turns. persistence correlates with resolution.
- socratic style works — "OK, and what is next?" keeps agent planning visible. better than frontloading dense context.
- high approval:steering ratio pays off — @steady_navigator: 3x approvals per steering, lowest steering rate (2.6%). explicit positive feedback reduces corrections.
- learning is real — @verbose_explorer: 66% reduction in thread length over 8 months (68→23 avg turns).
tool patterns
- oracle is a "stuck" signal — 46% of FRUSTRATED threads use oracle vs 25% of RESOLVED. reached for when already stuck, not for prevention.
- Task usage correlates with frustration — 61.5% of frustrated threads use Task vs 40.5% of resolved. over-delegation when confused.
- core workflow is Bash + edit_file + Read — 3 tools dominate. more messages ≠ better outcomes.
- finder is underutilized — only 11% of resolved threads use it. possibly needs better prompting awareness.
failure patterns
- 7 failure archetypes:
- PREMATURE_COMPLETION: declaring done without verification
- OVER_ENGINEERING: unnecessary abstractions
- HACKING_AROUND_PROBLEM: fragile patches not proper fixes
- IGNORING_CODEBASE_PATTERNS: not reading reference implementations
- NO_DELEGATION: not spawning subtasks
- TEST_WEAKENING: removing assertions instead of fixing bugs
- NOT_READING_DOCS: unfamiliar library usage without docs
USER CHEAT SHEETS
for ALL users
✓ include file references (@path/to/file) in opening message
✓ aim for 26-50 turns — not too short, not marathon
✓ use imperative style ("fix X" not "i want X fixed")
✓ terse prompts + follow-up questions > verbose context dumps
✓ approve explicitly ("ship it", "commit") when satisfied
✓ steer early if off-track — corrections work 87% of the time
✓ spawn subtasks for parallel independent work
✓ use oracle at planning phase, not rescue phase
✗ don't abandon threads < 10 turns without handoff
✗ don't frontload >1500 chars (causes MORE problems)
✗ don't let approval:steering drop below 1:1 without pausing
for TERSE USERS (like @concise_commander)
✓ short commands work — 263 chars avg is fine
✓ high question rate (23%) keeps agent aligned
✓ marathon sessions (50+ turns) work for focused domains
✓ "OK, what's next?" checkpoints are effective
✓ explicit approval signals (16% of messages) reduce corrections
⚠ confirm before agent runs tests/pushes — you steer on premature action
⚠ remember benchmark flags across sessions (-run=xxx)
for SPAWN ORCHESTRATORS (like @verbose_explorer)
✓ front-loading context enables high spawn success (97.8% on 231 subagents)
✓ 83% resolution rate — top tier performer
✓ meta-work (skills, tooling) benefits from explicit commit closures
✓ verbose context (932 chars) provides rich spawn instructions
⚠ explicit "ship it" closures make threads more efficient (40% shorter)
note: prior analysis miscounted @verbose_explorer’s spawns as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.
for VISUAL/ITERATIVE USERS (like @steady_navigator)
✓ screenshot-driven workflow is effective
✓ polite structured prompts work — "please look at X"
✓ low steering rate (2.6%) via precise post-hoc corrections
✓ explicit file paths prevent confusion
✓ iterative visual refinement ("almost there", "still off")
⚠ 43% question ratio is high — focused work with fewer questions resolves faster
for INFRASTRUCTURE/OPERATORS (like @patient_pathfinder)
✓ 7% question ratio is optimal — most directive style
✓ concise task-focused prompts (293 chars)
✓ work hours only (07-17) → clean operational patterns
✓ low steering (0.22) via precise specs
TIME-BASED RECOMMENDATIONS
| time block | resolution % | recommendation |
|---|---|---|
| late night (2-5am) | 60.4% | best outcomes — deep focus |
| morning (6-9am) | 59.6% | second best — fresh intent |
| midday (10-13) | 48.0% | decent |
| afternoon (14-17) | 43.2% | declining |
| evening (18-21) | 27.5% | WORST — avoid for important work |
weekend premium: 48.9% resolution vs 43.7% weekday (+5.2pp)
EXACT AGENTS.MD TEXT TO ADD
section: confirmation gates
## before taking action
confirm with user before:
- running tests/benchmarks (especially with flags like `-run=xxx`, `-bench=xxx`)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests
ASK: "ready to run the tests?" rather than "running the tests now..."
### flag memory
remember user-specified flags across the thread:
- benchmark flags: `-run=xxx`, `-bench=xxx`, `-benchstat`
- test filters: specific test names, package paths
- git conventions: avoid `git add -A`, use explicit file lists
when running similar commands, preserve flags from previous invocations.
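(illustrative aside, not part of the AGENTS.md text: a minimal sketch of flag memory over shell-style command strings — keying on the first token only is a deliberate simplification)

```python
import shlex

class FlagMemory:
    """Remember user-specified -flag=value pairs per command and
    re-apply them when the same command is run without them."""

    def __init__(self) -> None:
        self._flags: dict[str, dict[str, str]] = {}  # cmd -> {flag: value}

    def observe(self, command_line: str) -> None:
        parts = shlex.split(command_line)
        if not parts:
            return
        seen = {p.split("=", 1)[0]: p.split("=", 1)[1]
                for p in parts[1:] if p.startswith("-") and "=" in p}
        if seen:
            self._flags.setdefault(parts[0], {}).update(seen)

    def apply(self, command_line: str) -> str:
        parts = shlex.split(command_line)
        if not parts:
            return command_line
        remembered = self._flags.get(parts[0], {})
        present = {p.split("=", 1)[0] for p in parts[1:] if p.startswith("-")}
        missing = [f"{k}={v}" for k, v in remembered.items() if k not in present]
        return " ".join(parts + missing)

# e.g. observe("go test -run=TestFoo") then apply("go test")
# -> "go test -run=TestFoo"
```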
section: steering recovery
## after receiving steering
1. acknowledge the correction explicitly
2. do NOT repeat the corrected behavior
3. if pattern recurs (2+ steerings for same issue), ask user for explicit preference
4. track common corrections for this user
### steering → recovery expectations
- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
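(illustrative aside: the recovery expectation is computable from a per-message label sequence — the label names are assumptions)

```python
def recovery_rate(labels: list[str]) -> float:
    """Fraction of steerings NOT immediately followed by another
    steering. The corpus-wide figure above is ~87%; below ~70%
    the red-flag table at the end of this doc applies."""
    steer_idx = [i for i, label in enumerate(labels) if label == "steering"]
    if not steer_idx:
        return 1.0
    repeats = sum(1 for i in steer_idx
                  if i + 1 < len(labels) and labels[i + 1] == "steering")
    return 1 - repeats / len(steer_idx)
```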
section: thread health monitoring
## thread health indicators
### healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel work
- consistent approval distribution across phases
### warning signals
- ratio drops below 1:1 — intervention needed
- 100+ turns without resolution — marathon risk
- 2+ consecutive steerings — doom spiral forming
- user messages getting longer — frustration signal
### action when unhealthy
1. pause and summarize current state
2. ask if approach should change
3. offer to spawn fresh thread with lessons learned
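(illustrative aside: the warning signals above as executable checks — field names are assumptions about what instrumentation would record)

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    approvals: int
    steerings: int
    turns: int
    consecutive_steerings: int
    user_msg_lengths: list[int]  # chars per user message, oldest first

def warning_signals(t: ThreadState) -> list[str]:
    """Any hit means: pause, summarize, ask if approach should change."""
    signals = []
    if t.steerings and t.approvals / t.steerings < 1:
        signals.append("approval:steering ratio below 1:1")
    if t.turns > 100:
        signals.append("100+ turns without resolution")
    if t.consecutive_steerings >= 2:
        signals.append("2+ consecutive steerings (doom spiral)")
    # the 1.5x growth factor is an assumed heuristic, not a measured one
    if len(t.user_msg_lengths) >= 3 and \
            t.user_msg_lengths[-1] > 1.5 * t.user_msg_lengths[0]:
        signals.append("user messages getting longer (frustration)")
    return signals
```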
section: oracle usage
## oracle usage
### DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation
### DON'T use oracle as
- last resort when stuck (too late — 46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns
### oracle timing
integrate EARLY (planning phase), not LATE (rescue phase). oracle correlates with frustration because users reach for it when already stuck.
section: optimal thread patterns
## optimal thread patterns
### success predictors
| metric | target | red flag |
|--------|--------|----------|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 without resolution |
| question density | <5% | >15% |
| steering recovery | next msg not steering | consecutive steerings |
### thread lifecycle (healthy flow)
1. scope definition (1-3 turns) — include file references
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
### opening message best practices
- include file references (@path/to/file) — +25% success
- 300-1500 chars optimal (not too brief, not overwhelming)
- imperative style > declarative ("fix X" not "i want X")
- questions for exploration, commands for execution
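(illustrative aside: these checks are mechanical enough to lint — the regex and the declarative-opener word list are assumed heuristics, not from the analysis)

```python
import re

def lint_opening_message(text: str) -> list[str]:
    """Check an opening message against the best practices above."""
    issues = []
    if not re.search(r"@[\w./-]+", text):
        issues.append("no @path/to/file reference (+25% success when present)")
    if not 300 <= len(text) <= 1500:
        issues.append(f"length {len(text)} outside the 300-1500 char band")
    words = text.strip().split()
    if words and words[0].lower() in {"i", "we", "it", "there"}:
        issues.append('declarative opener — prefer imperative ("fix X")')
    return issues
```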
section: delegation patterns
## delegation patterns
### healthy delegation
- use Task for clearly scoped, independent work
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
- each subtask should have clear scope and exit criteria
### unhealthy delegation
- spawning Task as escape hatch when confused
- delegating without clear spec
- spawning multiple concurrent tasks that touch same files
- Task usage 61.5% in frustrated vs 40.5% in resolved — over-delegation is a smell
### when to spawn
- multi-phase work: plan → implement → test → fix → verify
- parallel independent subtasks (don't touch same files)
- when stuck in single context and approach needs reset
section: anti-patterns
## anti-patterns to avoid
### premature action
acting before user confirms intent. triggers "wait..." interrupts (17% of all steerings).
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ choosing file locations without asking
### scope creep
making changes beyond what user asked.
❌ running full test suite instead of targeted tests
❌ adding unwanted abstractions
❌ changing behavior the user asked to preserve ("WTF. Keep using FillVector!")
❌ refactoring working code while fixing unrelated issue
### test weakening
removing/weakening assertions to make tests pass instead of fixing underlying bugs.
❌ "the agent is drunk and keeps trying to 'fix' the failing test by removing the failing assertion"
### oracle as panic button
reaching for oracle only when already stuck correlates with frustration, not resolution.
### context overload
>1500 char opening messages paradoxically cause MORE steering and longer threads than 300-700 char messages.
section: user-specific preferences
## user-specific patterns (learned)
### @concise_commander
- terse commands, high question rate (23%)
- 20% "wait" interrupts — confirm before every action
- benchmark-heavy — ALWAYS remember `-run=xxx` flags
- marathon debugging sessions (50+ turns) are intentional workflow
- phrases: "DO NOT change it", "fix the tests", "commit"
### @steady_navigator
- 1% "wait" interrupts — more tolerant of autonomous action
- polite structured prompts ("please look at")
- screenshot-driven, iterative visual refinement
- explicit file paths expected
- post-hoc correction style vs interrupt
### @verbose_explorer
- verbose context frontloading (932 chars avg)
- meta-work focus: skills, tooling, infrastructure
- **power spawn user** — 231 subagents at 97.8% success, 83% resolution
- cares about thread organization, spawning
- phrases: "search my amp threads", "ship it"
*note: prior analysis miscounted spawned subagent threads as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.*
QUICK REFERENCE CARD
┌─────────────────────────────────────────────────────────────────┐
│ AMP THREAD SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ file references (@path) → +25% success │
│ ✓ 300-1500 char prompts → lowest steering │
│ ✓ 26-50 turns → 75% success rate │
│ ✓ approval:steering >2:1 → healthy thread │
│ ✓ "ship it" / "commit" → explicit checkpoints │
├─────────────────────────────────────────────────────────────────┤
│ ✗ <10 turns → 14% success (abandoned) │
│ ✗ >100 turns → frustration risk increases │
│ ✗ ratio <1:1 → doom spiral, pause and realign │
│ ✗ 2+ consecutive steerings → fundamental misalignment │
│ ✗ oracle as last resort → too late, use for planning │
├─────────────────────────────────────────────────────────────────┤
│ BEST TIMES: 2-5am (60%), 6-9am (59%) │
│ WORST TIME: 6-9pm (27%) — avoid for critical work │
├─────────────────────────────────────────────────────────────────┤
│ STEERING TAXONOMY │
│ 47% "no..." (rejection) | 17% "wait..." (premature action) │
│ 8% "don't..." | 3% "actually..." | 2% "stop..." │
└─────────────────────────────────────────────────────────────────┘
METRICS TO TRACK (if instrumented)
| metric | target | red flag |
|---|---|---|
| steering rate per thread | <5% | >8% |
| approval:steering ratio | >2:1 | <1:1 |
| recovery rate after steering | >85% | <70% |
| consecutive steering count | 0-1 | >2 |
| thread spawn depth | 2-3 | >5 |
| opening message file refs | present | absent |
| opening message length | 300-1500 chars | <100 or >2000 |
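the table translates directly into checks; a sketch with illustrative metric names (values between target and red flag report as gray-zone):

```python
# (on-target predicate, red-flag predicate) per the table above
THREAD_METRIC_CHECKS = {
    "steering_rate":         (lambda v: v < 0.05, lambda v: v > 0.08),
    "approval_ratio":        (lambda v: v > 2.0,  lambda v: v < 1.0),
    "recovery_rate":         (lambda v: v > 0.85, lambda v: v < 0.70),
    "consecutive_steerings": (lambda v: v <= 1,   lambda v: v > 2),
    "spawn_depth":           (lambda v: 2 <= v <= 3, lambda v: v > 5),
    "opening_msg_chars":     (lambda v: 300 <= v <= 1500,
                              lambda v: v < 100 or v > 2000),
}

def grade(metrics: dict[str, float]) -> dict[str, str]:
    """Label each recorded metric on-target, RED FLAG, or gray-zone."""
    out = {}
    for name, (on_target, red_flag) in THREAD_METRIC_CHECKS.items():
        if name in metrics:
            v = metrics[name]
            out[name] = ("on-target" if on_target(v)
                         else "RED FLAG" if red_flag(v)
                         else "gray-zone")
    return out
```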
SOURCES
synthesized from:
- first-message-patterns.md
- learning-curves.md
- length-analysis.md
- error-analysis.md
- message-brevity.md
- conversation-dynamics.md
- steering-deep-dive.md
- @verbose_explorer-specific.md
- tool-patterns.md
- user-comparison.md
- time-analysis.md
- skill-usage.md
- web-research-nlp.md
- failure-autopsy.md
- SYNTHESIS.md
- agents-md-recommendations.md
- question-analysis.md
- thread-flow.md
- web-research-human-ai.md
- web-research-personality.md
- approval-triggers.md
- user-profiles.md
- vocabulary-analysis.md
synthesized by jack_ribbonsun | 2026-01-09