MEGA-SYNTHESIS: amp thread analysis
consolidated findings from 23 analysis documents spanning 4,656 threads, 208,799 messages, 20 users, 9 months of data.
TOP 20 ACTIONABLE FINDINGS
structural patterns
- 26-50 turns is the sweet spot — 75% success rate. under 10 turns = 14% (abandoned). over 100 turns = frustration risk.
- approval:steering ratio predicts outcomes (see the sketch after this list):
  - >4:1 → COMMITTED (clean execution)
  - 2-4:1 → RESOLVED (healthy)
  - <1:1 → FRUSTRATED (doom spiral)
  - crossing below 1:1 = intervention signal
- file references = +25% success — threads starting with @path/to/file have 66.7% success vs 41.8% without. STRONGEST predictor.
- moderate length wins — U-shaped steering curve: 300-1500 char prompts hit the sweet spot, and very long (>1500) prompts paradoxically cause MORE steering.
- low question density = higher resolution — threads with <5% questions resolve at 76%. interrogative style ≠ productive.
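a minimal sketch of the ratio thresholds above, in Python — the function name and the MIXED label for the unnamed 1:1-2:1 band are illustrative, not from the analysis:

```python
def classify_thread(approvals: int, steerings: int) -> str:
    """Map an approval:steering ratio onto the outcome bands above.

    Thresholds come from the findings: >4:1 COMMITTED, 2-4:1 RESOLVED,
    <1:1 FRUSTRATED. The 1:1-2:1 band is unnamed in the findings;
    it is labeled MIXED here as an assumption.
    """
    if steerings == 0:
        # no corrections at all — treat as clean execution
        return "COMMITTED"
    ratio = approvals / steerings
    if ratio > 4:
        return "COMMITTED"
    if ratio >= 2:
        return "RESOLVED"
    if ratio < 1:
        return "FRUSTRATED"  # doom spiral — intervention signal
    return "MIXED"
```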
steering patterns
- 47% of steerings are flat rejections ("no...") — nearly half. 17% are "wait" interrupts (agent acted before confirmation). see the classifier sketch after this list.
- 87% recovery rate after steering — only 9.4% of steerings lead to another steering. most corrections work.
- steering triggers: premature action, scope creep, forgotten flags (-run=xxx), wrong file locations, unwanted abstractions.
- consecutive steerings = doom loop — 2+ in a row signals fundamental misalignment. only 15 cases of 3+ consecutive in the entire corpus.
- steering timing matters — early steering reflects misunderstood intent; late steering reflects scope drift and accumulated frustration.
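a rough sketch of that taxonomy as a classifier, plus the doom-loop check, assuming steering messages and per-message labels are already extracted (the prefix map mirrors the taxonomy in the quick reference card below):

```python
import re

# prefix → bucket, mirroring the steering taxonomy:
# "no..." 47%, "wait..." 17%, "don't..." 8%, "actually..." 3%, "stop..." 2%
STEERING_PREFIXES = {
    "no": "rejection",
    "wait": "premature_action",
    "don't": "negative_constraint",
    "dont": "negative_constraint",
    "actually": "revision",
    "stop": "halt",
}

def classify_steering(message: str) -> str:
    """Bucket a steering message by its opening word."""
    first = re.split(r"[\s,.!…]+", message.strip().lower(), maxsplit=1)[0]
    return STEERING_PREFIXES.get(first, "other")

def in_doom_loop(labels: list[str], threshold: int = 2) -> bool:
    """True once `threshold`+ consecutive user messages are steerings —
    the misalignment signal called out above."""
    run = 0
    for label in labels:
        run = run + 1 if label == "steering" else 0
        if run >= threshold:
            return True
    return False
```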
user patterns
- terse + high questions = best outcomes — @concise_commander: 263 chars avg, 23% questions, 60% resolution. verbose context-dumping underperforms.
- marathon runners succeed — 69% of @concise_commander's threads exceed 50 turns. persistence correlates with resolution.
- socratic style works — "OK, and what is next?" keeps agent planning visible. better than frontloading dense context.
- high approval:steering ratio pays off — @steady_navigator: 3x approvals per steering, lowest steering rate (2.6%). explicit positive feedback reduces corrections.
- learning is real — @verbose_explorer: 66% reduction in thread length over 8 months (68→23 avg turns).
tool patterns
- oracle is a "stuck" signal — 46% of FRUSTRATED threads use oracle vs 25% of RESOLVED. reached for when already stuck, not for prevention.
- Task usage correlates with frustration — 61.5% of frustrated threads use Task vs 40.5% of resolved. over-delegation when confused.
- core workflow is Bash + edit_file + Read — 3 tools dominate. more messages ≠ better outcomes.
- finder is underutilized — only 11% of resolved threads use it. possibly needs better prompting awareness.
failure patterns
- 7 failure archetypes:
- PREMATURE_COMPLETION: declaring done without verification
- OVER_ENGINEERING: unnecessary abstractions
- HACKING_AROUND_PROBLEM: fragile patches not proper fixes
- IGNORING_CODEBASE_PATTERNS: not reading reference implementations
- NO_DELEGATION: not spawning subtasks
- TEST_WEAKENING: removing assertions instead of fixing bugs
- NOT_READING_DOCS: unfamiliar library usage without docs
USER CHEAT SHEETS
for ALL users
✓ include file references (@path/to/file) in opening message
✓ aim for 26-50 turns — not too short, not marathon
✓ use imperative style ("fix X" not "i want X fixed")
✓ terse prompts + follow-up questions > verbose context dumps
✓ approve explicitly ("ship it", "commit") when satisfied
✓ steer early if off-track — corrections work 87% of the time
✓ spawn subtasks for parallel independent work
✓ use oracle at planning phase, not rescue phase
✗ don't abandon threads < 10 turns without handoff
✗ don't frontload >1500 chars (causes MORE problems)
✗ don't let approval:steering drop below 1:1 without pausing
for TERSE USERS (like @concise_commander)
✓ short commands work — 263 chars avg is fine
✓ high question rate (23%) keeps agent aligned
✓ marathon sessions (50+ turns) work for focused domains
✓ "OK, what's next?" checkpoints are effective
✓ explicit approval signals (16% of messages) reduce corrections
⚠ confirm before agent runs tests/pushes — you steer on premature action
⚠ remember benchmark flags across sessions (-run=xxx)
for SPAWN ORCHESTRATORS (like @verbose_explorer)
✓ front-loading context enables high spawn success (97.8% on 231 subagents)
✓ 83% resolution rate — top tier performer
✓ meta-work (skills, tooling) benefits from explicit commit closures
✓ verbose context (932 chars) provides rich spawn instructions
⚠ explicit "ship it" closures make threads more efficient (40% shorter)
note: prior analysis miscounted @verbose_explorer’s spawns as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.
for VISUAL/ITERATIVE USERS (like @steady_navigator)
✓ screenshot-driven workflow is effective
✓ polite structured prompts work — "please look at X"
✓ low steering rate (2.6%) via precise post-hoc corrections
✓ explicit file paths prevent confusion
✓ iterative visual refinement ("almost there", "still off")
⚠ 43% question ratio is high — focused work with fewer questions resolves faster
for INFRASTRUCTURE/OPERATORS (like @patient_pathfinder)
✓ 7% question ratio is optimal — most directive style
✓ concise task-focused prompts (293 chars)
✓ work hours only (07-17) → clean operational patterns
✓ low steering (0.22) via precise specs
TIME-BASED RECOMMENDATIONS
| time block | resolution % | recommendation |
|---|---|---|
| late night (2-5am) | 60.4% | best outcomes — deep focus |
| morning (6-9am) | 59.6% | second best — fresh intent |
| midday (10-13) | 48.0% | decent |
| afternoon (14-17) | 43.2% | declining |
| evening (18-21) | 27.5% | WORST — avoid for important work |
weekend premium: 48.9% resolution vs 43.7% weekday (+5.2pp)
EXACT AGENTS.MD TEXT TO ADD
section: confirmation gates
## before taking action
confirm with user before:
- running tests/benchmarks (especially with flags like `-run=xxx`, `-bench=xxx`)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests
ASK: "ready to run the tests?" rather than "running the tests now..."
### flag memory
remember user-specified flags across the thread:
- benchmark flags: `-run=xxx`, `-bench=xxx`, `-benchstat`
- test filters: specific test names, package paths
- git conventions: avoid `git add -A`, use explicit file lists
when running similar commands, preserve flags from previous invocations.
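(illustrative aside, not part of the AGENTS.md text: a minimal sketch of flag memory over shell-style command strings — keying on the first token only is a deliberate simplification)

```python
import shlex

class FlagMemory:
    """Remember user-specified -flag=value pairs per command and
    re-apply them when the same command is run without them."""

    def __init__(self) -> None:
        self._flags: dict[str, dict[str, str]] = {}  # cmd -> {flag: value}

    def observe(self, command_line: str) -> None:
        parts = shlex.split(command_line)
        if not parts:
            return
        seen = {p.split("=", 1)[0]: p.split("=", 1)[1]
                for p in parts[1:] if p.startswith("-") and "=" in p}
        if seen:
            self._flags.setdefault(parts[0], {}).update(seen)

    def apply(self, command_line: str) -> str:
        parts = shlex.split(command_line)
        if not parts:
            return command_line
        remembered = self._flags.get(parts[0], {})
        present = {p.split("=", 1)[0] for p in parts[1:] if p.startswith("-")}
        missing = [f"{k}={v}" for k, v in remembered.items() if k not in present]
        return " ".join(parts + missing)

# e.g. observe("go test -run=TestFoo") then apply("go test")
# -> "go test -run=TestFoo"
```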
section: steering recovery
## after receiving steering
1. acknowledge the correction explicitly
2. do NOT repeat the corrected behavior
3. if pattern recurs (2+ steerings for same issue), ask user for explicit preference
4. track common corrections for this user
### steering → recovery expectations
- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
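(illustrative aside: the recovery expectation is computable from a per-message label sequence — the label names are assumptions)

```python
def recovery_rate(labels: list[str]) -> float:
    """Fraction of steerings NOT immediately followed by another
    steering. The corpus-wide figure above is ~87%; below ~70%
    the red-flag table at the end of this doc applies."""
    steer_idx = [i for i, label in enumerate(labels) if label == "steering"]
    if not steer_idx:
        return 1.0
    repeats = sum(1 for i in steer_idx
                  if i + 1 < len(labels) and labels[i + 1] == "steering")
    return 1 - repeats / len(steer_idx)
```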
section: thread health monitoring
## thread health indicators
### healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel work
- consistent approval distribution across phases
### warning signals
- ratio drops below 1:1 — intervention needed
- 100+ turns without resolution — marathon risk
- 2+ consecutive steerings — doom spiral forming
- user messages getting longer — frustration signal
### action when unhealthy
1. pause and summarize current state
2. ask if approach should change
3. offer to spawn fresh thread with lessons learned
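(illustrative aside: the warning signals above as executable checks — field names are assumptions about what instrumentation would record)

```python
from dataclasses import dataclass

@dataclass
class ThreadState:
    approvals: int
    steerings: int
    turns: int
    consecutive_steerings: int
    user_msg_lengths: list[int]  # chars per user message, oldest first

def warning_signals(t: ThreadState) -> list[str]:
    """Any hit means: pause, summarize, ask if approach should change."""
    signals = []
    if t.steerings and t.approvals / t.steerings < 1:
        signals.append("approval:steering ratio below 1:1")
    if t.turns > 100:
        signals.append("100+ turns without resolution")
    if t.consecutive_steerings >= 2:
        signals.append("2+ consecutive steerings (doom spiral)")
    # the 1.5x growth factor is an assumed heuristic, not a measured one
    if len(t.user_msg_lengths) >= 3 and \
            t.user_msg_lengths[-1] > 1.5 * t.user_msg_lengths[0]:
        signals.append("user messages getting longer (frustration)")
    return signals
```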
section: oracle usage
## oracle usage
### DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation
### DON'T use oracle as
- last resort when stuck (too late — 46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns
### oracle timing
integrate EARLY (planning phase), not LATE (rescue phase). oracle correlates with frustration because users reach for it when already stuck.
section: optimal thread patterns
## optimal thread patterns
### success predictors
| metric | target | red flag |
|--------|--------|----------|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 without resolution |
| question density | <5% | >15% |
| steering recovery | next msg not steering | consecutive steerings |
### thread lifecycle (healthy flow)
1. scope definition (1-3 turns) — include file references
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
### opening message best practices
- include file references (@path/to/file) — +25% success
- 300-1500 chars optimal (not too brief, not overwhelming)
- imperative style > declarative ("fix X" not "i want X")
- questions for exploration, commands for execution
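(illustrative aside: these checks are mechanical enough to lint — the regex and the declarative-opener word list are assumed heuristics, not from the analysis)

```python
import re

def lint_opening_message(text: str) -> list[str]:
    """Check an opening message against the best practices above."""
    issues = []
    if not re.search(r"@[\w./-]+", text):
        issues.append("no @path/to/file reference (+25% success when present)")
    if not 300 <= len(text) <= 1500:
        issues.append(f"length {len(text)} outside the 300-1500 char band")
    words = text.strip().split()
    if words and words[0].lower() in {"i", "we", "it", "there"}:
        issues.append('declarative opener — prefer imperative ("fix X")')
    return issues
```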
section: delegation patterns
## delegation patterns
### healthy delegation
- use Task for clearly scoped, independent work
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
- each subtask should have clear scope and exit criteria
### unhealthy delegation
- spawning Task as escape hatch when confused
- delegating without clear spec
- spawning multiple concurrent tasks that touch same files
- Task usage 61.5% in frustrated vs 40.5% in resolved — over-delegation is a smell
### when to spawn
- multi-phase work: plan → implement → test → fix → verify
- parallel independent subtasks (don't touch same files)
- when stuck in single context and approach needs reset
section: anti-patterns
## anti-patterns to avoid
### premature action
acting before user confirms intent. triggers "wait..." interrupts (17% of all steerings).
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ choosing file locations without asking
### scope creep
making changes beyond what user asked.
❌ running full test suite instead of targeted tests
❌ adding unwanted abstractions
❌ changing behavior the user asked to preserve ("WTF. Keep using FillVector!")
❌ refactoring working code while fixing unrelated issue
### test weakening
removing/weakening assertions to make tests pass instead of fixing underlying bugs.
❌ "the agent is drunk and keeps trying to 'fix' the failing test by removing the failing assertion"
### oracle as panic button
reaching for oracle only when already stuck correlates with frustration, not resolution.
### context overload
>1500 char opening messages paradoxically cause MORE steering and longer threads than 300-700 char messages.
section: user-specific preferences
## user-specific patterns (learned)
### @concise_commander
- terse commands, high question rate (23%)
- 20% "wait" interrupts — confirm before every action
- benchmark-heavy — ALWAYS remember `-run=xxx` flags
- marathon debugging sessions (50+ turns) are intentional workflow
- phrases: "DO NOT change it", "fix the tests", "commit"
### @steady_navigator
- 1% "wait" interrupts — more tolerant of autonomous action
- polite structured prompts ("please look at")
- screenshot-driven, iterative visual refinement
- explicit file paths expected
- post-hoc correction style vs interrupt
### @verbose_explorer
- verbose context frontloading (932 chars avg)
- meta-work focus: skills, tooling, infrastructure
- **power spawn user** — 231 subagents at 97.8% success, 83% resolution
- cares about thread organization, spawning
- phrases: "search my amp threads", "ship it"
*note: prior analysis miscounted spawned subagent threads as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.*
QUICK REFERENCE CARD
┌─────────────────────────────────────────────────────────────────┐
│ AMP THREAD SUCCESS FACTORS │
├─────────────────────────────────────────────────────────────────┤
│ ✓ file references (@path) → +25% success │
│ ✓ 300-1500 char prompts → lowest steering │
│ ✓ 26-50 turns → 75% success rate │
│ ✓ approval:steering >2:1 → healthy thread │
│ ✓ "ship it" / "commit" → explicit checkpoints │
├─────────────────────────────────────────────────────────────────┤
│ ✗ <10 turns → 14% success (abandoned) │
│ ✗ >100 turns → frustration risk increases │
│ ✗ ratio <1:1 → doom spiral, pause and realign │
│ ✗ 2+ consecutive steerings → fundamental misalignment │
│ ✗ oracle as last resort → too late, use for planning │
├─────────────────────────────────────────────────────────────────┤
│ BEST TIMES: 2-5am (60%), 6-9am (59%) │
│ WORST TIME: 6-9pm (27%) — avoid for critical work │
├─────────────────────────────────────────────────────────────────┤
│ STEERING TAXONOMY │
│ 47% "no..." (rejection) | 17% "wait..." (premature action) │
│ 8% "don't..." | 3% "actually..." | 2% "stop..." │
└─────────────────────────────────────────────────────────────────┘
METRICS TO TRACK (if instrumented)
| metric | target | red flag |
|---|---|---|
| steering rate per thread | <5% | >8% |
| approval:steering ratio | >2:1 | <1:1 |
| recovery rate after steering | >85% | <70% |
| consecutive steering count | 0-1 | >2 |
| thread spawn depth | 2-3 | >5 |
| opening message file refs | present | absent |
| opening message length | 300-1500 chars | <100 or >2000 |
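the table translates directly into checks; a sketch with illustrative metric names (values between target and red flag report as gray-zone):

```python
# (on-target predicate, red-flag predicate) per the table above
THREAD_METRIC_CHECKS = {
    "steering_rate":         (lambda v: v < 0.05, lambda v: v > 0.08),
    "approval_ratio":        (lambda v: v > 2.0,  lambda v: v < 1.0),
    "recovery_rate":         (lambda v: v > 0.85, lambda v: v < 0.70),
    "consecutive_steerings": (lambda v: v <= 1,   lambda v: v > 2),
    "spawn_depth":           (lambda v: 2 <= v <= 3, lambda v: v > 5),
    "opening_msg_chars":     (lambda v: 300 <= v <= 1500,
                              lambda v: v < 100 or v > 2000),
}

def grade(metrics: dict[str, float]) -> dict[str, str]:
    """Label each recorded metric on-target, RED FLAG, or gray-zone."""
    out = {}
    for name, (on_target, red_flag) in THREAD_METRIC_CHECKS.items():
        if name in metrics:
            v = metrics[name]
            out[name] = ("on-target" if on_target(v)
                         else "RED FLAG" if red_flag(v)
                         else "gray-zone")
    return out
```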
SOURCES
synthesized from:
- first-message-patterns.md
- learning-curves.md
- length-analysis.md
- error-analysis.md
- message-brevity.md
- conversation-dynamics.md
- steering-deep-dive.md
- @verbose_explorer-specific.md
- tool-patterns.md
- user-comparison.md
- time-analysis.md
- skill-usage.md
- web-research-nlp.md
- failure-autopsy.md
- SYNTHESIS.md
- agents-md-recommendations.md
- question-analysis.md
- thread-flow.md
- web-research-human-ai.md
- web-research-personality.md
- approval-triggers.md
- user-profiles.md
- vocabulary-analysis.md
synthesized by jack_ribbonsun | 2026-01-09