MEGA-SYNTHESIS: amp thread analysis

consolidated findings from 23 analysis documents spanning 4,656 threads, 208,799 messages, 20 users, 9 months of data.


TOP 20 ACTIONABLE FINDINGS

structural patterns

  1. 26-50 turns is the sweet spot — 75% success rate. under 10 turns = 14% success (mostly abandoned). over 100 turns = frustration risk.

  2. approval:steering ratio predicts outcomes (see the classifier sketch after this list)

    • >4:1 → COMMITTED (clean execution)
    • 2-4:1 → RESOLVED (healthy)
    • <1:1 → FRUSTRATED (doom spiral)
    • crossing below 1:1 = intervention signal

  3. file references = +25% success — threads starting with @path/to/file have 66.7% success vs 41.8% without. STRONGEST predictor.

  4. moderate-length prompts win — steering vs prompt length is U-shaped, with 300-1500 chars at the bottom of the curve. very long prompts (>1500) paradoxically cause MORE steering.

  5. low question density = higher resolution — threads with <5% questions resolve at 76%. interrogative style ≠ productive.
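
a minimal sketch of how the thread-length and ratio thresholds above could drive a classifier. the function name, the upstream tagging of messages as approvals/steerings, and the AT_RISK bucket for the unstated 1-2:1 band are assumptions; only the thresholds come from the findings.

```python
# hypothetical classifier built from the thresholds above; tagging of user
# messages as approvals/steerings is assumed to happen upstream.

def classify_thread(approvals: int, steerings: int, turns: int) -> str:
    ratio = approvals / steerings if steerings else float("inf")
    if turns < 10:
        return "LIKELY_ABANDONED"   # under 10 turns: 14% success
    if ratio < 1.0:
        return "FRUSTRATED"         # doom spiral, intervene
    if ratio > 4.0:
        return "COMMITTED"          # clean execution
    if ratio >= 2.0:
        return "RESOLVED"           # healthy
    return "AT_RISK"                # 1-2:1 band (assumption: watch closely)

print(classify_thread(approvals=12, steerings=3, turns=38))  # RESOLVED
```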

steering patterns

  1. 47% of steerings are flat rejections (“no…”) — nearly half. 17% are “wait” interrupts (agent acted before confirmation).

  2. 87% recovery rate after steering — only 9.4% of steerings lead to another steering. most corrections work.

  3. steering triggers: premature action, scope creep, forgotten flags (-run=xxx), wrong file locations, unwanted abstractions.

  4. consecutive steerings = doom loop — 2+ in a row signals fundamental misalignment. only 15 cases of 3+ consecutive in entire corpus.

  5. steering timing tells a story — early steering corrects misunderstood intent; late steering signals scope drift and accumulated frustration.

user patterns

  1. terse + high questions = best outcomes — @concise_commander: 263 chars, 23% questions, 60% resolution. verbose context-dumping underperforms.

  2. marathon runners succeed — 69% of @concise_commander’s threads exceed 50 turns. persistence correlates with resolution.

  3. socratic style works — “OK, and what is next?” keeps agent planning visible. better than frontloading dense context.

  4. high approval:steering ratio — @steady_navigator: 3x approvals per steer, lowest steering rate (2.6%). explicit positive feedback reduces corrections.

  5. learning is real — @verbose_explorer: 66% reduction in thread length over 8 months (68→23 avg turns).

tool patterns

  1. oracle is a “stuck” signal — 46% of FRUSTRATED threads use oracle vs 25% of RESOLVED. reached for when already stuck, not for prevention.

  2. Task usage correlates with frustration — 61.5% of frustrated threads use Task vs 40.5% of resolved. over-delegation when confused.

  3. core workflow is Bash + edit_file + Read — these 3 tools dominate. more tools and more messages ≠ better outcomes.

  4. finder is underutilized — only 11% of resolved threads use it. likely an awareness gap that better prompting could close.

failure patterns

  1. 7 failure archetypes:
    • PREMATURE_COMPLETION: declaring done without verification
    • OVER_ENGINEERING: unnecessary abstractions
    • HACKING_AROUND_PROBLEM: fragile patches not proper fixes
    • IGNORING_CODEBASE_PATTERNS: not reading reference implementations
    • NO_DELEGATION: not spawning subtasks
    • TEST_WEAKENING: removing assertions instead of fixing bugs
    • NOT_READING_DOCS: unfamiliar library usage without docs

USER CHEAT SHEETS

for ALL users

✓ include file references (@path/to/file) in opening message
✓ aim for 26-50 turns — not too short, not marathon
✓ use imperative style ("fix X" not "i want X fixed")
✓ terse prompts + follow-up questions > verbose context dumps
✓ approve explicitly ("ship it", "commit") when satisfied
✓ steer early if off-track — corrections work 87% of the time
✓ spawn subtasks for parallel independent work
✓ use oracle at planning phase, not rescue phase
✗ don't abandon threads < 10 turns without handoff
✗ don't frontload >1500 chars (causes MORE problems)
✗ don't let steering:approval drop below 1:1 without pausing

for TERSE USERS (like @concise_commander)

✓ short commands work — 263 chars avg is fine
✓ high question rate (23%) keeps agent aligned
✓ marathon sessions (50+ turns) work for focused domains
✓ "OK, what's next?" checkpoints are effective
✓ explicit approval signals (16% of messages) reduce corrections
⚠ confirm before agent runs tests/pushes — you steer on premature action
⚠ remember benchmark flags across sessions (-run=xxx)

for SPAWN ORCHESTRATORS (like @verbose_explorer)

✓ front-loading context enables high spawn success (97.8% on 231 subagents)
✓ 83% resolution rate — top tier performer
✓ meta-work (skills, tooling) benefits from explicit commit closures
✓ verbose context (932 chars) provides rich spawn instructions
⚠ explicit "ship it" closures make threads more efficient (40% shorter)

note: prior analysis miscounted @verbose_explorer’s spawns as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.

for VISUAL/ITERATIVE USERS (like @steady_navigator)

✓ screenshot-driven workflow is effective
✓ polite structured prompts work — "please look at X"
✓ low steering rate (2.6%) via precise post-hoc corrections
✓ explicit file paths prevent confusion
✓ iterative visual refinement ("almost there", "still off")
⚠ 43% question ratio is high — focused work with fewer questions resolves faster

for INFRASTRUCTURE/OPERATORS (like @patient_pathfinder)

✓ 7% question ratio is optimal — most directive style
✓ concise task-focused prompts (293 chars)
✓ work hours only (07-17) → clean operational patterns
✓ low steering (0.22) via precise specs

TIME-BASED RECOMMENDATIONS

| time block | resolution % | recommendation |
|------------|--------------|----------------|
| late night (2-5am) | 60.4% | best outcomes — deep focus |
| morning (6-9am) | 59.6% | second best — fresh intent |
| midday (10-13) | 48.0% | decent |
| afternoon (14-17) | 43.2% | declining |
| evening (18-21) | 27.5% | WORST — avoid for important work |

weekend premium: 48.9% resolution vs 43.7% weekday (+5.2pp)


EXACT AGENTS.MD TEXT TO ADD

section: confirmation gates

## before taking action

confirm with user before:
- running tests/benchmarks (especially with flags like `-run=xxx`, `-bench=xxx`)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests

ASK: "ready to run the tests?" rather than "running the tests now..."

### flag memory

remember user-specified flags across the thread:
- benchmark flags: `-run=xxx`, `-bench=xxx`, `-benchstat`
- test filters: specific test names, package paths
- git conventions: avoid `git add -A`, use explicit file lists

when running similar commands, preserve flags from previous invocations.
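
a minimal sketch of what flag memory could look like if implemented in tooling. the command `benchtool`, the dict store, and the parsing are hypothetical; only the flag names come from this section.

```python
# hypothetical flag-memory store: remember user-supplied flags per command
# and re-apply them to later invocations of the same command.
import shlex

REMEMBERED_PREFIXES = ("-run=", "-bench=", "-benchstat")
flag_memory: dict[str, list[str]] = {}   # command -> remembered flags

def record(command: str) -> None:
    parts = shlex.split(command)
    flags = [p for p in parts[1:] if p.startswith(REMEMBERED_PREFIXES)]
    if flags:
        flag_memory[parts[0]] = flags

def with_remembered_flags(command: str) -> str:
    parts = shlex.split(command)
    for flag in flag_memory.get(parts[0], []):
        key = flag.split("=", 1)[0]
        if not any(p.split("=", 1)[0] == key for p in parts[1:]):
            parts.append(flag)
    return " ".join(parts)

record("benchtool -run=BenchmarkFill -benchstat")
print(with_remembered_flags("benchtool ./pkg"))
# benchtool ./pkg -run=BenchmarkFill -benchstat
```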

section: steering recovery

## after receiving steering

1. acknowledge the correction explicitly
2. do NOT repeat the corrected behavior
3. if pattern recurs (2+ steerings for same issue), ask user for explicit preference
4. track common corrections for this user

### steering → recovery expectations

- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
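
a minimal sketch of the consecutive-steering check, assuming user messages are already tagged upstream; the tag vocabulary is hypothetical.

```python
# hypothetical doom-loop detector: pause after 2+ consecutive steerings.

def should_pause(message_tags: list[str]) -> bool:
    streak = 0
    for tag in message_tags:               # ordered tags per user message
        streak = streak + 1 if tag == "steering" else 0
        if streak >= 2:                    # 2+ in a row: stop and ask
            return True
    return False

print(should_pause(["approval", "steering", "steering"]))  # True
```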

section: thread health monitoring

## thread health indicators

### healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel work
- consistent approval distribution across phases

### warning signals
- ratio drops below 1:1 — intervention needed
- 100+ turns without resolution — marathon risk
- 2+ consecutive steerings — doom spiral forming
- user messages getting longer — frustration signal

### action when unhealthy
1. pause and summarize current state
2. ask if approach should change
3. offer to spawn fresh thread with lessons learned
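
a minimal sketch combining the warning signals above into one check. the ThreadState record and the 1.5x message-growth heuristic are assumptions; the thresholds are the ones listed.

```python
# hypothetical health check over the warning signals above.
from dataclasses import dataclass

@dataclass
class ThreadState:
    approvals: int
    steerings: int
    turns: int
    consecutive_steerings: int
    recent_msg_lengths: list[int]   # last few user message lengths

def warnings(state: ThreadState) -> list[str]:
    w = []
    ratio = state.approvals / max(state.steerings, 1)
    if ratio < 1.0:
        w.append("approval:steering below 1:1, intervention needed")
    if state.turns > 100:
        w.append("100+ turns without resolution, marathon risk")
    if state.consecutive_steerings >= 2:
        w.append("2+ consecutive steerings, doom spiral forming")
    lens = state.recent_msg_lengths
    if len(lens) >= 3 and lens[-1] > 1.5 * lens[0]:   # growth heuristic (assumption)
        w.append("user messages getting longer, frustration signal")
    return w
```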

section: oracle usage

## oracle usage

### DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation

### DON'T use oracle as
- last resort when stuck (too late — 46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns

### oracle timing
integrate EARLY (planning phase), not LATE (rescue phase). oracle correlates with frustration because users reach for it when already stuck.

section: optimal thread patterns

## optimal thread patterns

### success predictors
| metric | target | red flag |
|--------|--------|----------|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 without resolution |
| question density | <5% | >15% |
| steering recovery | next msg not steering | consecutive steerings |

### thread lifecycle (healthy flow)
1. scope definition (1-3 turns) — include file references
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff

### opening message best practices
- include file references (@path/to/file) — +25% success
- 300-1500 chars optimal (not too brief, not overwhelming)
- imperative style > declarative ("fix X" not "i want X")
- questions for exploration, commands for execution
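
a minimal sketch of linting an opening message against these practices. the regex and phrase list are illustrative heuristics, not measured rules.

```python
# hypothetical opening-message lint for the best practices above.
import re

def lint_opening(msg: str) -> list[str]:
    issues = []
    if not re.search(r"@[\w.-]+(/[\w.-]+)+", msg):    # @path/to/file
        issues.append("no file reference: add @path/to/file (+25% success)")
    if len(msg) < 300:
        issues.append("under 300 chars: may underspecify")
    elif len(msg) > 1500:
        issues.append("over 1500 chars: causes MORE steering")
    if msg.lower().startswith(("i want", "i would like", "it would be nice")):
        issues.append("declarative opener: prefer imperative ('fix X')")
    return issues

print(lint_opening("fix the flaky retry test in @pkg/retry/retry_test.go"))
# ['under 300 chars: may underspecify']
```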

section: delegation patterns

## delegation patterns

### healthy delegation
- use Task for clearly scoped, independent work
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
- each subtask should have clear scope and exit criteria

### unhealthy delegation
- spawning Task as escape hatch when confused
- delegating without clear spec
- spawning multiple concurrent tasks that touch same files
- Task usage 61.5% in frustrated vs 40.5% in resolved — over-delegation is a smell

### when to spawn
- multi-phase work: plan → implement → test → fix → verify
- parallel independent subtasks (don't touch same files)
- when stuck in single context and approach needs reset
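
a minimal sketch of a pre-spawn sanity check over the guidance above. the spawn record shape (name/files/scope) is an assumption.

```python
# hypothetical pre-spawn check: depth, scope, and same-file collisions.

def delegation_warnings(spawns: list[dict], current_depth: int) -> list[str]:
    w = []
    if current_depth + 1 > 5:
        w.append("spawn depth >5: over-fragmentation")
    seen: dict[str, str] = {}                  # file -> spawn that touches it
    for s in spawns:                           # {"name", "files", "scope"}
        if not s.get("scope"):
            w.append(f"{s['name']}: no clear scope/exit criteria")
        for f in s.get("files", []):
            if f in seen:
                w.append(f"{s['name']} and {seen[f]} both touch {f}")
            seen[f] = s["name"]
    return w

print(delegation_warnings(
    [{"name": "impl", "files": ["a.go"], "scope": "implement X"},
     {"name": "test", "files": ["a.go"], "scope": "write tests"}],
    current_depth=2,
))  # ['test and impl both touch a.go']
```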

section: anti-patterns

## anti-patterns to avoid

### premature action
acting before user confirms intent. triggers "wait..." interrupts (17% of all steerings).

❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ choosing file locations without asking

### scope creep
making changes beyond what user asked.

❌ running full test suite instead of targeted tests
❌ adding unwanted abstractions
❌ changing preserved behavior ("WTF. Keep using FillVector!")
❌ refactoring working code while fixing unrelated issue

### test weakening
removing/weakening assertions to make tests pass instead of fixing underlying bugs.

❌ "the agent is drunk and keeps trying to 'fix' the failing test by removing the failing assertion"

### oracle as panic button
reaching for oracle only when already stuck correlates with frustration, not resolution.

### context overload
>1500 char opening messages paradoxically cause MORE steering and longer threads than 300-700 char messages.

section: user-specific preferences

## user-specific patterns (learned)

### @concise_commander
- terse commands, high question rate (23%)
- 20% "wait" interrupts — confirm before every action
- benchmark-heavy — ALWAYS remember `-run=xxx` flags
- marathon debugging sessions (50+ turns) are intentional workflow
- phrases: "DO NOT change it", "fix the tests", "commit"

### @steady_navigator
- 1% "wait" interrupts — more tolerant of autonomous action
- polite structured prompts ("please look at")
- screenshot-driven, iterative visual refinement
- explicit file paths expected
- post-hoc correction style vs interrupt

### @verbose_explorer
- verbose context frontloading (932 chars avg)
- meta-work focus: skills, tooling, infrastructure
- **power spawn user** — 231 subagents at 97.8% success, 83% resolution
- cares about thread organization, spawning
- phrases: "search my amp threads", "ship it"

*note: prior analysis miscounted spawned subagent threads as handoffs, incorrectly showing 30% handoff rate. corrected 2026-01-09.*

QUICK REFERENCE CARD

┌─────────────────────────────────────────────────────────────────┐
│                    AMP THREAD SUCCESS FACTORS                    │
├─────────────────────────────────────────────────────────────────┤
│ ✓ file references (@path) → +25% success                        │
│ ✓ 300-1500 char prompts → lowest steering                       │
│ ✓ 26-50 turns → 75% success rate                                │
│ ✓ approval:steering >2:1 → healthy thread                       │
│ ✓ "ship it" / "commit" → explicit checkpoints                   │
├─────────────────────────────────────────────────────────────────┤
│ ✗ <10 turns → 14% success (abandoned)                           │
│ ✗ >100 turns → frustration risk increases                       │
│ ✗ ratio <1:1 → doom spiral, pause and realign                   │
│ ✗ 2+ consecutive steerings → fundamental misalignment           │
│ ✗ oracle as last resort → too late, use for planning            │
├─────────────────────────────────────────────────────────────────┤
│ BEST TIMES: 2-5am (60%), 6-9am (59%)                            │
│ WORST TIME: 6-9pm (27%) — avoid for critical work               │
├─────────────────────────────────────────────────────────────────┤
│ STEERING TAXONOMY                                               │
│ 47% "no..." (rejection) | 17% "wait..." (premature action)     │
│ 8% "don't..." | 3% "actually..." | 2% "stop..."                │
└─────────────────────────────────────────────────────────────────┘

METRICS TO TRACK (if instrumented)

| metric | target | red flag |
|--------|--------|----------|
| steering rate per thread | <5% | >8% |
| approval:steering ratio | >2:1 | <1:1 |
| recovery rate after steering | >85% | <70% |
| consecutive steering count | 0-1 | >2 |
| thread spawn depth | 2-3 | >5 |
| opening message file refs | present | absent |
| opening message length | 300-1500 chars | <100 or >2000 |
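
a minimal sketch of deriving three of these metrics from a tagged message log. the log format is an assumption; the targets are the ones tabulated.

```python
# hypothetical metric derivation from ordered user-message tags.

def thread_metrics(tags: list[str]) -> dict[str, float]:
    steers = tags.count("steering")
    approvals = tags.count("approval")
    # recovery = fraction of steerings NOT immediately followed by another
    repeats = sum(1 for a, b in zip(tags, tags[1:])
                  if a == b == "steering")
    return {
        "steering_rate": steers / len(tags) if tags else 0.0,
        "approval_steering_ratio": approvals / steers if steers else float("inf"),
        "recovery_rate": 1 - repeats / steers if steers else 1.0,
    }

print(thread_metrics(["other", "approval", "steering", "approval", "approval"]))
# {'steering_rate': 0.2, 'approval_steering_ratio': 3.0, 'recovery_rate': 1.0}
```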


synthesized by jack_ribbonsun | 2026-01-09