amp usage retrospective questions
structured questions for teams to discuss in retrospectives, organized by theme. each question is grounded in an analysis of 4,656 threads.
thread health metrics
approval:steering ratio
- what is our team’s average approval:steering ratio? (target: >2:1)
- how many of our threads this sprint fell below 1:1 (doom spiral territory)?
- when we hit consecutive steering messages, what patterns emerge?
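a minimal sketch of how the ratio could be instrumented, assuming each user message has already been classified as APPROVAL or STEERING (the label names and the plain list-of-strings input are illustrative, not amp's actual schema):

```python
from collections import Counter

def approval_steering_ratio(labels: list[str]) -> float | None:
    """per-user-message labels like ["APPROVAL", "STEERING", ...]."""
    counts = Counter(labels)
    approvals, steerings = counts["APPROVAL"], counts["STEERING"]
    if steerings == 0:
        return None  # nothing to steer against; treat as healthy
    return approvals / steerings

def is_doom_spiral(labels: list[str]) -> bool:
    """below 1:1 approval:steering is the doom-spiral threshold."""
    ratio = approval_steering_ratio(labels)
    return ratio is not None and ratio < 1.0
```

with a per-thread ratio in hand, the sprint average and the count of sub-1:1 threads fall out of a plain aggregation.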
thread length
- are we abandoning threads too early? (<10 turns = 14% success rate)
- are threads dragging past 50 turns without resolution? what causes them to run long?
- what’s our “sweet spot” thread length distribution?
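to see where our distribution actually sits, a sketch that buckets per-thread turn counts; only the thresholds come from the analysis, the labels and input shape are made up:

```python
def length_bucket(turns: int) -> str:
    """bucket a thread by turn count: <10, the 10-50 sweet spot, or >50."""
    if turns < 10:
        return "<10 turns (14% success rate)"
    if turns <= 50:
        return "10-50 turns (sweet spot)"
    return ">50 turns (review for drag)"

def length_distribution(turn_counts: list[int]) -> dict[str, int]:
    dist: dict[str, int] = {}
    for turns in turn_counts:
        bucket = length_bucket(turns)
        dist[bucket] = dist.get(bucket, 0) + 1
    return dist
```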
resolution rates
- what % of our threads hit RESOLVED vs HANDOFF vs UNKNOWN?
- are HANDOFFs intentional or abandonment in disguise?
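a sketch of the outcome split, assuming each thread's final state has already been tagged with one of the three labels above (the list-of-strings input is illustrative; a real export may look different):

```python
from collections import Counter

def outcome_breakdown(outcomes: list[str]) -> dict[str, float]:
    """share of threads per outcome (RESOLVED / HANDOFF / UNKNOWN)."""
    counts = Counter(outcomes)
    total = sum(counts.values()) or 1
    return {outcome: n / total for outcome, n in counts.items()}
```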
prompt quality
context anchoring
- are we including @file references in opening prompts? (+25pp success rate when present)
- are openers between 300-1500 chars? (optimal steering rate 0.20-0.21)
- do we describe intent before action, or just give raw directives?
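the first two checks are mechanical enough to lint automatically; a sketch that assumes an @file reference appears literally as `@<token>` in the prompt text (that regex, and the function itself, are guesses rather than anything amp ships):

```python
import re

def lint_opener(opener: str) -> list[str]:
    """heuristic checks on an opening prompt, using the thresholds above."""
    warnings = []
    if not re.search(r"@\S+", opener):
        warnings.append("no @file reference (+25pp success rate when present)")
    if not 300 <= len(opener) <= 1500:
        warnings.append(f"opener is {len(opener)} chars; 300-1500 is the optimal range")
    return warnings
```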
question density
- are we asking the agent clarifying questions in >15% of messages? (a signal of excessive clarification)
- or, conversely, are we asking enough? (<5% question density correlates with 76% resolution)
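question density is easy to approximate; a sketch using a trailing question mark as a naive proxy for a clarifying question (the input is just the user-authored message texts, an assumed shape):

```python
def question_density(user_messages: list[str]) -> float:
    """fraction of user messages that end with a question mark."""
    if not user_messages:
        return 0.0
    questions = sum(1 for m in user_messages if m.rstrip().endswith("?"))
    return questions / len(user_messages)

# >0.15 is the excessive-clarification signal; <0.05 correlated with 76% resolution
```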
anti-patterns detection
agent behavior
- did we observe TEST_WEAKENING (agent “fixing” tests by removing assertions)?
- did the agent suggest “simpler approaches” when implementation got hard? (SIMPLIFICATION_ESCAPE)
- were workarounds applied instead of root-cause fixes? (71.6% workaround rate baseline)
user contribution to failures
- did we give polite requests (“please X”) that got ignored? (12.7% compliance rate)
- did we use prohibitions (“don’t”, “never”) that weren’t followed? (20% compliance rate)
- did we front-load context or drip-feed requirements?
process & tooling
task delegation
- are we using 2-6 task spawns? (77-79% resolution optimal range)
- are we over-delegating (>11 tasks → 58% resolution)?
- are spawn chains going past depth 10? (handoff risk increases)
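a sketch of how these thresholds could be flagged per thread, assuming the spawn count and deepest spawn chain are already available (the argument names are illustrative):

```python
def delegation_flags(spawn_count: int, max_depth: int) -> list[str]:
    """warn when delegation falls outside the ranges above."""
    flags = []
    if spawn_count > 11:
        flags.append("over-delegation: >11 tasks correlated with 58% resolution")
    elif spawn_count > 0 and not 2 <= spawn_count <= 6:
        flags.append("outside the 2-6 spawn range (77-79% resolution)")
    if max_depth > 10:
        flags.append("spawn chain past depth 10 (handoff risk increases)")
    return flags
```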
verification gates
- do threads include verification steps before completion? (78.2% success with verification vs 61.3% without)
- are we running full test suites, or just the “changed” tests?
oracle usage
- are we using oracle EARLY (planning) or LATE (rescue)?
- 46% of FRUSTRATED threads show oracle usage—is ours proactive or reactive?
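early vs late can be read off where oracle first appears in the thread; a sketch over an ordered list of tool-call names, with the tool name and the halfway cutoff both assumed:

```python
def oracle_timing(tool_calls: list[str]) -> str:
    """classify oracle use by its first position in the thread's tool calls."""
    positions = [i for i, name in enumerate(tool_calls) if name == "oracle"]
    if not positions:
        return "none"
    first = positions[0] / max(len(tool_calls) - 1, 1)
    return "early (planning)" if first < 0.5 else "late (rescue)"
```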
temporal patterns
time of day
- are we scheduling complex debugging tasks during the 6-9pm window? (27.5% resolution, the worst window)
- are we leveraging 2-5am or 6-9am windows for hard problems? (~60% resolution)
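a sketch for checking our own resolution rate per window, assuming each thread carries a start time and a resolved flag (both field names are made up for the example):

```python
from collections import defaultdict
from datetime import datetime

WINDOWS = [(2, 5, "2-5am"), (6, 9, "6-9am"), (18, 21, "6-9pm")]

def window_for(started_at: datetime) -> str:
    for start, end, label in WINDOWS:
        if start <= started_at.hour < end:
            return label
    return "other"

def resolution_by_window(threads: list[dict]) -> dict[str, float]:
    """resolution rate per time-of-day window."""
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for t in threads:
        w = window_for(t["started_at"])
        totals[w] += 1
        resolved[w] += int(t["resolved"])
    return {w: resolved[w] / totals[w] for w in totals}
```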
collaboration intensity
- are we rushing threads (>500 msgs/hr → 55% success)?
- can we adopt a more deliberate pace (<50 msgs/hr → 84% success)?
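pace is just message count over wall-clock time; a sketch assuming per-message timestamps are available:

```python
from datetime import datetime

def messages_per_hour(timestamps: list[datetime]) -> float:
    """collaboration pace for one thread."""
    if len(timestamps) < 2:
        return 0.0
    hours = (max(timestamps) - min(timestamps)).total_seconds() / 3600
    return len(timestamps) / hours if hours > 0 else float("inf")

# >500 msgs/hr correlated with 55% success; <50 msgs/hr with 84%
```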
user archetypes & personal patterns
individual reflection
- what’s my personal approval:steering ratio?
- am i a “frontloader” (verbose openers) or “drip feeder” (context over time)?
- do i use a socratic questioning style? (concise_commander pattern: 60.5% resolution)
- do my evening sessions perform worse than morning? (verbose_explorer pattern: 21% evening vs 59% morning)
team comparison
- whose prompting style consistently produces high-resolution threads?
- can we document and share those patterns?
early warning signals
doom spiral detection
- did we catch STEERING→STEERING transitions in real-time?
- how many turns passed before we recognized the misalignment?
- did we pause and realign, or push through?
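catching the spiral in real time only needs a running count of consecutive STEERING labels; a sketch with an assumed two-in-a-row threshold:

```python
def steering_streak_alerts(labels: list[str], threshold: int = 2) -> list[int]:
    """message indices where a run of consecutive STEERING labels hits the
    threshold -- the point to pause and realign rather than push through."""
    alerts, run = [], 0
    for i, label in enumerate(labels):
        run = run + 1 if label == "STEERING" else 0
        if run == threshold:
            alerts.append(i)
    return alerts
```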
frustration detection
- did anyone hit level 4+ on the escalation ladder (profanity, caps)?
- what anti-pattern preceded the frustration? (usually: shortcuts, test weakening)
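a very rough escalation score, for illustration only: the weights and marker words are guesses, and only the idea of a "level 4+" threshold comes from the analysis:

```python
FRUSTRATION_MARKERS = {"wtf", "ffs", "seriously"}  # stand-in word list

def escalation_level(message: str) -> int:
    """crude score: mostly-caps text, stacked punctuation, and marker words."""
    level = 0
    letters = [c for c in message if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        level += 2
    if message.count("!") >= 2 or "??" in message:
        level += 1
    if any(w in message.lower() for w in FRUSTRATION_MARKERS):
        level += 2
    return level
```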
actionable improvements
next sprint experiments
- which anti-pattern will we explicitly watch for?
- what prompt template will we try standardizing?
- which time windows will we protect for deep work?
metrics to track
- can we instrument approval:steering ratio per thread?
- can we flag threads approaching >50 turns for review?
- can we surface “verification gate missing” warnings?
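the three metrics above could share one warning surface; a sketch of a per-thread check a dashboard or bot might run (the argument names are illustrative):

```python
def thread_warnings(turns: int, ratio: float | None, ran_verification: bool) -> list[str]:
    """per-thread warnings mirroring the metrics-to-track list."""
    warnings = []
    if turns > 50:
        warnings.append("past 50 turns -- flag for review")
    if ratio is not None and ratio < 1.0:
        warnings.append("approval:steering ratio below 1:1 (doom spiral territory)")
    if not ran_verification:
        warnings.append("verification gate missing (78.2% vs 61.3% success)")
    return warnings
```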
meta questions
- are these retrospective questions themselves improving our amp usage?
- what new patterns emerged this sprint that aren’t in the catalog?
- should we update the anti-patterns list based on recent experiences?
derived from an analysis of 4,656 amp threads across multiple users and projects