amp usage retrospective questions
structured questions for teams to discuss in retrospectives, organized by theme. each question is grounded in an analysis of 4,656 threads.
thread health metrics
approval:steering ratio
- what is our team’s average approval:steering ratio? (target: >2:1)
- how many of our threads this sprint fell below 1:1 (doom spiral territory)?
- when we hit consecutive steering messages, what patterns emerge?
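a minimal sketch of how the ratio could be instrumented, assuming each user message has already been classified as APPROVAL or STEERING (the label names and the plain list-of-strings input are illustrative, not amp's actual schema):

```python
from collections import Counter

def approval_steering_ratio(labels: list[str]) -> float | None:
    """per-user-message labels like ["APPROVAL", "STEERING", ...]."""
    counts = Counter(labels)
    approvals, steerings = counts["APPROVAL"], counts["STEERING"]
    if steerings == 0:
        return None  # nothing to steer against; treat as healthy
    return approvals / steerings

def is_doom_spiral(labels: list[str]) -> bool:
    """below 1:1 approval:steering is the doom-spiral threshold."""
    ratio = approval_steering_ratio(labels)
    return ratio is not None and ratio < 1.0
```

with a per-thread ratio in hand, the sprint average and the count of sub-1:1 threads fall out of a plain aggregation.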
thread length
- are we abandoning threads too early? (<10 turns = 14% success rate)
- are threads dragging past 50 turns without resolution? what causes them to run long?
- what’s our “sweet spot” thread length distribution?
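to see where our distribution actually sits, a sketch that buckets per-thread turn counts; only the thresholds come from the analysis, the labels and input shape are made up:

```python
def length_bucket(turns: int) -> str:
    """bucket a thread by turn count: <10, the 10-50 sweet spot, or >50."""
    if turns < 10:
        return "<10 turns (14% success rate)"
    if turns <= 50:
        return "10-50 turns (sweet spot)"
    return ">50 turns (review for drag)"

def length_distribution(turn_counts: list[int]) -> dict[str, int]:
    dist: dict[str, int] = {}
    for turns in turn_counts:
        bucket = length_bucket(turns)
        dist[bucket] = dist.get(bucket, 0) + 1
    return dist
```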
resolution rates
- what % of our threads hit RESOLVED vs HANDOFF vs UNKNOWN?
- are HANDOFFs intentional or abandonment in disguise?
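a sketch of the outcome split, assuming each thread's final state has already been tagged with one of the three labels above (the list-of-strings input is illustrative; a real export may look different):

```python
from collections import Counter

def outcome_breakdown(outcomes: list[str]) -> dict[str, float]:
    """share of threads per outcome (RESOLVED / HANDOFF / UNKNOWN)."""
    counts = Counter(outcomes)
    total = sum(counts.values()) or 1
    return {outcome: n / total for outcome, n in counts.items()}
```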
prompt quality
context anchoring
- are we including @file references in opening prompts? (+25pp success rate when present)
- are openers between 300-1500 chars? (optimal steering rate 0.20-0.21)
- do we describe intent before action, or just give raw directives?
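the first two checks are mechanical enough to lint automatically; a sketch that assumes an @file reference appears literally as `@<token>` in the prompt text (that regex, and the function itself, are guesses rather than anything amp ships):

```python
import re

def lint_opener(opener: str) -> list[str]:
    """heuristic checks on an opening prompt, using the thresholds above."""
    warnings = []
    if not re.search(r"@\S+", opener):
        warnings.append("no @file reference (+25pp success rate when present)")
    if not 300 <= len(opener) <= 1500:
        warnings.append(f"opener is {len(opener)} chars; 300-1500 is the optimal range")
    return warnings
```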
question density
- are we asking the agent clarifying questions in >15% of messages? (a signal of excessive clarification)
- or, conversely, are we asking enough? (<5% question density correlates with 76% resolution)
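question density is easy to approximate; a sketch using a trailing question mark as a naive proxy for a clarifying question (the input is just the user-authored message texts, an assumed shape):

```python
def question_density(user_messages: list[str]) -> float:
    """fraction of user messages that end with a question mark."""
    if not user_messages:
        return 0.0
    questions = sum(1 for m in user_messages if m.rstrip().endswith("?"))
    return questions / len(user_messages)

# >0.15 is the excessive-clarification signal; <0.05 correlated with 76% resolution
```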
anti-patterns detection
agent behavior
- did we observe TEST_WEAKENING (agent “fixing” tests by removing assertions)?
- did the agent suggest “simpler approaches” when implementation got hard? (SIMPLIFICATION_ESCAPE)
- were workarounds applied instead of root-cause fixes? (71.6% workaround rate baseline)
user contribution to failures
- did we give polite requests (“please X”) that got ignored? (12.7% compliance rate)
- did we use prohibitions (“don’t”, “never”) that weren’t followed? (20% compliance rate)
- did we front-load context or drip-feed requirements?
process & tooling
task delegation
- are we using 2-6 task spawns? (77-79% resolution optimal range)
- are we over-delegating (>11 tasks → 58% resolution)?
- are spawn chains going past depth 10? (handoff risk increases)
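a sketch of how these thresholds could be flagged per thread, assuming the spawn count and deepest spawn chain are already available (the argument names are illustrative):

```python
def delegation_flags(spawn_count: int, max_depth: int) -> list[str]:
    """warn when delegation falls outside the ranges above."""
    flags = []
    if spawn_count > 11:
        flags.append("over-delegation: >11 tasks correlated with 58% resolution")
    elif spawn_count > 0 and not 2 <= spawn_count <= 6:
        flags.append("outside the 2-6 spawn range (77-79% resolution)")
    if max_depth > 10:
        flags.append("spawn chain past depth 10 (handoff risk increases)")
    return flags
```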
verification gates
- do threads include verification steps before completion? (78.2% success with verification vs 61.3% without)
- are we running full test suites, or just the “changed” tests?
oracle usage
- are we using oracle EARLY (planning) or LATE (rescue)?
- 46% of FRUSTRATED threads show oracle usage—is ours proactive or reactive?
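early vs late can be read off where oracle first appears in the thread; a sketch over an ordered list of tool-call names, with the tool name and the halfway cutoff both assumed:

```python
def oracle_timing(tool_calls: list[str]) -> str:
    """classify oracle use by its first position in the thread's tool calls."""
    positions = [i for i, name in enumerate(tool_calls) if name == "oracle"]
    if not positions:
        return "none"
    first = positions[0] / max(len(tool_calls) - 1, 1)
    return "early (planning)" if first < 0.5 else "late (rescue)"
```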
temporal patterns
time of day
- are we scheduling complex debugging tasks during the 6-9pm window? (27.5% resolution, the worst window)
- are we leveraging 2-5am or 6-9am windows for hard problems? (~60% resolution)
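a sketch for checking our own resolution rate per window, assuming each thread carries a start time and a resolved flag (both field names are made up for the example):

```python
from collections import defaultdict
from datetime import datetime

WINDOWS = [(2, 5, "2-5am"), (6, 9, "6-9am"), (18, 21, "6-9pm")]

def window_for(started_at: datetime) -> str:
    for start, end, label in WINDOWS:
        if start <= started_at.hour < end:
            return label
    return "other"

def resolution_by_window(threads: list[dict]) -> dict[str, float]:
    """resolution rate per time-of-day window."""
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for t in threads:
        w = window_for(t["started_at"])
        totals[w] += 1
        resolved[w] += int(t["resolved"])
    return {w: resolved[w] / totals[w] for w in totals}
```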
collaboration intensity
- are we rushing threads (>500 msgs/hr → 55% success)?
- can we adopt a more deliberate pace (<50 msgs/hr → 84% success)?
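pace is just message count over wall-clock time; a sketch assuming per-message timestamps are available:

```python
from datetime import datetime

def messages_per_hour(timestamps: list[datetime]) -> float:
    """collaboration pace for one thread."""
    if len(timestamps) < 2:
        return 0.0
    hours = (max(timestamps) - min(timestamps)).total_seconds() / 3600
    return len(timestamps) / hours if hours > 0 else float("inf")

# >500 msgs/hr correlated with 55% success; <50 msgs/hr with 84%
```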
user archetypes & personal patterns
individual reflection
- what’s my personal approval:steering ratio?
- am i a “frontloader” (verbose openers) or “drip feeder” (context over time)?
- do i use a socratic questioning style? (concise_commander pattern: 60.5% resolution)
- do my evening sessions perform worse than morning? (verbose_explorer pattern: 21% evening vs 59% morning)
team comparison
- whose prompting style consistently produces high-resolution threads?
- can we document and share those patterns?
early warning signals
doom spiral detection
- did we catch STEERING→STEERING transitions in real-time?
- how many turns passed before we recognized the misalignment?
- did we pause and realign, or push through?
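catching the spiral in real time only needs a running count of consecutive STEERING labels; a sketch with an assumed two-in-a-row threshold:

```python
def steering_streak_alerts(labels: list[str], threshold: int = 2) -> list[int]:
    """message indices where a run of consecutive STEERING labels hits the
    threshold -- the point to pause and realign rather than push through."""
    alerts, run = [], 0
    for i, label in enumerate(labels):
        run = run + 1 if label == "STEERING" else 0
        if run == threshold:
            alerts.append(i)
    return alerts
```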
frustration detection
- did anyone hit level 4+ on the escalation ladder (profanity, caps)?
- what anti-pattern preceded the frustration? (usually: shortcuts, test weakening)
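a very rough escalation score, for illustration only: the weights and marker words are guesses, and only the idea of a "level 4+" threshold comes from the analysis:

```python
FRUSTRATION_MARKERS = {"wtf", "ffs", "seriously"}  # stand-in word list

def escalation_level(message: str) -> int:
    """crude score: mostly-caps text, stacked punctuation, and marker words."""
    level = 0
    letters = [c for c in message if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        level += 2
    if message.count("!") >= 2 or "??" in message:
        level += 1
    if any(w in message.lower() for w in FRUSTRATION_MARKERS):
        level += 2
    return level
```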
actionable improvements
next sprint experiments
- which anti-pattern will we explicitly watch for?
- what prompt template will we try standardizing?
- which time windows will we protect for deep work?
metrics to track
- can we instrument approval:steering ratio per thread?
- can we flag threads approaching >50 turns for review?
- can we surface “verification gate missing” warnings?
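the three metrics above could share one warning surface; a sketch of a per-thread check a dashboard or bot might run (the argument names are illustrative):

```python
def thread_warnings(turns: int, ratio: float | None, ran_verification: bool) -> list[str]:
    """per-thread warnings mirroring the metrics-to-track list."""
    warnings = []
    if turns > 50:
        warnings.append("past 50 turns -- flag for review")
    if ratio is not None and ratio < 1.0:
        warnings.append("approval:steering ratio below 1:1 (doom spiral territory)")
    if not ran_verification:
        warnings.append("verification gate missing (78.2% vs 61.3% success)")
    return warnings
```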
meta questions
- are these retrospective questions themselves improving our amp usage?
- what new patterns emerged this sprint that aren’t in the catalog?
- should we update the anti-patterns list based on recent experiences?
derived from an analysis of 4,656 amp threads across multiple users and projects