AGENTS.md additions
synthesized from analysis of 4,656 threads, 208,799 messages, 1,434 steering events, 2,050 approval events.
before taking action
confirm with user before:
- running tests/benchmarks (especially with flags like -run=xxx, -bench=xxx)
- pushing code or creating commits
- modifying files outside explicitly mentioned scope
- adding abstractions or changing existing behavior
- running full test suites instead of targeted tests
ASK: “ready to run the tests?” rather than “running the tests now…”
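
a minimal sketch of the confirm-first rule above. the action names and the `confirmation_prompt` helper are hypothetical, not part of any agent API; the only point is that a gated action produces a question instead of a status update.

```python
from typing import Optional

# hypothetical action categories, roughly one per bullet in the list above
CONFIRM_FIRST = {
    "run tests",
    "run benchmarks",
    "push code",
    "create commit",
    "edit file outside scope",
    "add abstraction",
    "change existing behavior",
    "run full test suite",
}


def confirmation_prompt(action: str) -> Optional[str]:
    """Return a question for the user, or None when no confirmation is needed."""
    if action in CONFIRM_FIRST:
        return f"ready to {action}?"
    return None
```

an action outside the set (reading a file, say) returns None and can proceed without asking.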
flag memory
remember user-specified flags across the thread:
- benchmark flags: -run=xxx, -bench=xxx, -benchstat
- test filters: specific test names, package paths
- git conventions: avoid git add -A, use explicit file lists
when running similar commands, preserve flags from previous invocations.
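
one way to picture flag preservation, as a sketch only. `FlagMemory` is a hypothetical helper, and it assumes commands look like "tool subcommand [flags] [args]" with flags written as -name or -name=value; flags that take a separate value token are not handled.

```python
import shlex


class FlagMemory:
    """Remembers the flags last used with a command prefix such as ("go", "test")."""

    def __init__(self) -> None:
        self._flags = {}  # (tool, subcommand) -> list of flag tokens

    def remember(self, command: str) -> None:
        """Store every token after the prefix that starts with a dash."""
        parts = shlex.split(command)
        self._flags[tuple(parts[:2])] = [p for p in parts[2:] if p.startswith("-")]

    def apply(self, command: str) -> str:
        """Re-add remembered flags that the new command does not set itself."""
        parts = shlex.split(command)
        present = {p.split("=", 1)[0] for p in parts[2:] if p.startswith("-")}
        missing = [
            f for f in self._flags.get(tuple(parts[:2]), [])
            if f.split("=", 1)[0] not in present
        ]
        # remembered flags go right after the prefix, before positional args
        return shlex.join(parts[:2] + missing + parts[2:])
```

after `remember("go test -run=TestParse -bench=BenchmarkParse ./parser")`, calling `apply("go test ./parser")` gives back the command with both flags restored. a flag the user sets again in the new command wins over the remembered value. TestParse and BenchmarkParse are made-up names.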
scope management
- when asked to do X, do X only
- if you notice related improvements, mention them but don’t implement unless asked
- for tests: use specific test flags (-run=xxx) rather than running entire suites
- before writing to a file: verify the target file/directory with user, especially for new code
- preserve existing behavior by default — don’t refactor working code while fixing unrelated issues
after receiving steering
- acknowledge the correction explicitly
- do NOT repeat the corrected behavior
- if pattern recurs (2+ steerings for same issue), ask user for explicit preference
- track common corrections for this user
recovery expectations
- 87% of steerings should NOT be followed by another steering
- if you hit 2+ consecutive steerings, PAUSE and ask if approach should change
- after STEERING → APPROVAL sequence, user has validated the correction
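
the two numbers in this section can be computed from a labeled message history. this is a sketch under an assumption: each user message has already been labeled "steering", "approval", or "other" by some classifier that is not shown here.

```python
def recovery_rate(events):
    """Share of steerings whose next user message is not another steering.

    Returns None when no steering has a following message yet.
    """
    pairs = [(cur, nxt) for cur, nxt in zip(events, events[1:]) if cur == "steering"]
    if not pairs:
        return None
    return sum(1 for _, nxt in pairs if nxt != "steering") / len(pairs)


def trailing_steerings(events):
    """Number of consecutive steerings at the end of the history."""
    count = 0
    for label in reversed(events):
        if label != "steering":
            break
        count += 1
    return count
```

a `trailing_steerings` result of 2 or more is the cue to pause and ask whether the approach should change.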
thread health indicators
healthy signals
- approval:steering ratio > 2:1
- steady progress with occasional approvals
- spawning subtasks for parallel independent work
- consistent approval distribution across phases
warning signals
- approval:steering ratio < 1:1 — intervention needed
- 2+ consecutive steerings — doom spiral forming
- 100+ turns without resolution — marathon risk
- user messages getting longer — frustration signal
- high steering density (>8% of messages)
action when unhealthy
- pause and summarize current state
- ask if approach should change
- offer to spawn fresh thread with lessons learned
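
the numeric warning signals above can be checked mechanically. a rough sketch, assuming the caller already has the counts; the function name and parameters are illustrative, and the message-length signal is left out because the source gives no threshold for it.

```python
def warning_signals(approvals, steerings, messages, consecutive_steerings, turns):
    """Return the warning signals that currently apply to an unresolved thread."""
    signals = []
    if approvals < steerings:  # same as approval:steering ratio < 1:1
        signals.append("approval:steering ratio < 1:1")
    if consecutive_steerings >= 2:
        signals.append("2+ consecutive steerings")
    if turns >= 100:
        signals.append("100+ turns without resolution")
    if messages and steerings / messages > 0.08:
        signals.append("steering density > 8%")
    return signals
```

any non-empty result maps to the actions listed above: pause, summarize, and ask whether the approach should change.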
oracle usage
DO use oracle for
- planning before implementation
- architecture decisions
- code review pre-merge
- debugging hypotheses
- early phase ideation
DON’T use oracle as
- last resort when stuck (too late—46% of frustrated threads reached for oracle)
- replacement for reading code
- magic fix for unclear requirements
- panic button after 100+ turns
integrate EARLY (planning phase), not LATE (rescue phase).
task delegation
optimal patterns
- spawn 2-6 tasks for parallel independent work (77-79% success)
- each subtask should have clear scope and exit criteria
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
anti-patterns
- spawning Task as escape hatch when confused
- delegating without clear spec
- spawning multiple concurrent tasks that touch same files
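
the last anti-pattern is easy to check before spawning. a small sketch, assuming each planned task declares up front the set of files it expects to touch; the task and file names in the usage note are invented.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def conflicting_tasks(tasks: Dict[str, Set[str]]) -> List[Tuple[str, str, List[str]]]:
    """Return (task_a, task_b, shared_files) for every pair that overlaps."""
    conflicts = []
    for (name_a, files_a), (name_b, files_b) in combinations(tasks.items(), 2):
        shared = sorted(files_a & files_b)
        if shared:
            conflicts.append((name_a, name_b, shared))
    return conflicts
```

for example, a "parser" task touching parser.go and ast.go conflicts with a "printer" task touching ast.go and printer.go, so those two should run sequentially or be merged into one task.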
failure modes to avoid
| archetype | trigger | fix |
|---|---|---|
| PREMATURE_COMPLETION | declaring done without verification | always run tests before claiming complete |
| OVER_ENGINEERING | adding unnecessary abstractions | question every exposed prop/method |
| SIMPLIFICATION_ESCAPE | reducing requirements when stuck | persist with debugging, not scope reduction |
| TEST_WEAKENING | removing assertions instead of fixing bugs | NEVER modify expected values without fixing impl |
| HACKING_AROUND_PROBLEM | fragile patches not proper fixes | read docs, understand root cause |
| IGNORING_CODEBASE_PATTERNS | not reading reference implementations | read files user provides FIRST |
steering taxonomy
| pattern | frequency | response |
|---|---|---|
| “No…” | 47% | flat rejection: acknowledge, reverse course |
| “Wait…” | 17% | premature action: confirm before continuing |
| “Don’t…” | 8% | explicit prohibition: add to user prefs |
| “Actually…” | 3% | course correction: acknowledge, adjust |
| “Stop…” | 2% | halt current action: immediate pause |
| “WTF…” | 1% | frustration signal: PAUSE, meta-acknowledge, realign |
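
the table can be read as a lookup keyed on the first word of the user's message. the sketch below assumes that simple prefix matching is good enough, which it often will not be; `steering_response` is a hypothetical name.

```python
import re
from typing import Optional

# response column of the table above, keyed by lowercased opening word
RESPONSES = {
    "no": "acknowledge, reverse course",
    "wait": "confirm before continuing",
    "don't": "add to user prefs",
    "actually": "acknowledge, adjust",
    "stop": "immediate pause",
    "wtf": "PAUSE, meta-acknowledge, realign",
}

# \b keeps "no" from matching words such as "now" or "nothing"
_OPENING = re.compile(r"(no|wait|don't|actually|stop|wtf)\b")


def steering_response(message: str) -> Optional[str]:
    """Return the suggested response for a steering message, else None."""
    text = message.strip().lower().replace("\u2019", "'")  # normalize curly apostrophe
    match = _OPENING.match(text)
    return RESPONSES[match.group(1)] if match else None
```

a message that opens with none of these words returns None, which means only that this table does not cover it, not that the message is not steering.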
quick reference metrics
| metric | target | caution | danger |
|---|---|---|---|
| approval:steering ratio | >2:1 | 1-2:1 | <1:1 |
| steering rate per thread | <5% | 5-8% | >8% |
| recovery rate (next msg not steering) | >85% | 70-85% | <70% |
| consecutive steerings | 0-1 | 2 | 3+ |
| thread spawn depth | 2-3 | 4-5 | >5 |
| opening message file refs | present | — | absent |
| prompt length | 300-1500 chars | 100-300, 1500-2000 | <100 or >2000 |
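
two rows of the table written as band checks, to make the thresholds concrete. the table's ranges share their endpoints (300 appears in both the target and caution bands), so this sketch assumes a boundary value falls in the healthier band for prompt length and in the caution band for steering rate.

```python
def steering_rate_band(rate: float) -> str:
    """Classify steerings / messages, given as a fraction such as 0.06."""
    if rate < 0.05:
        return "target"
    if rate <= 0.08:
        return "caution"
    return "danger"


def prompt_length_band(chars: int) -> str:
    """Classify an opening prompt by its length in characters."""
    if 300 <= chars <= 1500:
        return "target"
    if 100 <= chars <= 2000:
        return "caution"
    return "danger"
```

the other rows follow the same shape with their own thresholds.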
checklist
- confirm before running tests/pushing
- use specific flags, not defaults
- verify file targets before writing
- preserve existing behavior unless asked to change
- seek frequent small approvals
- spawn subtasks for parallel work (2-6 optimal)
- use oracle early for planning
- if approval:steering drops below 1:1, pause and realign
- after steering, acknowledge and DO NOT repeat the behavior
- never weaken tests — debug root cause instead
- read reference files BEFORE responding when user provides paths
corpus: 4,656 threads | 208,799 messages | 20 users | may 2025 – jan 2026