AGENTS.md recommendations
synthesized from analysis of 4,281 threads, 208,799 messages, 901 steering events, 2,050 approval events.
executive summary
the data reveals a clear pattern: iterative, explicit collaboration beats passive acceptance. users who steer achieve ~60% resolution vs 37% for those who don’t. but excessive steering (a steering:approval ratio above 1:1) signals frustration. the sweet spot is active engagement with a high approval rate.
behaviors to ENCOURAGE
1. confirmation before action
evidence: 46.7% of steering messages start with “no”, 16.6% with “wait”. users steer most when the agent rushes ahead without confirmation.
recommendation:
## execution protocol
before running tests, pushing code, or making significant changes, confirm with user first unless:
- explicitly told to proceed autonomously
- the action is clearly part of an approved plan
- the change is trivial and easily reversible
ASK: "ready to run the tests?" rather than "running the tests now..."
2. scope discipline
evidence: trigger analysis shows “scope creep” and “running full test suite instead of targeted tests” as common steering triggers.
recommendation:
## scope management
- when asked to do X, do X only
- if you notice related improvements, mention them but don't implement unless asked
- for tests: use specific test flags (-run=xxx) rather than running entire suites
- when in doubt about scope, ask
3. flag/option memory
evidence: “You forgot to -run=xxx” is a recurring correction. common flags include -run=xxx, specific filter params, benchmark options.
recommendation:
## command patterns
remember user-specified flags across the thread:
- benchmark flags: -run=xxx, -bench=xxx; benchstat for comparing results
- test filters: specific test names, package paths
- git conventions: avoid git add -A, use explicit file lists
when running similar commands, preserve flags from previous invocations unless user changes them.
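one way to implement this memory is a small per-thread table keyed by command family, merged into later invocations of the same family unless the user overrides; a sketch under that assumption (families, flags, and helper names are illustrative):

```python
import shlex

# per-thread memory: command family -> flags the user specified earlier
remembered_flags: dict[str, list[str]] = {}

def remember(family: str, flags: list[str]) -> None:
    """record flags the user asked for, e.g. remember("go test", ["-run=TestFillVector"])"""
    remembered_flags[family] = flags

def build_command(family: str, args: list[str], overrides: list[str] | None = None) -> str:
    """reapply remembered flags unless the user supplied replacements this turn"""
    flags = overrides if overrides is not None else remembered_flags.get(family, [])
    return shlex.join([*family.split(), *flags, *args])

remember("go test", ["-run=TestFillVector"])
print(build_command("go test", ["./internal/column"]))
# -> go test -run=TestFillVector ./internal/column
```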
4. file location verification
evidence: “No, not in float_test.go. Should go in column_test.go” — users steer on file placement.
recommendation:
## file operations
before writing to a file, especially for new code:
- verify the target file/directory with user
- for tests: confirm whether to add to existing test file or create new one
- for components: check naming conventions in adjacent files
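for the test-file case, a cheap pre-write check is to look for an existing sibling test file before proposing a new one; a sketch using Go's `_test.go` naming convention (the helper itself is hypothetical):

```python
from pathlib import Path

def existing_test_file(source: Path) -> Path | None:
    """return the sibling test file for source if one exists, else None"""
    candidate = source.with_name(source.stem + "_test" + source.suffix)
    return candidate if candidate.exists() else None

target = existing_test_file(Path("internal/column/column.go"))
if target is None:
    # this is the case to surface: "no column_test.go exists; create it, or use another file?"
    print("no existing test file; confirm the target before writing")
```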
5. thread spawning for complex work
evidence: threads that spawn subtasks correlate with deeper, more successful work. max chain depth observed: 5 levels. top spawners produce 20-32 child threads.
recommendation:
## thread structure
for complex multi-phase work:
- use the Task tool to spawn focused subtasks
- each subtask should have clear scope and exit criteria
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
- when stuck in a single context, consider spawning a fresh subtask
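the depth guidance reduces to a trivial guard before each spawn; the thresholds are the ones from the evidence above, the function is hypothetical:

```python
def spawn_depth_check(depth: int) -> str:
    """depth = number of ancestor threads above the would-be subtask (root = 0)"""
    if depth <= 3:
        return "ok"                                   # 2-3 levels tracks with successful work
    if depth <= 5:
        return "warn: approaching over-fragmentation"
    return "stop: beyond 5 levels, consolidate instead of splitting further"
```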
6. uniform approval pacing
evidence: successful threads maintain a consistent approval distribution across phases (early: 1.85, middle: 1.91, late: 1.87), with no front-loading or back-loading.
recommendation:
## pacing
- seek small, frequent confirmations rather than large batches
- if you haven't received feedback in several turns, pause and check in
- don't batch multiple changes before showing user
behaviors to AVOID
1. premature action
evidence: “Wait a fucking second, you responded to all of that without confirming with me?” — strongest steering language appears here.
anti-pattern:
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ making changes beyond the requested scope without flagging them
2. git add -A and blanket operations
evidence: “Revert. NEVER EVER use git add -A” — explicit user rule.
anti-pattern:
❌ git add -A (always use explicit file lists)
❌ running full test suites when specific tests requested
❌ global find-replace without confirmation
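a minimal sketch of the explicit-list habit (the staging helper is hypothetical; `git add -- <files>` is standard git):

```python
import subprocess

def git_add(files: list[str]) -> None:
    """stage an explicit file list; never fall back to the blanket form"""
    if not files:
        raise ValueError("empty file list; refusing to guess (and never `git add -A`)")
    subprocess.run(["git", "add", "--", *files], check=True)

# usage: git_add(["internal/column/column.go", "internal/column/column_test.go"])
```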
3. over-delegation to Task
evidence: Task usage is HIGHER in FRUSTRATED threads (61.5%) than RESOLVED (40.5%). suggests over-delegation when stuck.
anti-pattern:
❌ spawning Task as escape hatch when confused
❌ delegating without clear spec
❌ spawning multiple concurrent tasks that touch same files
healthy pattern: use Task for clearly scoped, independent work—not as a crutch.
4. oracle as last resort
evidence: FRUSTRATED threads use oracle MORE (46.2%) than RESOLVED (25%). oracle gets reached for only after the thread is already stuck.
anti-pattern:
❌ calling oracle only when things go wrong
healthy pattern: use oracle early for planning, not late for rescue.
5. changing preserved behavior
evidence: “WTF. Keep using FillVector!” — users expect existing patterns preserved unless explicitly changing.
anti-pattern:
❌ refactoring working code while fixing unrelated issue
❌ changing API signatures without explicit request
❌ "improving" existing implementations unprompted
optimal thread patterns
success predictors
| metric | target | red flag |
|---|---|---|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 turns without resolution |
| question density | <5% | >15% |
| steering recovery | 87% of post-steering messages are non-steering | consecutive steerings |
thread lifecycle phases
healthy flow:
1. scope definition (1-3 turns)
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
frustrated flow (avoid):
1. vague scope
2. agent assumes approach
3. user steers
4. agent overcorrects
5. user steers again
6. thrashing continues
conversation starters matter less than follow-ups
evidence: 88.7% of questions are follow-ups, not openers. threads succeed through context accumulation, not initial framing.
user-specific patterns worth noting
high-volume users (concise_commander, verbose_explorer, steady_navigator)
| user | style | implication |
|---|---|---|
| concise_commander | 20% “wait” interrupts, heavy steering | prefers explicit control; confirm before every action |
| steady_navigator | 1% “wait”, prefers post-hoc rejection | more tolerant of autonomous action, corrects after |
| verbose_explorer | context/thread management focus | cares about thread organization, spawning |
steering vocabulary by user
- concise_commander: “wait”, “dont”, “nope”, technical corrections
- verbose_explorer: “context”, “thread”, “window”, “rules” — meta-level concerns
implementation checklist
## quick reference
□ confirm before running tests/pushing
□ use specific flags, not defaults
□ verify file targets before writing
□ preserve existing behavior unless asked to change
□ seek frequent small approvals
□ spawn subtasks for parallel work
□ use oracle early for planning
□ if approval:steering drops below 1:1, pause and realign
metrics to track (if instrumented)
- steering rate per thread (target: <5%)
- approval:steering ratio (target: >2:1)
- recovery rate after steering (target: >85%)
- consecutive steering count (red flag: >2)
- thread spawn depth (healthy: 2-3)
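if messages are tagged per thread, all four counters fall out of one pass; a sketch assuming each message is already labeled "steering", "approval", or "other" (the labeling step is upstream and not shown):

```python
def thread_metrics(tags: list[str]) -> dict[str, float]:
    """tags: per-message labels in thread order: "steering", "approval", or "other" """
    steering = tags.count("steering")
    approval = tags.count("approval")
    # recovery: of messages immediately following a steering, how many are not another steering
    followups = [tags[i + 1] for i, t in enumerate(tags[:-1]) if t == "steering"]
    recovered = sum(1 for t in followups if t != "steering")
    # longest run of consecutive steering messages
    longest = run = 0
    for t in tags:
        run = run + 1 if t == "steering" else 0
        longest = max(longest, run)
    return {
        "steering_rate": steering / len(tags) if tags else 0.0,                        # target < 0.05
        "approval_steering_ratio": approval / steering if steering else float("inf"),  # target > 2
        "recovery_rate": recovered / len(followups) if followups else 1.0,             # target > 0.85
        "max_consecutive_steering": float(longest),                                    # red flag > 2
    }
```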
sources
- patterns.json: 901 steering, 2,050 approval messages
- steering-deep-dive.md: taxonomy of 1,434 steering events
- thread-flow.md: outcome analysis of 4,281 threads
- tool-patterns.md: 185,537 assistant messages
- question-analysis.md: 4,600 question patterns
- web-research-human-ai.md: academic research on iterative collaboration