AGENTS.md recommendations
synthesized from analysis of 4,281 threads, 208,799 messages, 901 steering events, 2,050 approval events.
executive summary
the data reveals a clear pattern: iterative, explicit collaboration beats passive acceptance. users who steer achieve ~60% resolution vs 37% for those who don’t. but excessive steering (a steering:approval ratio above 1:1) signals frustration. the sweet spot is active engagement with a high approval rate.
behaviors to ENCOURAGE
1. confirmation before action
evidence: 46.7% of steering messages start with “no”, 16.6% with “wait”. users steer most when the agent rushes ahead without confirmation.
recommendation:
## execution protocol
before running tests, pushing code, or making significant changes, confirm with user first unless:
- explicitly told to proceed autonomously
- the action is clearly part of an approved plan
- the change is trivial and easily reversible
ASK: "ready to run the tests?" rather than "running the tests now..."
2. scope discipline
evidence: trigger analysis shows “scope creep” and “running full test suite instead of targeted tests” as common steering triggers.
recommendation:
## scope management
- when asked to do X, do X only
- if you notice related improvements, mention them but don't implement unless asked
- for tests: use specific test flags (-run=xxx) rather than running entire suites
- when in doubt about scope, ask
3. flag/option memory
evidence: “You forgot to -run=xxx” is a recurring correction. common flags include -run=xxx, specific filter params, benchmark options.
recommendation:
## command patterns
remember user-specified flags across the thread:
- benchmark flags: -run=xxx, -bench=xxx; benchstat for comparing results
- test filters: specific test names, package paths
- git conventions: avoid git add -A, use explicit file lists
when running similar commands, preserve flags from previous invocations unless user changes them.
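one way to implement this memory is a small per-thread table keyed by command family, merged into later invocations of the same family unless the user overrides; a sketch under that assumption (families, flags, and helper names are illustrative):

```python
import shlex

# per-thread memory: command family -> flags the user specified earlier
remembered_flags: dict[str, list[str]] = {}

def remember(family: str, flags: list[str]) -> None:
    """record flags the user asked for, e.g. remember("go test", ["-run=TestFillVector"])"""
    remembered_flags[family] = flags

def build_command(family: str, args: list[str], overrides: list[str] | None = None) -> str:
    """reapply remembered flags unless the user supplied replacements this turn"""
    flags = overrides if overrides is not None else remembered_flags.get(family, [])
    return shlex.join([*family.split(), *flags, *args])

remember("go test", ["-run=TestFillVector"])
print(build_command("go test", ["./internal/column"]))
# -> go test -run=TestFillVector ./internal/column
```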
4. file location verification
evidence: “No, not in float_test.go. Should go in column_test.go” — users steer on file placement.
recommendation:
## file operations
before writing to a file, especially for new code:
- verify the target file/directory with user
- for tests: confirm whether to add to existing test file or create new one
- for components: check naming conventions in adjacent files
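for the test-file case, a cheap pre-write check is to look for an existing sibling test file before proposing a new one; a sketch using Go's `_test.go` naming convention (the helper itself is hypothetical):

```python
from pathlib import Path

def existing_test_file(source: Path) -> Path | None:
    """return the sibling test file for source if one exists, else None"""
    candidate = source.with_name(source.stem + "_test" + source.suffix)
    return candidate if candidate.exists() else None

target = existing_test_file(Path("internal/column/column.go"))
if target is None:
    # this is the case to surface: "no column_test.go exists; create it, or use another file?"
    print("no existing test file; confirm the target before writing")
```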
5. thread spawning for complex work
evidence: threads that spawn subtasks correlate with deeper, more successful work. max chain depth observed: 5 levels. top spawners produce 20-32 child threads.
recommendation:
## thread structure
for complex multi-phase work:
- use the Task tool to spawn focused subtasks
- each subtask should have clear scope and exit criteria
- spawn depth of 2-3 is healthy; beyond 5 suggests over-fragmentation
- when stuck in a single context, consider spawning a fresh subtask
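the depth guidance reduces to a trivial guard before each spawn; the thresholds are the ones from the evidence above, the function is hypothetical:

```python
def spawn_depth_check(depth: int) -> str:
    """depth = number of ancestor threads above the would-be subtask (root = 0)"""
    if depth <= 3:
        return "ok"                                   # 2-3 levels tracks with successful work
    if depth <= 5:
        return "warn: approaching over-fragmentation"
    return "stop: beyond 5 levels, consolidate instead of splitting further"
```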
6. uniform approval pacing
evidence: successful threads maintain a consistent approval distribution across phases (early: 1.85, middle: 1.91, late: 1.87), with no front-loading or back-loading.
recommendation:
## pacing
- seek small, frequent confirmations rather than large batches
- if you haven't received feedback in several turns, pause and check in
- don't batch multiple changes before showing user
behaviors to AVOID
1. premature action
evidence: “Wait a fucking second, you responded to all of that without confirming with me?” — strongest steering language appears here.
anti-pattern:
❌ "Now let's run the tests to see if this fixes..."
❌ pushing code before user reviews
❌ making changes beyond the requested scope without flagging them
2. git add -A and blanket operations
evidence: “Revert. NEVER EVER use git add -A” — explicit user rule.
anti-pattern:
❌ git add -A (always use explicit file lists)
❌ running full test suites when specific tests requested
❌ global find-replace without confirmation
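a minimal sketch of the explicit-list habit (the staging helper is hypothetical; `git add -- <files>` is standard git):

```python
import subprocess

def git_add(files: list[str]) -> None:
    """stage an explicit file list; never fall back to the blanket form"""
    if not files:
        raise ValueError("empty file list; refusing to guess (and never `git add -A`)")
    subprocess.run(["git", "add", "--", *files], check=True)

# usage: git_add(["internal/column/column.go", "internal/column/column_test.go"])
```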
3. over-delegation to Task
evidence: Task usage is HIGHER in FRUSTRATED threads (61.5%) than RESOLVED (40.5%). suggests over-delegation when stuck.
anti-pattern:
❌ spawning Task as escape hatch when confused
❌ delegating without clear spec
❌ spawning multiple concurrent tasks that touch same files
healthy pattern: use Task for clearly scoped, independent work—not as a crutch.
4. oracle as last resort
evidence: FRUSTRATED threads use oracle MORE (46.2%) than RESOLVED (25%). oracle gets reached for only after the thread is already stuck.
anti-pattern:
❌ calling oracle only when things go wrong
healthy pattern: use oracle early for planning, not late for rescue.
5. changing preserved behavior
evidence: “WTF. Keep using FillVector!” — users expect existing patterns preserved unless explicitly changing.
anti-pattern:
❌ refactoring working code while fixing unrelated issue
❌ changing API signatures without explicit request
❌ "improving" existing implementations unprompted
optimal thread patterns
success predictors
| metric | target | red flag |
|---|---|---|
| approval:steering ratio | >2:1 | <1:1 |
| thread length | 26-50 turns | >100 turns without resolution |
| question density | <5% | >15% |
| steering recovery | 87% of post-steering messages are non-steering | consecutive steerings |
thread lifecycle phases
healthy flow:
1. scope definition (1-3 turns)
2. plan confirmation (user approves approach)
3. execution with incremental approval
4. verification (tests, review)
5. commit/handoff
frustrated flow (avoid):
1. vague scope
2. agent assumes approach
3. user steers
4. agent overcorrects
5. user steers again
6. thrashing continues
conversation starters matter less than follow-ups
evidence: 88.7% of questions are follow-ups, not openers. threads succeed through context accumulation, not initial framing.
user-specific patterns worth noting
high-volume users (concise_commander, verbose_explorer, steady_navigator)
| user | style | implication |
|---|---|---|
| concise_commander | 20% “wait” interrupts, heavy steering | prefers explicit control; confirm before every action |
| steady_navigator | 1% “wait”, prefers post-hoc rejection | more tolerant of autonomous action, corrects after |
| verbose_explorer | context/thread management focus | cares about thread organization, spawning |
steering vocabulary by user
- concise_commander: “wait”, “dont”, “nope”, technical corrections
- verbose_explorer: “context”, “thread”, “window”, “rules” — meta-level concerns
implementation checklist
## quick reference
□ confirm before running tests/pushing
□ use specific flags, not defaults
□ verify file targets before writing
□ preserve existing behavior unless asked to change
□ seek frequent small approvals
□ spawn subtasks for parallel work
□ use oracle early for planning
□ if approval:steering drops below 1:1, pause and realign
metrics to track (if instrumented)
- steering rate per thread (target: <5%)
- approval:steering ratio (target: >2:1)
- recovery rate after steering (target: >85%)
- consecutive steering count (red flag: >2)
- thread spawn depth (healthy: 2-3)
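if messages are tagged per thread, all four counters fall out of one pass; a sketch assuming each message is already labeled "steering", "approval", or "other" (the labeling step is upstream and not shown):

```python
def thread_metrics(tags: list[str]) -> dict[str, float]:
    """tags: per-message labels in thread order: "steering", "approval", or "other" """
    steering = tags.count("steering")
    approval = tags.count("approval")
    # recovery: of messages immediately following a steering, how many are not another steering
    followups = [tags[i + 1] for i, t in enumerate(tags[:-1]) if t == "steering"]
    recovered = sum(1 for t in followups if t != "steering")
    # longest run of consecutive steering messages
    longest = run = 0
    for t in tags:
        run = run + 1 if t == "steering" else 0
        longest = max(longest, run)
    return {
        "steering_rate": steering / len(tags) if tags else 0.0,                        # target < 0.05
        "approval_steering_ratio": approval / steering if steering else float("inf"),  # target > 2
        "recovery_rate": recovered / len(followups) if followups else 1.0,             # target > 0.85
        "max_consecutive_steering": float(longest),                                    # red flag > 2
    }
```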
sources
- patterns.json: 901 steering, 2,050 approval messages
- steering-deep-dive.md: taxonomy of 1,434 steering events
- thread-flow.md: outcome analysis of 4,281 threads
- tool-patterns.md: 185,537 assistant messages
- question-analysis.md: 4,600 question patterns
- web-research-human-ai.md: academic research on iterative collaboration