counter-intuitive findings

patterns from 4,656 threads that contradict common assumptions about human-AI collaboration.


1. MORE CONTEXT ≠ BETTER OUTCOMES

assumption: longer, more detailed prompts should reduce ambiguity and improve results.

reality: >1500 char opening messages cause ~2.6x MORE steering than 300-700 char messages.

prompt length (chars)    avg turns   avg steering
medium (300-700)         37.2        0.21
detailed (700-1500)      36.7        0.20
comprehensive (>1500)    71.8        0.55

why: overwhelming context leads to agent focusing on wrong details. key points get buried. agent scope-creeps based on mentioned-but-not-priority items.

implication: front-load PRIORITY, not VOLUME. 300-1500 chars is the goldilocks zone.
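
a minimal sketch of the bucketing behind the table above, assuming a hypothetical threads DataFrame with opening_chars, turns, and steering_events columns (column names are illustrative, not from the dataset):

```python
import pandas as pd

def steering_by_prompt_length(threads: pd.DataFrame) -> pd.DataFrame:
    """bucket threads by opening-message length, compare turns and steering."""
    bins = [0, 300, 700, 1500, float("inf")]
    labels = ["short (<300)", "medium (300-700)",
              "detailed (700-1500)", "comprehensive (>1500)"]
    buckets = pd.cut(threads["opening_chars"], bins=bins, labels=labels)
    return (threads.groupby(buckets, observed=True)
                   .agg(avg_turns=("turns", "mean"),
                        avg_steering=("steering_events", "mean")))
```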


2. STEERING = SUCCESS SIGNAL, NOT FAILURE

assumption: corrections indicate the conversation is going poorly.

reality: threads WITH steering have HIGHER resolution rates than unsteered threads (60% vs 37%).

why: steering means user is engaged and guiding. unsteered threads are often abandoned before completion. the act of correcting means the user cares enough to continue.

implication: don’t optimize to minimize steering. optimize for steering RECOVERY rate.
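
one way to operationalize steering RECOVERY rate, assuming the same hypothetical columns (a steering_events count and a resolved boolean):

```python
import pandas as pd

def steering_recovery_rate(threads: pd.DataFrame) -> float:
    """of threads that received at least one correction, the fraction
    that still resolved. optimize this, not raw steering counts."""
    steered = threads[threads["steering_events"] > 0]
    return steered["resolved"].mean() if not steered.empty else float("nan")
```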


3. ORACLE CORRELATES WITH FRUSTRATION (but doesn’t cause it)

assumption: using oracle should improve outcomes by bringing in better reasoning.

reality: 46% of FRUSTRATED threads invoke oracle vs 25% of RESOLVED threads.

why: oracle is reached for when already stuck, not proactively. selection bias—hard tasks both frustrate AND warrant oracle. 8/14 frustrated threads never used oracle at all.

late oracle (>66% into thread) → 82.8% success rate, 0% frustration
early oracle (≤33% into thread) → 78.8% success, 1.4% frustration

implication: oracle timing matters. use it proactively for PLANNING (early-mid) or as VALIDATION/REVIEW (late); reaching for it as a RESCUE once already stuck is the pattern that correlates with frustration.
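
the cutoffs above imply a simple position metric; a sketch (function name and turn-index convention are illustrative):

```python
def oracle_phase(oracle_turn: int, total_turns: int) -> str:
    """classify an oracle invocation by its relative position in the thread,
    using the <=33% / >66% cutoffs from the data above."""
    frac = oracle_turn / total_turns
    if frac <= 1 / 3:
        return "early"  # planning: 78.8% success, 1.4% frustration
    if frac <= 2 / 3:
        return "mid"
    return "late"       # validation/review: 82.8% success, 0% frustration
```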


4. TERSE USERS OUTPERFORM VERBOSE USERS (CORRECTED: BOTH STYLES WORK)

assumption: providing more detail helps the agent understand the task.

reality: both styles can work well.

user                   avg msg length   resolution rate
@concise_commander     263 chars        60.5%
@patient_pathfinder    293 chars        54.0%
@steady_navigator      547 chars        67.0%
@verbose_explorer      932 chars        83% (corrected)

update: prior analysis incorrectly classified @verbose_explorer’s spawned subagent threads as failures. verbose context actually enables effective spawn orchestration (231 subagents at 97.8% success).

implication: both styles work — terse for socratic iteration, verbose for spawn context.


5. EVENING WORK IS DRAMATICALLY WORSE

assumption: productivity depends on the task, not the clock.

reality: evening (6-9pm) shows 27.5% resolution. late night (2-5am) shows 60.4%.

time block            resolution %
late night (2-5am)    60.4%
morning (6-9am)       59.6%
evening (6-9pm)       27.5%

why: evening = busiest time (peak usage) but also fatigue accumulation. morning and late night = self-selected focus time. evening threads may be more exploratory, speculative, “let me try this” threads that don’t reach closure.

implication: schedule critical work for morning. avoid evening for important tasks. late night works if you’re the type to do late night work.


6. WEEKEND WORK OUTPERFORMS WEEKDAY

assumption: weekday focus > weekend side projects.

reality: weekend resolution 48.9% vs weekday 43.7% (+5.2pp premium).

why: fewer interruptions. self-selected important tasks (you don’t work weekends on unimportant stuff). more focused session intent.

implication: if something MUST succeed, consider weekend slot.


7. LOW QUESTION DENSITY = HIGHER RESOLUTION

assumption: asking more questions should clarify intent and improve alignment.

reality: threads with <5% question messages resolve at 76%. threads with >15% questions have lower resolution rates.

density        resolution rate   avg turns
high (>15%)    lower             12.3
low (<5%)      76%               105.6

why: interrogative mode ≠ execution mode. heavy questioning indicates confusion, not collaboration. low-question threads are DOING work, not figuring out what to do.

implication: use questions sparingly. decisive instructions > exploratory questions.
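
a crude proxy for question density, assuming a question message is one containing "?" (the exact detector behind these numbers isn't specified):

```python
def question_density(user_messages: list[str]) -> float:
    """fraction of user messages that read as questions."""
    if not user_messages:
        return 0.0
    return sum("?" in m for m in user_messages) / len(user_messages)
```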


8. MARATHON THREADS SUCCEED MORE OFTEN

assumption: long threads indicate spinning/struggling.

reality: 26-50 turns = 75% success. <10 turns = 14% success.

why: short threads are often abandoned, not completed. complex tasks REQUIRE many turns. persistence correlates with success. the work doesn’t get easier by starting over.

implication: stay longer. don’t bail at first difficulty.


9. COLLABORATIVE OPENERS PRODUCE LONGEST THREADS

assumption: “we” and “let’s” indicate productive partnership.

reality: threads starting with collaborative language (“we”, “let’s”) average 249 messages—the LONGEST threads.

why: collaborative framing often accompanies vague or open-ended tasks. “let’s explore X” ≠ “fix X.” partnership language doesn’t constrain scope.

implication: collaborative ≠ efficient. imperative style (“fix X”) outperforms declarative (“i want X fixed”) and collaborative (“let’s work on X”).
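
a rough heuristic for tagging opener style; the patterns are illustrative guesses, not the report's actual classifier:

```python
import re

_COLLABORATIVE = re.compile(r"^\s*(let's|let us|we\b|shall we)", re.IGNORECASE)
_DECLARATIVE = re.compile(r"^\s*(i want|i need|i'd like|it would be)", re.IGNORECASE)

def opener_style(first_message: str) -> str:
    """tag an opening message as collaborative, declarative, or imperative."""
    if _COLLABORATIVE.search(first_message):
        return "collaborative"  # "let's work on X" -> longest threads
    if _DECLARATIVE.search(first_message):
        return "declarative"    # "i want X fixed"
    return "imperative"         # direct commands like "fix X"
```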


10. TASK DELEGATION CORRELATES WITH FRUSTRATION

assumption: spawning sub-agents should parallelize work and improve outcomes.

reality: 61.5% of frustrated threads use Task vs 40.5% of resolved threads.

why: users reach for Task when confused or overwhelmed, not strategically. over-delegation when scope is unclear. “throw another agent at it” as escape hatch.

optimal: 2-6 Task spawns. beyond that, diminishing returns. spawn depth >10 = abandon risk.

implication: delegate with clear specs, not as panic response.


11. POLITE REQUESTS GET IGNORED MORE

assumption: politeness is neutral or positive for compliance.

reality: 12.7% compliance for polite requests (“please X”) vs 22.8% for direct verbs.

why: models may parse “please X” as softer priority. direct imperatives are unambiguous. politeness adds words that dilute the command.

implication: be direct. “fix the bug” > “please fix the bug if you can.”
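
to apply this mechanically, a sketch that strips common softeners (the phrase list is illustrative):

```python
import re

_SOFTENERS = re.compile(
    r"\b(please|if you can|if possible|when you get a chance|could you|maybe)\b[,;]?\s*",
    re.IGNORECASE,
)

def directify(request: str) -> str:
    """strip softening phrases so the imperative verb leads the request."""
    return _SOFTENERS.sub("", request).strip()

# directify("please fix the bug if you can") -> "fix the bug"
```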


12. CONSTRAINTS ARE FREQUENTLY VIOLATED

assumption: saying “only X” should limit agent behavior to X.

reality: 16.4% compliance rate for constraints. prohibitions get lost in multi-step reasoning.

why: “only” and “don’t” statements require maintaining negative constraints across context window. agents optimize for task completion, not constraint satisfaction.

implication: repeat constraints. ask agent to echo them back. monitor for violations.
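
one way to bake "repeat constraints + echo them back" into a prompt builder (wording is illustrative, not a tested template):

```python
def with_echoed_constraints(task: str, constraints: list[str]) -> str:
    """wrap a task prompt with restated constraints and an echo-back request."""
    rules = "\n".join(f"- {c}" for c in constraints)
    return (
        f"{task}\n\n"
        f"hard constraints (do not violate):\n{rules}\n\n"
        "before starting, restate these constraints in your own words "
        "and flag any step that risks violating them."
    )
```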


13. COMMITTED THREADS ARE SHORTER THAN RESOLVED ONES

assumption: committing = completing the full task.

reality: avg COMMITTED thread: 57 turns. avg RESOLVED thread: 67.7 turns.

why: commits happen at specific checkpoints, not at task completion. “ship this part” ≠ “task is done.” threads often continue post-commit.

implication: commit early, commit often. don’t wait for “done.”


14. HANDOFFS CLUSTER IN FIRST 10 TURNS

assumption: handoffs happen when threads get stuck late.

reality: 45% of handoffs happen within first 10 turns.

why: early handoffs = task/tool mismatch, scope confusion, “wrong thread.” not failure—appropriate early termination. continuing a mismatched thread is worse than starting fresh.

implication: early bail is sometimes correct. don’t force fit.


summary table

assumption                  reality                                        effect size
more context → better       >1500 chars → 2.6x more steering               +0.34 steering
steering = failure          steered threads resolve 60% vs 37%             +23pp
oracle = rescue             late oracle = best outcomes                    82.8% success
verbose = clear             both styles work (see correction in #4)       corrected
evening = fine              27.5% evening vs 60% late night                -32pp
weekday focus               weekend +5.2pp resolution                      +5.2pp
questions = alignment       low questions (<5%) = 76% resolution           +15pp
short threads = efficient   <10 turns = 14% success                        -61pp vs sweet spot
delegation = parallel       over-delegation correlates with frustration    +21pp frustrated
polite = neutral            direct verbs +10pp compliance                  +10pp

compiled from 4,656 threads, 208,799 messages, 20 users, 9 months of data
ann_flickerer | 2026-01-09