counter-intuitive findings

patterns from 4,656 threads that contradict common assumptions about human-AI collaboration.

1. MORE CONTEXT ≠ BETTER OUTCOMES

assumption: longer, more detailed prompts should reduce ambiguity and improve results.

reality: opening messages over 1500 chars see about 2.6x MORE steering than 300-700 char messages (avg steering 0.55 vs 0.21).

prompt length	avg turns	avg steering
medium (300-700)	37.2	0.21
detailed (700-1500)	36.7	0.20
comprehensive (>1500)	71.8	0.55

why: overwhelming context leads the agent to focus on the wrong details. key points get buried. the agent scope-creeps on items that were mentioned but not prioritized.

implication: front-load PRIORITY, not VOLUME. 300-1500 chars is the goldilocks zone.
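
a minimal sketch of the bucketing behind the table above, assuming length is counted in characters of the opening message. the function name, the "short" label for messages under 300 chars, and putting the shared boundaries (700, 1500) in the lower bucket are all assumptions, not taken from the analysis.

```python
def length_bucket(opening_message: str) -> str:
    """Classify an opening message by character count."""
    n = len(opening_message)
    if n < 300:
        return "short"          # hypothetical label; not reported in the table
    if n <= 700:
        return "medium"         # 300-700
    if n <= 1500:
        return "detailed"       # 700-1500 (boundary handling is an assumption)
    return "comprehensive"      # >1500


# avg steering figures copied from the table above
AVG_STEERING = {"medium": 0.21, "detailed": 0.20, "comprehensive": 0.55}

ratio = AVG_STEERING["comprehensive"] / AVG_STEERING["medium"]
print(f"{ratio:.1f}x")  # 2.6x
```

the ratio is just the comprehensive row divided by the medium row; the absolute gap is 0.55 - 0.21 = 0.34 steering.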

2. STEERING = SUCCESS SIGNAL, NOT FAILURE

assumption: corrections indicate the conversation is going poorly.

reality: threads WITH steering have HIGHER resolution rates than unsteered threads.

60% resolution for steered threads
37% resolution for unsteered threads
87% of steerings don’t cascade to another steering

why: steering means user is engaged and guiding. unsteered threads are often abandoned before completion. the act of correcting means the user cares enough to continue.

implication: don’t optimize to minimize steering. optimize for steering RECOVERY rate (how often a steering is not followed by another one).
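
one way to make "recovery rate" concrete, as a sketch only. it assumes each user message in a thread has already been labeled as steering or not, and it treats a "cascade" as a steering immediately followed by another steering. both the labeling and that definition are assumptions; the analysis does not spell out how the 87% figure was computed.

```python
from typing import Optional, Sequence


def steering_recovery_rate(is_steering: Sequence[bool]) -> Optional[float]:
    """Fraction of steering messages NOT immediately followed by another steering.

    is_steering[i] is True when the i-th user message was a correction.
    Returns None when the thread contains no steering at all.
    """
    steering_idx = [i for i, flag in enumerate(is_steering) if flag]
    if not steering_idx:
        return None
    recovered = sum(
        1
        for i in steering_idx
        # a steering on the last message has nothing after it, so it counts as recovered
        if i + 1 >= len(is_steering) or not is_steering[i + 1]
    )
    return recovered / len(steering_idx)


# three steerings; the second one cascades into the third, so 2 of 3 recover
flags = [False, True, False, True, True, False]
print(steering_recovery_rate(flags))
```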

3. ORACLE CORRELATES WITH FRUSTRATION (but doesn’t cause it)

assumption: using oracle should improve outcomes by bringing in better reasoning.

reality: 46% of FRUSTRATED threads invoke oracle vs 25% of RESOLVED threads.

why: oracle is reached for when already stuck, not proactively. selection bias—hard tasks both frustrate AND warrant oracle. 8/14 frustrated threads never used oracle at all.

late oracle (>66% into thread) → 82.8% success rate, 0% frustration
early oracle (≤33% into thread) → 78.8% success, 1.4% frustration

implication: oracle timing and intent matter. use it for PLANNING (early-mid) or for validation/review (late), not as a RESCUE once already stuck. late oracle in this data mostly means validation/review, which is safe.
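
the early/late split above can be sketched as a position fraction within the thread. this assumes a 1-based turn index for the oracle call and measures position as turn / total turns; the function name and the "mid" label for the band in between are hypothetical.

```python
def oracle_timing(oracle_turn: int, total_turns: int) -> str:
    """Classify where in a thread the oracle was invoked (1-based turn index)."""
    if total_turns <= 0 or not 1 <= oracle_turn <= total_turns:
        raise ValueError("oracle_turn must be within 1..total_turns")
    position = oracle_turn / total_turns
    if position <= 0.33:
        return "early"   # <=33% into thread
    if position > 0.66:
        return "late"    # >66% into thread
    return "mid"         # hypothetical label for everything in between


print(oracle_timing(3, 30))   # early
print(oracle_timing(25, 30))  # late
```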

4. TERSE AND VERBOSE USERS BOTH SUCCEED (corrected)

assumption: providing more detail helps the agent understand the task.

reality: both styles can work well.

user	avg msg length	resolution rate
@concise_commander	263 chars	60.5%
@patient_pathfinder	293 chars	54.0%
@steady_navigator	547 chars	67.0%
@verbose_explorer	932 chars	83% (corrected)

update: prior analysis incorrectly classified @verbose_explorer’s spawned subagent threads as failures. verbose context actually enables effective spawn orchestration (231 subagents at 97.8% success).

implication: both styles work — terse for socratic iteration, verbose for spawn context.

5. EVENING WORK IS DRAMATICALLY WORSE

assumption: productivity depends on the task, not the clock.

reality: evening (6-9pm) shows 27.5% resolution. late night (2-5am) shows 60.4%.

time block	resolution %
late night (2-5am)	60.4%
morning (6-9am)	59.6%
evening (6-9pm)	27.5%

why: evening = busiest time (peak usage) but also fatigue accumulation. morning and late night = self-selected focus time. evening threads may be more exploratory, speculative, “let me try this” threads that don’t reach closure.

implication: schedule critical work for morning. avoid evening for important tasks. late night works if you’re the type to do late night work.
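
a small sketch of how a thread start time could be mapped to the blocks in the table. it assumes timestamps are in the user's local time and treats each range as half-open (6-9am means 06:00 up to but not including 09:00). the function name and the "other" label are hypothetical, and hours outside the three reported blocks are not covered by the table.

```python
from datetime import datetime


def time_block(started_at: datetime) -> str:
    """Map a local start time to one of the reported time blocks."""
    hour = started_at.hour
    if 2 <= hour < 5:
        return "late night"  # 2-5am
    if 6 <= hour < 9:
        return "morning"     # 6-9am
    if 18 <= hour < 21:
        return "evening"     # 6-9pm
    return "other"           # not reported in the table


print(time_block(datetime(2026, 1, 9, 19, 30)))  # evening
```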

6. WEEKEND WORK OUTPERFORMS WEEKDAY

assumption: weekday focus > weekend side projects.

reality: weekend resolution 48.9% vs weekday 43.7% (+5.2pp premium).

why: fewer interruptions. self-selected important tasks (you don’t work weekends on unimportant stuff). more focused session intent.

implication: if something MUST succeed, consider weekend slot.

7. LOW QUESTION DENSITY = HIGHER RESOLUTION

assumption: asking more questions should clarify intent and improve alignment.

reality: threads with <5% question messages resolve at 76%. threads with >15% questions have lower resolution rates.

density	resolution rate	avg turns
high (>15%)	lower	12.3
low (<5%)	76%	105.6

why: interrogative mode ≠ execution mode. heavy questioning indicates confusion, not collaboration. low-question threads are DOING work, not figuring out what to do.

implication: use questions sparingly. decisive instructions > exploratory questions.
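
a rough sketch of a question-density measure. it assumes density is computed over user messages only and that a message counts as a question when it ends with "?". both are guesses at the method, and the "medium" label for the 5-15% band is hypothetical.

```python
from typing import Sequence


def question_density(user_messages: Sequence[str]) -> float:
    """Share of user messages that end with a question mark."""
    if not user_messages:
        return 0.0
    questions = sum(1 for m in user_messages if m.rstrip().endswith("?"))
    return questions / len(user_messages)


def density_bucket(density: float) -> str:
    """Bucket a density value using the thresholds from the table."""
    if density < 0.05:
        return "low"     # <5%
    if density > 0.15:
        return "high"    # >15%
    return "medium"      # hypothetical label for the band in between


thread = ["fix the parser", "why does this fail?", "run the tests", "ship it"]
print(density_bucket(question_density(thread)))  # high
```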

8. MARATHON THREADS SUCCEED MORE OFTEN

assumption: long threads indicate spinning/struggling.

reality: 26-50 turns = 75% success. <10 turns = 14% success.

@concise_commander: 69% of threads exceed 50 turns, 60% resolution rate
threads abandoned before 10 turns almost never resolve

why: short threads are often abandoned, not completed. complex tasks REQUIRE many turns. persistence correlates with success. the work doesn’t get easier by starting over.

implication: stay longer. don’t bail at first difficulty.

9. COLLABORATIVE OPENERS PRODUCE LONGEST THREADS

assumption: “we” and “let’s” indicate productive partnership.

reality: threads starting with collaborative language (“we”, “let’s”) average 249 messages—the LONGEST threads.

why: collaborative framing often accompanies vague or open-ended tasks. “let’s explore X” ≠ “fix X.” partnership language doesn’t constrain scope.

implication: collaborative ≠ efficient. imperative style (“fix X”) outperforms declarative (“i want X fixed”) and collaborative (“let’s work on X”).

10. TASK DELEGATION CORRELATES WITH FRUSTRATION

assumption: spawning sub-agents should parallelize work and improve outcomes.

reality: 61.5% of frustrated threads use Task vs 40.5% of resolved threads.

why: users reach for Task when confused or overwhelmed, not strategically. over-delegation when scope is unclear. “throw another agent at it” as escape hatch.

optimal: 2-6 Task spawns. beyond that, diminishing returns. spawn depth >10 = abandon risk.

implication: delegate with clear specs, not as panic response.

11. POLITE REQUESTS GET IGNORED MORE

assumption: politeness is neutral or positive for compliance.

reality: 12.7% compliance for polite requests (“please X”) vs 22.8% for direct verbs.

why: models may parse “please X” as softer priority. direct imperatives are unambiguous. politeness adds words that dilute the command.

implication: be direct. “fix the bug” > “please fix the bug if you can.”

12. CONSTRAINTS ARE FREQUENTLY VIOLATED

assumption: saying “only X” should limit agent behavior to X.

reality: 16.4% compliance rate for constraints. prohibitions get lost in multi-step reasoning.

why: “only” and “don’t” statements require maintaining negative constraints across context window. agents optimize for task completion, not constraint satisfaction.

implication: repeat constraints. ask agent to echo them back. monitor for violations.
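
"monitor for violations" could look like the check below for one common case, an "only touch X" constraint on files. this is an illustrative sketch: the function name and the example paths are hypothetical, and it assumes you can get the list of paths the agent modified from somewhere (e.g. your version control status).

```python
from typing import Iterable, List


def constraint_violations(
    allowed_prefixes: Iterable[str], touched_paths: Iterable[str]
) -> List[str]:
    """Return the touched paths that fall outside an 'only X' constraint."""
    allowed = tuple(allowed_prefixes)  # str.startswith accepts a tuple of prefixes
    return [p for p in touched_paths if not p.startswith(allowed)]


# constraint: "only change files under src/parser/" (hypothetical paths)
touched = ["src/parser/lexer.py", "src/cli/main.py"]
print(constraint_violations(["src/parser/"], touched))  # ['src/cli/main.py']
```

a check like this runs outside the model, so it does not depend on the agent remembering the constraint.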

13. COMMITTED THREADS ARE SHORTER THAN RESOLVED ONES

assumption: committing = completing the full task.

reality: avg COMMITTED thread: 57 turns. avg RESOLVED thread: 67.7 turns.

why: commits happen at specific checkpoints, not at task completion. “ship this part” ≠ “task is done.” threads often continue post-commit.

implication: commit early, commit often. don’t wait for “done.”

14. HANDOFFS CLUSTER IN FIRST 10 TURNS

assumption: handoffs happen when threads get stuck late.

reality: 45% of handoffs happen within first 10 turns.

why: early handoffs = task/tool mismatch, scope confusion, “wrong thread.” not failure—appropriate early termination. continuing a mismatched thread is worse than starting fresh.

implication: early bail is sometimes correct. don’t force fit.

summary table

assumption	reality	effect size
more context → better	>1500 chars → 2.6x more steering	+0.34 steering
steering = failure	steered threads resolve 60% vs 37%	+23pp
oracle = rescue	late oracle = best outcomes	82.8% success
verbose = clear	both styles work: terse (263 chars) 60.5%, verbose (932 chars) 83% corrected	no terse advantage after correction
evening = fine	27.5% evening vs 60.4% late-night	-32.9pp
weekday focus	weekend +5.2pp resolution	+5.2pp
questions = alignment	low questions (<5%) = 76% resolution	+15pp
short threads = efficient	<10 turns = 14% success	-61pp vs sweet spot
delegation = parallel	over-delegation correlates with frustration	+21pp frustrated
polite = neutral	direct verbs +10pp compliance	+10pp
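
several effect sizes in the table are plain percentage-point differences and can be rechecked from the figures quoted in the sections above. a short sketch, with a hypothetical helper name; the inputs are the percentages as reported, not raw counts.

```python
def pp_diff(a_percent: float, b_percent: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(float(a_percent) - float(b_percent), 1)


# (a, b) pairs copied from the numbered findings
checks = {
    "steered vs unsteered resolution": pp_diff(60, 37),
    "weekend vs weekday resolution": pp_diff(48.9, 43.7),
    "26-50 turns vs <10 turns success": pp_diff(75, 14),
    "Task use, frustrated vs resolved": pp_diff(61.5, 40.5),
    "direct vs polite compliance": pp_diff(22.8, 12.7),
}

for label, value in checks.items():
    print(f"{label}: {value:+.1f}pp")
```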

compiled from 4,656 threads, 208,799 messages, 20 users, 9 months of data
ann_flickerer | 2026-01-09