counter-intuitive findings
patterns from 4,656 threads that contradict common assumptions about human-AI collaboration.
1. MORE CONTEXT ≠ BETTER OUTCOMES
assumption: longer, more detailed prompts should reduce ambiguity and improve results.
reality: opening messages over 1500 chars draw 2.6x MORE steering (0.55 vs 0.21 per thread) than 300-700 char messages.
| prompt length | avg turns | avg steering |
|---|---|---|
| medium (300-700) | 37.2 | 0.21 |
| detailed (700-1500) | 36.7 | 0.20 |
| comprehensive (>1500) | 71.8 | 0.55 |
why: overwhelming context leads the agent to focus on the wrong details. key points get buried. the agent scope-creeps toward items that were mentioned but not prioritized.
implication: front-load PRIORITY, not VOLUME. 300-1500 chars is the goldilocks zone.
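a minimal sketch of the bucketing behind the table above, assuming length is measured as the character count of the opening message. the function names are hypothetical, the "short" label for under 300 chars is an assumption (the table does not report that range), and treating exactly 700 and 1500 chars as the upper edge of their bucket is also an assumption.

```python
# avg steering per thread, copied from the table above
AVG_STEERING = {"medium": 0.21, "detailed": 0.20, "comprehensive": 0.55}


def length_bucket(opening_message: str) -> str:
    """Map an opening message to a prompt-length bucket by character count."""
    n = len(opening_message)
    if n < 300:
        return "short"  # assumption: not reported in the table
    if n <= 700:
        return "medium"
    if n <= 1500:
        return "detailed"
    return "comprehensive"


def steering_ratio(bucket: str, baseline: str = "medium") -> float:
    """How many times more steering a bucket sees relative to the baseline bucket."""
    return AVG_STEERING[bucket] / AVG_STEERING[baseline]


if __name__ == "__main__":
    print(length_bucket("x" * 1600))
    print(round(steering_ratio("comprehensive"), 1))
```

the ratio is just 0.55 / 0.21, so it only restates the table; the bucketing is the part worth wiring into a pre-send check.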
2. STEERING = SUCCESS SIGNAL, NOT FAILURE
assumption: corrections indicate the conversation is going poorly.
reality: threads WITH steering have HIGHER resolution rates than unsteered threads.
- 60% resolution for steered threads
- 37% resolution for unsteered threads
- 87% of steerings don’t cascade to another steering
why: steering means user is engaged and guiding. unsteered threads are often abandoned before completion. the act of correcting means the user cares enough to continue.
implication: don’t optimize to minimize steering. optimize for steering RECOVERY rate.
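one way to make "steering RECOVERY rate" concrete, as a sketch only. it assumes each user message has already been labeled as steering or not, and it assumes a steering "cascades" when the very next user message is also a steering. both the labeling and that definition are assumptions, not the method used for the 87% figure.

```python
from typing import List, Optional


def steering_recovery_rate(is_steering: List[bool]) -> Optional[float]:
    """Fraction of steering messages NOT immediately followed by another steering.

    is_steering holds one flag per user message, in thread order.
    Returns None when the thread contains no steering at all.
    """
    steer_positions = [i for i, flag in enumerate(is_steering) if flag]
    if not steer_positions:
        return None
    cascaded = sum(
        1
        for i in steer_positions
        if i + 1 < len(is_steering) and is_steering[i + 1]
    )
    return 1 - cascaded / len(steer_positions)


if __name__ == "__main__":
    thread = [False, True, False, True, True, False]
    print(steering_recovery_rate(thread))
```

a steering in the final user message counts as recovered here because nothing follows it; excluding such cases would be an equally defensible choice.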
3. ORACLE CORRELATES WITH FRUSTRATION (but doesn’t cause it)
assumption: using oracle should improve outcomes by bringing in better reasoning.
reality: 46% of FRUSTRATED threads invoke oracle vs 25% of RESOLVED threads.
why: oracle is reached for when already stuck, not proactively. selection bias—hard tasks both frustrate AND warrant oracle. 8/14 frustrated threads never used oracle at all.
late oracle (>66% into thread) → 82.8% success rate, 0% frustration
early oracle (≤33% into thread) → 78.8% success, 1.4% frustration
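implication: oracle timing matters less than intent. late oracle (>66% into thread) has the best numbers here, which fits validation/review use and is safe. the frustration link comes from reaching for oracle as RESCUE once already stuck. use it deliberately for PLANNING (early-mid) or review (late), not as a panic button.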
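the early/late split above can be sketched as a position check. this assumes position is measured as the turn of the first oracle call divided by total turns, with turns counted from 1; the function name and the "mid" label for the unreported middle band are hypothetical.

```python
def oracle_phase(oracle_turn: int, total_turns: int) -> str:
    """Classify where in a thread oracle was first invoked.

    Thresholds follow the text above: <=33% is early, >66% is late.
    """
    if total_turns <= 0 or not 1 <= oracle_turn <= total_turns:
        raise ValueError("oracle_turn must be within 1..total_turns")
    position = oracle_turn / total_turns
    if position <= 0.33:
        return "early"
    if position > 0.66:
        return "late"
    return "mid"  # assumption: label for the band the text does not report


if __name__ == "__main__":
    print(oracle_phase(3, 10))
    print(oracle_phase(7, 10))
```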
4. TERSE AND VERBOSE STYLES BOTH WORK (corrected)
assumption: providing more detail helps the agent understand the task.
reality: both styles can work well.
| user | avg msg length | resolution rate |
|---|---|---|
| @concise_commander | 263 chars | 60.5% |
| @patient_pathfinder | 293 chars | 54.0% |
| @steady_navigator | 547 chars | 67.0% |
| @verbose_explorer | 932 chars | 83% (corrected) |
update: prior analysis incorrectly classified @verbose_explorer’s spawned subagent threads as failures. verbose context actually enables effective spawn orchestration (231 subagents at 97.8% success).
implication: both styles work — terse for socratic iteration, verbose for spawn context.
5. EVENING WORK IS DRAMATICALLY WORSE
assumption: productivity depends on the task, not the clock.
reality: evening (6-9pm) shows 27.5% resolution. late night (2-5am) shows 60.4%.
| time block | resolution % |
|---|---|
| late night (2-5am) | 60.4% |
| morning (6-9am) | 59.6% |
| evening (6-9pm) | 27.5% |
why: evening = busiest time (peak usage) but also fatigue accumulation. morning and late night = self-selected focus time. evening threads may be more exploratory, speculative, “let me try this” threads that don’t reach closure.
implication: schedule critical work for morning. avoid evening for important tasks. late night works if you’re the type to do late night work.
6. WEEKEND WORK OUTPERFORMS WEEKDAY
assumption: weekday focus > weekend side projects.
reality: weekend resolution 48.9% vs weekday 43.7% (+5.2pp premium).
why: fewer interruptions. self-selected important tasks (you don’t work weekends on unimportant stuff). more focused session intent.
implication: if something MUST succeed, consider weekend slot.
7. LOW QUESTION DENSITY = HIGHER RESOLUTION
assumption: asking more questions should clarify intent and improve alignment.
reality: threads with <5% question messages resolve at 76%. threads with >15% questions have lower resolution rates.
| density | resolution rate | avg turns |
|---|---|---|
| high (>15%) | lower | 12.3 |
| low (<5%) | 76% | 105.6 |
why: interrogative mode ≠ execution mode. heavy questioning indicates confusion, not collaboration. low-question threads are DOING work, not figuring out what to do.
implication: use questions sparingly. decisive instructions > exploratory questions.
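a rough sketch of how question density could be measured. it assumes a message counts as a question when it ends with "?", which is a simplification and not necessarily the classifier behind the numbers above. the "medium" label for the 5-15% band is also an assumption, since the table reports only the two extremes.

```python
from typing import List


def question_density(user_messages: List[str]) -> float:
    """Share of user messages that end with a question mark."""
    if not user_messages:
        return 0.0
    questions = sum(1 for msg in user_messages if msg.rstrip().endswith("?"))
    return questions / len(user_messages)


def density_bucket(density: float) -> str:
    """Bucket a density using the <5% and >15% thresholds from the table above."""
    if density < 0.05:
        return "low"
    if density > 0.15:
        return "high"
    return "medium"  # assumption: band not reported in the table


if __name__ == "__main__":
    msgs = ["fix the parser", "why does this fail?", "run the tests", "ship it"]
    d = question_density(msgs)
    print(d, density_bucket(d))
```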
8. MARATHON THREADS SUCCEED MORE OFTEN
assumption: long threads indicate spinning/struggling.
reality: 26-50 turns = 75% success. <10 turns = 14% success.
- @concise_commander: 69% of threads exceed 50 turns, 60.5% resolution rate
- threads abandoned before 10 turns almost never resolve
why: short threads are often abandoned, not completed. complex tasks REQUIRE many turns. persistence correlates with success. the work doesn’t get easier by starting over.
implication: stay longer. don’t bail at first difficulty.
9. COLLABORATIVE OPENERS PRODUCE LONGEST THREADS
assumption: “we” and “let’s” indicate productive partnership.
reality: threads starting with collaborative language (“we”, “let’s”) average 249 messages—the LONGEST threads.
why: collaborative framing often accompanies vague or open-ended tasks. “let’s explore X” ≠ “fix X.” partnership language doesn’t constrain scope.
implication: collaborative ≠ efficient. imperative style (“fix X”) outperforms declarative (“i want X fixed”) and collaborative (“let’s work on X”).
10. TASK DELEGATION CORRELATES WITH FRUSTRATION
assumption: spawning sub-agents should parallelize work and improve outcomes.
reality: 61.5% of frustrated threads use Task vs 40.5% of resolved threads.
why: users reach for Task when confused or overwhelmed, not strategically. over-delegation when scope is unclear. “throw another agent at it” as escape hatch.
optimal: 2-6 Task spawns. beyond that, diminishing returns. spawn depth >10 = abandon risk.
implication: delegate with clear specs, not as panic response.
11. POLITE REQUESTS GET IGNORED MORE
assumption: politeness is neutral or positive for compliance.
reality: 12.7% compliance for polite requests (“please X”) vs 22.8% for direct verbs.
why: models may parse “please X” as softer priority. direct imperatives are unambiguous. politeness adds words that dilute the command.
implication: be direct. “fix the bug” > “please fix the bug if you can.”
12. CONSTRAINTS ARE FREQUENTLY VIOLATED
assumption: saying “only X” should limit agent behavior to X.
reality: 16.4% compliance rate for constraints. prohibitions get lost in multi-step reasoning.
why: “only” and “don’t” statements require maintaining negative constraints across context window. agents optimize for task completion, not constraint satisfaction.
implication: repeat constraints. ask agent to echo them back. monitor for violations.
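the "repeat, echo, monitor" advice could look something like the sketch below. everything here is hypothetical: the prompt wording, both function names, and the idea of monitoring by comparing touched file paths against allowed prefixes are illustrations, not something measured in this analysis.

```python
from typing import List


def build_prompt(task: str, constraints: List[str]) -> str:
    """State constraints before the task, then repeat them with an echo request."""
    block = "\n".join(f"- {c}" for c in constraints)
    return (
        f"constraints:\n{block}\n\n"
        f"task: {task}\n\n"
        f"before starting, restate these constraints in your own words:\n{block}"
    )


def out_of_scope(touched_files: List[str], allowed_prefixes: List[str]) -> List[str]:
    """Return touched files that fall outside every allowed path prefix."""
    return [
        path
        for path in touched_files
        if not any(path.startswith(prefix) for prefix in allowed_prefixes)
    ]


if __name__ == "__main__":
    print(build_prompt("fix the login bug", ["only edit src/auth/"]))
    print(out_of_scope(["src/auth/login.py", "README.md"], ["src/auth/"]))
```

the path check only covers "only touch X" style constraints; behavioral prohibitions would need a different monitor.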
13. COMMITTED THREADS ARE SHORTER THAN RESOLVED ONES
assumption: committing = completing the full task.
reality: avg COMMITTED thread: 57 turns. avg RESOLVED thread: 67.7 turns.
why: commits happen at specific checkpoints, not at task completion. “ship this part” ≠ “task is done.” threads often continue post-commit.
implication: commit early, commit often. don’t wait for “done.”
14. HANDOFFS CLUSTER IN FIRST 10 TURNS
assumption: handoffs happen when threads get stuck late.
reality: 45% of handoffs happen within first 10 turns.
why: early handoffs = task/tool mismatch, scope confusion, “wrong thread.” not failure—appropriate early termination. continuing a mismatched thread is worse than starting fresh.
implication: early bail is sometimes correct. don’t force fit.
summary table
| assumption | reality | effect size |
|---|---|---|
| more context → better | >1500 chars → 2.6x more steering | +0.34 steering |
| steering = failure | steered threads resolve 60% vs 37% | +23pp |
| oracle = rescue | late oracle = best outcomes | 82.8% success |
| verbose = clear | both styles work: terse (263 chars) 60.5%, verbose (932 chars) 83% after correction | no terse advantage |
| evening = fine | 27.5% evening vs 60.4% late-night | -32.9pp |
| weekday focus | weekend +5.2pp resolution | +5.2pp |
| questions = alignment | low questions (<5%) = 76% resolution | +15pp |
| short threads = efficient | <10 turns = 14% success | -61pp vs sweet spot |
| delegation = parallel | over-delegation correlates with frustration | +21pp frustrated |
| polite = neutral | direct verbs +10pp compliance | +10pp |
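
the effect sizes can be sanity-checked by recomputing them from the figures quoted in the individual findings. the sketch below does only that arithmetic; the dictionary keys are hypothetical labels, and rows whose inputs are not fully stated in the findings (such as the question-density delta) are left out rather than guessed.

```python
def effect_sizes() -> dict:
    """Recompute deltas and ratios from the numbers quoted in the findings."""
    return {
        # finding 1: avg steering, comprehensive vs medium prompts
        "steering_ratio": round(0.55 / 0.21, 1),
        "steering_delta": round(0.55 - 0.21, 2),
        # finding 2: resolution, steered vs unsteered threads (pp)
        "steered_vs_unsteered_pp": 60 - 37,
        # finding 5: resolution, evening vs late night (pp)
        "evening_vs_late_night_pp": round(27.5 - 60.4, 1),
        # finding 6: resolution, weekend vs weekday (pp)
        "weekend_vs_weekday_pp": round(48.9 - 43.7, 1),
        # finding 8: success, <10 turns vs 26-50 turns (pp)
        "short_vs_sweet_spot_pp": 14 - 75,
        # finding 10: Task usage, frustrated vs resolved threads (pp)
        "task_frustrated_vs_resolved_pp": round(61.5 - 40.5, 1),
        # finding 11: compliance, direct verbs vs polite requests (pp)
        "direct_vs_polite_pp": round(22.8 - 12.7, 1),
    }


if __name__ == "__main__":
    for name, value in effect_sizes().items():
        print(f"{name}: {value}")
```
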
compiled from 4,656 threads, 208,799 messages, 20 users, 9 months of data
ann_flickerer | 2026-01-09