complexity estimation from opener characteristics
analysis of 4,281 threads to predict thread complexity (length, steering) from first-message features.
key finding: complexity is predictable from openers
opener characteristics correlate strongly with thread outcomes. specific signals predict both thread length and steering requirements.
strongest complexity predictors
| feature | avg turns WITH | avg turns WITHOUT | delta | signal direction |
|---|---|---|---|---|
| is_collaborative (“we”, “let’s”) | 91.9 | 47.4 | +44.5 | long threads |
| is_directive (“you”, “your”) | 69.1 | 48.4 | +20.7 | long threads |
| has_url | 35.1 | 50.8 | -15.7 | short threads |
| is_polite (“please”) | 36.4 | 51.1 | -14.7 | short threads |
| has_code_block | 61.7 | 47.7 | +14.1 | long threads |
| has_file_ref | 56.7 | 39.2 | +17.4 | long threads |
interpretation
- collaborative framing (“let’s”, “we”) predicts marathons: 91.9 avg turns vs 47.4 without. these openers imply iterative work.
- directive framing (“you are X”) predicts longer threads (69.1 avg). these are typically spawned sub-agents handed complex tasks.
- polite framing (“please X”) predicts SHORT threads (36.4 avg): simple requests, quick resolution.
- URL presence predicts shorter threads (35.1 avg): often research/reading tasks, not implementation.
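a minimal sketch of how these boolean opener features could be derived from a thread's first message - feature names match the tables in this doc, but the keyword lists and regexes are illustrative assumptions, not the actual extraction rules:

```python
import re

def extract_opener_features(opener: str) -> dict:
    """derive boolean opener features from a thread's first message (keyword lists are illustrative)."""
    text = opener.lower()
    return {
        "is_collaborative": bool(re.search(r"\b(we|let[’']?s)\b", text)),
        "is_directive": bool(re.search(r"\b(you|your)\b", text)),
        "is_polite": "please" in text,
        "is_question": "?" in opener,
        "has_url": bool(re.search(r"https?://", text)),
        "has_code_block": bool(re.search(r"`{3}", opener)),
        "has_file_ref": bool(re.search(r"\b[\w./-]+\.(py|ts|js|md|json|ya?ml|toml)\b", text)),
        "has_list": bool(re.search(r"^\s*([-*]|\d+\.)\s", opener, re.MULTILINE)),
        "mentions_test": "test" in text,
        "has_continuing": text.startswith("continuing"),
    }
```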
first word as complexity signal
| first word | count | avg turns | avg steering rate |
|---|---|---|---|
| we’re | 24 | 133.7 | 0.0135 |
| your | 20 | 129.3 | 0.0178 |
| let’s | 45 | 114.4 | 0.0175 |
| summarize | 41 | 83.2 | 0.0124 |
| implement | 35 | 74.1 | 0.0064 |
| continuing | 1,502 | 53.8 | 0.0100 |
| please | 667 | 36.4 | 0.0049 |
| migrate | 33 | 17.1 | n/a |
| using | 34 | 17.1 | n/a |
complexity tiers by first word
marathon signals (100+ avg turns):
- “we’re” (133.7) - session framing, extended work
- “your” (129.3) - spawned agent instructions
- “let’s” (114.4) - collaborative iteration
medium signals (50-100 avg turns):
- “summarize” (83.2) - research + synthesis
- “implement” (74.1) - feature work
- “review” (56.4) - review cycles
quick signals (<40 avg turns):
- “please” (36.4) - polite quick requests
- “migrate” (17.1) - scripted/scoped tasks
- “using” (17.1) - tool-specific queries
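for back-of-the-envelope estimates, the tiers can be folded into a lookup keyed on the opener's first word - the values are the averages reported above; the fallback default (roughly the corpus-wide average) and the function name are assumptions:

```python
# expected turns keyed by opener first word (averages from the table above)
FIRST_WORD_TURNS = {
    "we're": 133.7, "your": 129.3, "let's": 114.4,         # marathon signals
    "summarize": 83.2, "implement": 74.1, "review": 56.4,  # medium signals
    "continuing": 53.8, "please": 36.4,                    # quick-to-medium
    "migrate": 17.1, "using": 17.1,                        # very quick
}

def expected_turns(opener: str, default: float = 50.0) -> float:
    """look up expected thread length from the opener's first word (default ~ corpus average, assumed)."""
    words = opener.lower().split()
    return FIRST_WORD_TURNS.get(words[0].strip(".,:"), default) if words else default
```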
opener length vs complexity
| length bucket | count | avg turns | avg steering |
|---|---|---|---|
| tiny (<100 chars) | 504 | 49.9 | 0.0119 |
| short (100-300) | 925 | 44.5 | 0.0112 |
| medium (300-600) | 767 | 36.8 | 0.0058 |
| long (600-1500) | 956 | 35.6 | 0.0061 |
| verbose (1500+) | 1,129 | 71.0 | 0.0140 |
sweet spot: 300-1500 chars
- lowest steering rate (0.58-0.61%)
- shortest threads (35-37 avg turns)
- enough context to be clear, not so much to create confusion
u-shaped curve
- tiny prompts → medium threads + higher steering (ambiguous)
- medium prompts → shortest threads + lowest steering (goldilocks)
- verbose prompts → longest threads + highest steering (overwhelming context or complex tasks)
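the length buckets in the tables above are easy to reproduce; a small helper (bucket boundaries from the table, function name illustrative):

```python
def length_bucket(opener: str) -> str:
    """map opener length in characters onto the buckets used above."""
    n = len(opener)
    if n < 100:
        return "tiny"
    if n < 300:
        return "short"
    if n < 600:
        return "medium"
    if n < 1500:
        return "long"
    return "verbose"
```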
feature prevalence by complexity bucket
| feature | tiny (1-10 turns) | small (11-25) | medium (26-50) | large (51-100) | marathon (100+) |
|---|---|---|---|---|---|
| has_file_ref | 35.6% | 53.5% | 65.5% | 70.2% | 64.3% |
| has_continuing | 33.4% | 24.8% | 30.2% | 45.5% | 44.2% |
| is_polite | 15.1% | 19.0% | 22.8% | 14.0% | 6.4% |
| is_collaborative | 1.5% | 2.3% | 2.4% | 5.1% | 6.1% |
| mentions_test | 43.6% | 42.9% | 54.3% | 63.4% | 64.0% |
| has_list | 39.4% | 42.0% | 45.1% | 55.0% | 52.0% |
patterns
- file refs increase with complexity - peaks at large (70.2%), still high in marathon (64.3%)
- politeness drops off in the longest threads - peaks at 22.8% in medium, falls to 14.0% in large and 6.4% in marathon
- collaborative language increases with complexity - 1.5% tiny → 6.1% marathon
- test mentions increase with complexity - complex tasks involve more testing
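a sketch of how this prevalence table could be reproduced with pandas, assuming a per-thread dataframe with a `turns` column and one boolean column per opener feature (column names are assumptions):

```python
import pandas as pd

def prevalence_by_bucket(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """share of threads with each opener feature, grouped by thread-length bucket."""
    buckets = pd.cut(
        df["turns"],
        bins=[0, 10, 25, 50, 100, float("inf")],
        labels=["tiny", "small", "medium", "large", "marathon"],
    )
    # mean of a boolean column within a bucket = prevalence of that feature
    return df.groupby(buckets, observed=True)[features].mean().mul(100).round(1)
```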
steering predictors
| feature | steering WITH | steering WITHOUT | delta |
|---|---|---|---|
| is_collaborative | 0.0169 | 0.0097 | +74% |
| is_polite | 0.0049 | 0.0108 | -55% |
| is_directive | 0.0063 | 0.0100 | -37% |
| has_file_ref | 0.0116 | 0.0078 | +49% |
| is_question | 0.0137 | 0.0097 | +41% |
insights
- polite openers reduce steering by 55% - clear intent, agent knows what to do
- collaborative framing increases steering by 74% - implies back-and-forth, more intervention
- questions increase steering by 41% - exploratory threads need more guidance
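the with/without deltas come from a simple comparison of mean steering rates; a minimal sketch, assuming the same per-thread dataframe plus a `steering_rate` column (an assumption):

```python
import pandas as pd

def steering_delta(df: pd.DataFrame, feature: str) -> float:
    """percent change in mean steering rate for threads WITH a boolean feature vs WITHOUT."""
    with_rate = df.loc[df[feature], "steering_rate"].mean()
    without_rate = df.loc[~df[feature], "steering_rate"].mean()
    return (with_rate / without_rate - 1) * 100

# e.g. steering_delta(df, "is_collaborative") -> ~ +74
```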
practical complexity estimation heuristic
```python
def estimate_complexity(first_word, length, has_file_ref, is_collaborative, is_polite):
    """build a human-readable turn estimate from opener signals (deltas from the tables above)."""
    if first_word in ("we're", "your", "let's"):
        expect = "marathon (100+ turns)"
    elif first_word == "please":
        expect = "quick (30-40 turns)"
    elif first_word == "continuing":
        expect = "medium-long (50-60 turns)"
    elif first_word in ("migrate", "using"):
        expect = "very quick (<20 turns)"
    else:
        expect = "baseline (~50 turns)"  # assumed corpus-wide baseline for unlisted first words

    if length > 1500:
        expect += " +15 turns (verbose penalty)"
    elif 300 < length < 1500:
        expect += " -10 turns (sweet spot)"
    if has_file_ref:
        expect += " +17 turns (file ref)"
    if is_collaborative:
        expect += " +44 turns (collaborative)"
    if is_polite:
        expect += " -15 turns (polite)"
    return expect
```
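example use (input values are hypothetical):

```python
print(estimate_complexity("let's", length=820, has_file_ref=True,
                          is_collaborative=True, is_polite=False))
# -> "marathon (100+ turns) -10 turns (sweet spot) +17 turns (file ref) +44 turns (collaborative)"
```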
recommendations for prompt design
- want quick resolution? start with “please”, keep under 600 chars
- expect iteration? use collaborative language (“let’s”, “we”) and budget for marathon
- spawning agents? “your” framing predicts long threads (129 avg) - scope carefully
- sweet spot for context: 300-1500 chars, include file refs, structured lists
data quality notes
- 4,281 threads analyzed with opener extraction
- steering/approval counts from labeling pass
- some threads lack content files (excluded from analysis)
- “continuing” threads (35% of corpus) are continuations, which may inflate their turn counts