assistant brevity analysis
dataset: 18,676 assistant→user message pairs across 4,656 threads
key finding: medium-length responses (1-3k chars) get the best approval rate
| assistant message length | approval rate | steering rate | n |
|---|---|---|---|
| short (<1k chars) | 13.4% | 7.3% | 15,321 |
| medium (1-3k chars) | 16.3% | 6.7% | 3,122 |
| long (>3k chars) | 15.9% | 9.4% | 233 |
the sweet spot appears to be 1-3k characters. shorter isn't necessarily better: medium responses get ~22% more approvals than short ones in relative terms (16.3% vs 13.4%).
long responses show elevated steering (9.4% vs 6.7% for medium), suggesting users correct overly verbose replies.
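a minimal pandas sketch of the bucketing above, for reproducibility; the `pairs.csv` export and the column names (`assistant_chars`, `user_response`) are assumptions, not the dataset's actual schema:

```python
import pandas as pd

# hypothetical export of the 18,676 assistant→user pairs
pairs = pd.read_csv("pairs.csv")

# bucket each assistant message by character count
bins = [0, 1_000, 3_000, float("inf")]
labels = ["short (<1k)", "medium (1-3k)", "long (>3k)"]
pairs["length_bucket"] = pd.cut(pairs["assistant_chars"], bins=bins, labels=labels)

# approval/steering rate = share of pairs whose user response got that label
rates = pairs.groupby("length_bucket", observed=True).agg(
    approval_rate=("user_response", lambda s: (s == "APPROVAL").mean()),
    steering_rate=("user_response", lambda s: (s == "STEERING").mean()),
    n=("user_response", "size"),
)
print(rates)
```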
message length preceding different user response types
| user response | avg chars preceding | median | count |
|---|---|---|---|
| APPROVAL | 713 | 467 | 2,597 |
| QUESTION | 646 | 442 | 4,035 |
| STEERING | 632 | 321 | 1,350 |
| NEUTRAL | 573 | 323 | 10,648 |
approvals follow the LONGEST messages of any response type (avg 713 chars, median 467). this contradicts the naive "shorter is better" intuition: users approve when they get sufficient detail.
steering follows messages with a lower median (321) but a similar average (632), suggesting high variance: steering happens after both very short (insufficient) and very long (excessive) responses.
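this table is the same frame grouped the other way; a sketch under the same assumed schema:

```python
import pandas as pd

pairs = pd.read_csv("pairs.csv")  # same hypothetical export as above

# length of the assistant message preceding each user response type
by_response = pairs.groupby("user_response")["assistant_chars"].agg(
    avg_chars="mean",
    median_chars="median",
    count="size",
)
print(by_response.sort_values("count", ascending=False).round(0))
```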
thread-level outcomes by avg assistant length
| avg length bucket | threads | steering/thread | approval/thread | resolved % |
|---|---|---|---|---|
| <500 | 1,868 | 0.15 | 0.37 | 32% |
| 500-1k | 1,969 | 0.47 | 0.89 | 54% |
| 1k-2k | 682 | 0.37 | 0.64 | 51% |
| 2k-5k | 127 | 0.22 | 0.45 | 42% |
| 5k+ | 10 | 0.40 | 0.20 | 30% |
500-1k is the sweet spot for threads:
- highest approval rate per thread (0.89)
- highest resolution rate (54%)
- highest steering per thread (0.47), which may simply track engagement rather than dissatisfaction
very short responses (<500 avg) correlate with low engagement (0.37 approvals, only 32% resolved). users might abandon threads that feel too terse.
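a sketch of the thread-level rollup; `thread_id` and a per-pair `resolved` flag are assumed column names:

```python
import pandas as pd

pairs = pd.read_csv("pairs.csv")  # same hypothetical export as above

# collapse pairs to one row per thread
threads = pairs.groupby("thread_id").agg(
    avg_len=("assistant_chars", "mean"),
    steering=("user_response", lambda s: (s == "STEERING").sum()),
    approvals=("user_response", lambda s: (s == "APPROVAL").sum()),
    resolved=("resolved", "max"),  # assumed flag: any resolved pair marks the thread resolved
)

# bucket threads by their average assistant message length
bins = [0, 500, 1_000, 2_000, 5_000, float("inf")]
labels = ["<500", "500-1k", "1k-2k", "2k-5k", "5k+"]
threads["bucket"] = pd.cut(threads["avg_len"], bins=bins, labels=labels)

summary = threads.groupby("bucket", observed=True).agg(
    threads=("avg_len", "size"),
    steering_per_thread=("steering", "mean"),
    approval_per_thread=("approvals", "mean"),
    resolved_frac=("resolved", "mean"),  # fraction; ×100 for the % column above
)
print(summary)
```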
implications
- brevity is not king: mid-length output beats both extremes (1-3k chars per message, ~500-1k chars averaged across a thread, roughly 100-200 words)
- steering correlates with extremes: both too-short and too-long responses trigger corrections
- approval follows substance: users approve when they feel they got enough information
- the thread-level "sweet spot" is ~500-1000 chars average; threads in that bucket have the best outcomes
caveats
- correlation not causation: harder tasks might require longer responses AND cause more steering
- message length might be confounded with task type (debugging vs quick lookup)
- labels are heuristic-based, not human-validated