assistant brevity analysis
dataset: 18,676 assistant→user message pairs across 4,656 threads
key finding: medium-length responses (1-3k chars) get the best approval rate
| assistant message length | approval rate | steering rate | n |
|---|---|---|---|
| short (<1k chars) | 13.4% | 7.3% | 15,321 |
| medium (1-3k chars) | 16.3% | 6.7% | 3,122 |
| long (>3k chars) | 15.9% | 9.4% | 233 |
the sweet spot appears to be 1-3k characters. shorter isn't necessarily better: medium responses get ~22% more approvals than short ones in relative terms (16.3% vs 13.4%).
long responses show elevated steering (9.4% vs 6.7% for medium), suggesting users correct overly verbose replies.
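a minimal pandas sketch of the bucketing above, for reproducibility; the `pairs.csv` export and the column names (`assistant_chars`, `user_response`) are assumptions, not the dataset's actual schema:

```python
import pandas as pd

# hypothetical export of the 18,676 assistant→user pairs
pairs = pd.read_csv("pairs.csv")

# bucket each assistant message by character count
bins = [0, 1_000, 3_000, float("inf")]
labels = ["short (<1k)", "medium (1-3k)", "long (>3k)"]
pairs["length_bucket"] = pd.cut(pairs["assistant_chars"], bins=bins, labels=labels)

# approval/steering rate = share of pairs whose user response got that label
rates = pairs.groupby("length_bucket", observed=True).agg(
    approval_rate=("user_response", lambda s: (s == "APPROVAL").mean()),
    steering_rate=("user_response", lambda s: (s == "STEERING").mean()),
    n=("user_response", "size"),
)
print(rates)
```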
message length preceding different user response types
| user response | avg chars preceding | median | count |
|---|---|---|---|
| APPROVAL | 713 | 467 | 2,597 |
| QUESTION | 646 | 442 | 4,035 |
| STEERING | 632 | 321 | 1,350 |
| NEUTRAL | 573 | 323 | 10,648 |
approvals follow the LONGEST messages of any response type (avg 713 chars, median 467). this contradicts the naive "shorter is better" intuition: users approve when they get sufficient detail.
steering follows messages with a lower median (321) but a similar average (632), suggesting high variance: steering happens after both very short (insufficient) and very long (excessive) responses.
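this table is the same frame grouped the other way; a sketch under the same assumed schema:

```python
import pandas as pd

pairs = pd.read_csv("pairs.csv")  # same hypothetical export as above

# length of the assistant message preceding each user response type
by_response = pairs.groupby("user_response")["assistant_chars"].agg(
    avg_chars="mean",
    median_chars="median",
    count="size",
)
print(by_response.sort_values("count", ascending=False).round(0))
```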
thread-level outcomes by avg assistant length
| avg length bucket | threads | steering/thread | approval/thread | resolved % |
|---|---|---|---|---|
| <500 | 1,868 | 0.15 | 0.37 | 32% |
| 500-1k | 1,969 | 0.47 | 0.89 | 54% |
| 1k-2k | 682 | 0.37 | 0.64 | 51% |
| 2k-5k | 127 | 0.22 | 0.45 | 42% |
| 5k+ | 10 | 0.40 | 0.20 | 30% |
500-1k is the sweet spot for threads:
- highest approval rate per thread (0.89)
- highest resolution rate (54%)
- highest steering per thread (0.47), which may simply track engagement rather than dissatisfaction
very short responses (<500 avg) correlate with low engagement (0.37 approvals, only 32% resolved). users might abandon threads that feel too terse.
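a sketch of the thread-level rollup; `thread_id` and a per-pair `resolved` flag are assumed column names:

```python
import pandas as pd

pairs = pd.read_csv("pairs.csv")  # same hypothetical export as above

# collapse pairs to one row per thread
threads = pairs.groupby("thread_id").agg(
    avg_len=("assistant_chars", "mean"),
    steering=("user_response", lambda s: (s == "STEERING").sum()),
    approvals=("user_response", lambda s: (s == "APPROVAL").sum()),
    resolved=("resolved", "max"),  # assumed flag: any resolved pair marks the thread resolved
)

# bucket threads by their average assistant message length
bins = [0, 500, 1_000, 2_000, 5_000, float("inf")]
labels = ["<500", "500-1k", "1k-2k", "2k-5k", "5k+"]
threads["bucket"] = pd.cut(threads["avg_len"], bins=bins, labels=labels)

summary = threads.groupby("bucket", observed=True).agg(
    threads=("avg_len", "size"),
    steering_per_thread=("steering", "mean"),
    approval_per_thread=("approvals", "mean"),
    resolved_frac=("resolved", "mean"),  # fraction; ×100 for the % column above
)
print(summary)
```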
implications
- brevity is not king: mid-length output beats both extremes (1-3k chars per message, ~500-1k chars averaged across a thread, roughly 100-200 words)
- steering correlates with extremes: both too-short and too-long responses trigger corrections
- approval follows substance: users approve when they feel they got enough information
- the thread-level "sweet spot" is ~500-1000 chars average; threads in that bucket have the best outcomes
caveats
- correlation not causation: harder tasks might require longer responses AND cause more steering
- message length might be confounded with task type (debugging vs quick lookup)
- labels are heuristic-based, not human-validated