signal strength ranking
predictive power for thread resolution, ranked by effect size and reliability.
tier 1: STRONG PREDICTORS (>20pp effect)
| signal | effect | evidence |
|---|---|---|
| approval:steering ratio | >4:1 → COMMITTED, <1:1 → FRUSTRATED | clearest single predictor; maps directly to outcome buckets |
| file references in opener | +25pp success (66.7% vs 41.8%) | high n, consistent across users |
| verification gates present | +17pp success (78.2% vs 61.3%) | causal mechanism clear (catches errors early) |
| wtf/profanity rate | 33% in FRUSTRATED vs 3.5% in RESOLVED | ~10x difference; lagging indicator but strong |
| consecutive steerings | 2+ = doom spiral predictor | precedes frustration by 2-5 turns; actionable |
tier 2: MODERATE PREDICTORS (10-20pp effect)
| signal | effect | evidence |
|---|---|---|
| interrogative prompting style | 69.3% vs 46.4% (directive) | +23pp but confounded with user skill |
| thread length 26-50 turns | 75% success (sweet spot) | shorter or longer both hurt; inverted-U curve |
| task delegation 2-6 per thread | 77-79% resolution | 11+ tasks → 58%; diminishing returns |
| agent shortcut detection | earliest frustration signal (2-5 turns ahead) | LEADING indicator, hard to operationalize |
| steering presence (any) | 60% vs 37% without steering | steering = engagement, not failure |
tier 3: WEAK BUT CONSISTENT (5-10pp effect)
| signal | effect | evidence |
|---|---|---|
| time of day | 60%+ (2-5am, 6-9am) vs 27.5% (6-9pm) | +33pp spread, but confounded with user/task type |
| weekend premium | +5.2pp vs weekday | consistent but small |
| prompt length 300-1500 chars | 0.20-0.21 steering rate (lowest) | optimal information density |
| question density <5% | 76% success | low question density suggests clear task framing |
tier 4: CONTEXTUAL SIGNALS (effect depends on situation)
| signal | context | notes |
|---|---|---|
| oracle usage | higher in FRUSTRATED (46% vs 25%) | rescue tool, not planning tool; signal of struggle |
| thread length >100 turns | marathon debugging | increases frustration risk but not deterministic |
| opening word patterns | "please" → 100% resolution, "im"/"following:" → frustration | high variance, small n on some |
| user archetype | @concise_commander 60.5%, @verbose_explorer 83% (corrected) | user skill confounds task difficulty |
tier 5: TRAILING/DIAGNOSTIC (post-hoc only, not predictive)
| signal | use case |
|---|---|
| closing ritual type | post-hoc classification only |
| COMMITTED thread length | 40% shorter than RESOLVED; confirms efficiency |
| orphaned spawn rate (62.5%) | process smell, not resolution predictor |
| error suppression rate (71.6%) | agent behavior audit, not live prediction |
actionable hierarchy
for REAL-TIME intervention:
- watch approval:steering ratio (tier 1)
- detect consecutive steerings (tier 1)
- check for verification gates (tier 1)
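the three tier-1 monitors above can be sketched as one streaming check. a minimal sketch, assuming a per-thread event stream with hypothetical labels (`approval`, `steering`, `verification_gate`); the thresholds come from the tiers above, the labels and function name do not.

```python
from collections import Counter

def realtime_flags(events, ratio_floor=1.0, streak_limit=2):
    """Scan a thread's event stream (hypothetical labels: 'approval',
    'steering', 'verification_gate') and emit tier-1 warning flags."""
    counts = Counter(events)
    flags = []
    # approval:steering ratio — <1:1 tracks FRUSTRATED outcomes
    if counts["steering"] and counts["approval"] / counts["steering"] < ratio_floor:
        flags.append("low approval:steering ratio")
    # consecutive steerings — 2+ precedes frustration by 2-5 turns
    streak = 0
    for e in events:
        streak = streak + 1 if e == "steering" else 0
        if streak >= streak_limit:
            flags.append("doom spiral: consecutive steerings")
            break
    # verification gates — absence costs ~17pp success
    if counts["verification_gate"] == 0:
        flags.append("no verification gates")
    return flags
```

running this per turn (rather than at thread end) is what makes the consecutive-steerings signal actionable: it fires 2-5 turns before frustration shows up in the text.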
for PROMPT ENGINEERING:
- include file references (tier 1)
- use interrogative style (tier 2)
- target 300-1500 chars (tier 3)
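the prompt-engineering levers reduce to a pre-send lint. a sketch only: the file-reference regex, the function name, and treating any `?` as interrogative framing are illustrative assumptions, not part of the analysis.

```python
import re

def lint_prompt(prompt: str) -> list[str]:
    """Hypothetical pre-send checks against the tier 1-3 prompt signals."""
    warnings = []
    # tier 1: file references in the opener (+25pp success)
    if not re.search(r"[\w./-]+\.\w{1,4}\b", prompt):
        warnings.append("no file reference")
    # tier 3: 300-1500 chars had the lowest steering rate
    if not 300 <= len(prompt) <= 1500:
        warnings.append("length outside 300-1500 chars")
    # tier 2: interrogative style outperformed directive (+23pp)
    if "?" not in prompt:
        warnings.append("no interrogative framing")
    return warnings
```

note the regex will false-positive on abbreviations like "e.g."; a production version would want a stricter path pattern.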
for AGENT CONFIGURATION:
- enforce verification gates
- limit task delegation to 2-6
- discourage oracle as rescue tool
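the delegation limit can be enforced mechanically. a sketch, assuming a per-thread counter; the class name and the default cap of 6 are assumptions drawn from the 2-6 sweet spot above (resolution fell to 58% at 11+ tasks).

```python
class DelegationBudget:
    """Illustrative per-thread guard for the 2-6 task delegation band."""

    def __init__(self, cap: int = 6):
        self.cap = cap
        self.spawned = 0

    def try_spawn(self) -> bool:
        # returns False once the thread hits the cap, nudging the agent
        # to finish in-context instead of delegating further
        if self.spawned >= self.cap:
            return False
        self.spawned += 1
        return True
```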
confidence notes
- tier 1 signals have both high effect size AND mechanistic explanation
- tier 2 signals have effect size but potential confounds
- tier 3-4 require larger n or controlled experiments to confirm causality
- user archetype effects likely confounded with task complexity selection