open questions: gaps in the analysis
the analysis is extensive (4,656 threads, 208,799 messages, ~100 insight files), but significant gaps remain, organized below by severity.
CAUSAL DIRECTION UNKNOWN
these correlations are documented but causation is unclear:
1. oracle usage → frustration
- finding: 46% of FRUSTRATED threads use oracle vs 25% of RESOLVED
- open question: does oracle usage CAUSE worse outcomes, or do users reach for oracle BECAUSE they’re already stuck?
- implication: if oracle-early helps, current guidance (“use oracle for planning”) is right. if oracle is just a marker, guidance is misleading.
- test needed: an A/B test of forced oracle usage at thread start vs organic usage (readout sketch below)
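if that A/B ever runs, the readout is a plain two-proportion comparison. a minimal sketch, assuming per-arm resolved counts and totals (all names and numbers are placeholders):

```python
# compare resolution rates between a forced-oracle-at-start arm and an
# organic-usage arm; counts are placeholders to fill from the experiment
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

def compare_arms(resolved_forced, n_forced, resolved_organic, n_organic):
    stat, p_value = proportions_ztest(
        count=[resolved_forced, resolved_organic],
        nobs=[n_forced, n_organic],
    )
    ci_forced = proportion_confint(resolved_forced, n_forced, method="wilson")
    ci_organic = proportion_confint(resolved_organic, n_organic, method="wilson")
    return p_value, ci_forced, ci_organic
```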
2. terse style → success
- finding: concise_commander’s terse style (263 chars) correlates with 60% resolution
- open question: does terse prompting CAUSE success, or do expert users happen to be terse?
- implication: if terse = skill proxy, telling novices to be terse won’t help
- test needed: within-user analysis of terse vs verbose prompts for the same task type (sketch below)
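one way to run the within-user test, as a minimal sketch; the dataframe columns (user, task_type, opener_chars, resolved) and the 300-char cutoff are assumptions, not the real schema:

```python
import pandas as pd

def within_user_terse_effect(threads: pd.DataFrame, cutoff: int = 300) -> pd.DataFrame:
    # threads: one row per thread with user, task_type, opener_chars, resolved (0/1)
    df = threads.copy()
    df["terse"] = df["opener_chars"] < cutoff
    rates = (
        df.groupby(["user", "task_type", "terse"])["resolved"]
          .mean()
          .unstack("terse")
          .rename(columns={True: "terse_rate", False: "verbose_rate"})
          .dropna()  # keep only user/task cells that contain both styles
    )
    # delta near zero within users suggests terseness is mostly a skill proxy
    rates["delta"] = rates["terse_rate"] - rates["verbose_rate"]
    return rates
```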
3. time-of-day effects
- finding: 60% resolution at 2-5am vs 27.5% at 6-9pm
- open question: is this about TIME (cognitive fatigue) or USER COMPOSITION (who works late)?
- implication: current “avoid evening” advice may be confounded
- test needed: per-user time-of-day analysis to control for user effects (sketch below)
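a minimal sketch of that control: put user fixed effects in a logistic model so "who works late" is absorbed, leaving the within-user time signal. column names (user, hour, resolved) are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

def time_of_day_effect(threads: pd.DataFrame):
    # threads: one row per thread with user, hour (0-23), resolved (0/1)
    df = threads.copy()
    df["hour_bucket"] = pd.cut(
        df["hour"], bins=[0, 6, 12, 18, 24], right=False,
        labels=["night", "morning", "afternoon", "evening"],
    )
    # if the hour_bucket coefficients shrink toward zero once C(user) is
    # included, the raw 2-5am vs 6-9pm gap was mostly user composition
    model = smf.logit("resolved ~ C(hour_bucket) + C(user)", data=df)
    return model.fit(disp=False)
```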
4. steering = engagement
- finding: 60% resolution with steering vs 37% without
- open question: does steering CAUSE success (correction mechanism), or is steering a proxy for user engagement/persistence?
- implication: affects whether we should encourage more steering or just view it as noise
SAMPLE SIZE CONCERNS
5. FRUSTRATED sample is tiny (n=14)
- the entire failure autopsy is based on 14 threads
- patterns like TEST_WEAKENING, HACKING_AROUND may be anecdotal
- question: are there more frustrated threads mislabeled as UNKNOWN or HANDOFF?
- question: what’s the false negative rate of the frustration detector?
6. low-activity user patterns are speculation
- users with <50 threads (feature_lead, precision_pilot, patient_pathfinder) have thin data
- “archetype” assignments for these users may be overfitting to noise
- question: how stable are these patterns with more data?
7. skill usage is near-zero for most skills
- digskill: 1 invocation (literally ONE)
- write, document, clean-copy: single digits
- question: is the “skills are underutilized” finding real, or are skills just bad?
METHODOLOGY GAPS
8. outcome labeling is heuristic
- RESOLVED/COMMITTED/FRUSTRATED assigned by keyword detection + turn count heuristics
- no manual validation of labels (no ground truth audit)
- question: what’s the precision/recall of each label? (audit sketch below)
- question: how many RESOLVED threads actually failed after the thread ended?
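a minimal audit sketch, assuming each thread record carries its heuristic label and receives a human label after review (field names are illustrative):

```python
import random

def stratified_sample(threads, per_label=30, seed=0):
    # draw up to per_label threads from each heuristic label for manual review
    random.seed(seed)
    by_label = {}
    for t in threads:
        by_label.setdefault(t["heuristic_label"], []).append(t)
    return [t for ts in by_label.values()
            for t in random.sample(ts, min(per_label, len(ts)))]

def precision_recall(audited, label):
    tp = sum(t["heuristic_label"] == label and t["human_label"] == label for t in audited)
    fp = sum(t["heuristic_label"] == label and t["human_label"] != label for t in audited)
    fn = sum(t["heuristic_label"] != label and t["human_label"] == label for t in audited)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    # caveat: recall from a per-label stratified sample is biased unless
    # reweighted by each label's prevalence in the full corpus
    return precision, recall
```

the same audit sample would also bound the frustration detector's false negative rate from item 5 (FRUSTRATED threads hiding under UNKNOWN or HANDOFF).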
9. “success” definition is thread-bounded
- a thread can be RESOLVED but:
  - the code it produced may have been reverted
  - the solution may have introduced new bugs
  - the user may have re-opened a new thread on the same issue
- question: what % of RESOLVED threads have follow-up threads on the same problem? (detection sketch below)
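a crude way to estimate that percentage, assuming thread records expose user, opener, started_at, and outcome (field names are assumptions); quadratic, but workable at this corpus size:

```python
from datetime import timedelta
from difflib import SequenceMatcher

def likely_reopens(threads, window_days=14, sim_threshold=0.6):
    # flag threads whose opener closely matches an earlier RESOLVED thread
    # by the same user within the window: a likely re-open of the same problem
    resolved = [t for t in threads if t["outcome"] == "RESOLVED"]
    pairs = []
    for t in threads:
        for r in resolved:
            gap = t["started_at"] - r["started_at"]
            if (t["user"] == r["user"] and t["id"] != r["id"]
                    and timedelta(0) < gap <= timedelta(days=window_days)
                    and SequenceMatcher(None, r["opener"], t["opener"]).ratio() >= sim_threshold):
                pairs.append((r["id"], t["id"]))
    return pairs  # share of RESOLVED ids appearing here ~ re-open rate
```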
10. cross-thread continuity not analyzed
- many threads reference prior threads (“continuing from T-xxx”)
- these chains are not reconstructed; a reconstruction sketch follows this item
- question: do multi-thread chains have different success patterns than isolated threads?
- question: is “handoff” actually a failure or a healthy delegation pattern?
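a reconstruction sketch, assuming references appear in openers as "continuing from T-xxx" (the pattern and field names are assumptions about this corpus):

```python
import re

REF = re.compile(r"continuing from (T-\w+)", re.IGNORECASE)

def build_chains(threads):
    # map each thread to the thread it says it continues from
    parent = {}
    for t in threads:
        m = REF.search(t["opener"])
        if m:
            parent[t["id"]] = m.group(1)
    # group every thread under the root of its chain
    chains = {}
    for tid in (t["id"] for t in threads):
        root, seen = tid, set()
        while root in parent and root not in seen:  # walk up, cycle-safe
            seen.add(root)
            root = parent[root]
        chains.setdefault(root, []).append(tid)
    # multi-thread chains vs isolated threads can then be compared on outcomes
    return {root: members for root, members in chains.items() if len(members) > 1}
```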
11. no semantic task clustering
- threads analyzed by surface patterns (length, steering, tools)
- no clustering by TASK TYPE (bug fix vs feature vs refactor vs exploration)
- question: do success patterns differ fundamentally by task type? (clustering sketch below)
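one possible clustering sketch, using sentence-transformers embeddings of openers plus k-means; the model name and k are placeholders, and clusters would still need manual naming (bug fix, feature, refactor, exploration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_openers(openers, k=6):
    # embed opener texts and group them into k rough task-type clusters
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(openers, normalize_embeddings=True)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return labels  # then break resolution rate out per cluster
```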
12. agent model not controlled for
- data spans may 2025 – jan 2026
- amp’s underlying model likely changed during this period
- question: are improvements in metrics (e.g., verbose_explorer’s learning curve) user learning or model improvement?
UNEXPLORED TERRITORIES
13. code quality not measured
- no static analysis of code produced by threads
- question: do high-steering threads produce BETTER code despite friction?
- question: do fast-resolved threads produce more tech debt?
14. git outcomes not linked
- threads produce commits, but commit outcomes (reverted? CI failed? merged?) are not tracked; a revert-check sketch follows this item
- question: what’s the correlation between thread outcome and CI/merge success?
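one piece of the linkage is sketchable with plain git, assuming the SHAs each thread produced are already extracted; CI and merge status would need the forge API and is not shown:

```python
import subprocess

def was_reverted(sha: str, repo: str) -> bool:
    # default git revert messages contain "This reverts commit <sha>"
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all",
         "--grep", f"This reverts commit {sha}", "--format=%H"],
        capture_output=True, text=True, check=True,
    )
    return bool(out.stdout.strip())
```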
15. external context not captured
- user may have been on-call, in a meeting, multitasking
- question: how much variance is explained by factors outside the thread?
16. user intent not validated
- we infer intent from opener, but don’t validate
- question: do users feel RESOLVED threads actually resolved their problem?
17. multimodal inputs not analyzed
- users attach screenshots, images, PDFs
- question: does attachment usage correlate with success?
- question: are certain attachment types (screenshot vs diagram) more effective? (association-test sketch below)
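a first-pass association check is a 2x2 chi-square on attachment vs outcome counts (the counts are placeholders to fill from the corpus):

```python
from scipy.stats import chi2_contingency

def attachment_vs_outcome(resolved_with, unresolved_with,
                          resolved_without, unresolved_without):
    # rows: with/without attachments; columns: resolved/unresolved
    table = [[resolved_with, unresolved_with],
             [resolved_without, unresolved_without]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value
```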
18. repo/domain context not controlled
- success rates conflate task difficulty with user skill
- question: is concise_commander’s 60% resolution rate because he’s good, or because the query_engine codebase is well-suited to amp?
ACTIONABILITY QUESTIONS
19. intervention effectiveness unknown
- we recommend “pause after 2 consecutive steerings”
- question: has anyone tested if interventions actually help?
- question: would showing users their approval:steering ratio change behavior?
20. generalizability uncertain
- all data is from one team/org using amp
- question: do these patterns hold for different codebases, languages, team sizes?
PRIORITY RANKING
if further analysis time is available, prioritize:
- outcome label audit (manual sample validation) — affects credibility of all findings
- within-user time-of-day — controls for confounds on temporal recommendations
- cross-thread chaining — handoff may not be failure
- git/CI outcome linkage — ground truth for “success”
- task type clustering — bug fix vs feature vs refactor have different dynamics
compiled by clint_sparklespark | 2026-01-09 | corpus: 4,656 threads | 208,799 messages | 20 users | may 2025 – jan 2026