open questions: gaps in the analysis
the analysis is extensive (4,656 threads, 208,799 messages, ~100 insight files), but significant gaps remain, organized below by severity.
CAUSAL DIRECTION UNKNOWN
these correlations are documented but causation is unclear:
1. oracle usage → frustration
- finding: 46% of FRUSTRATED threads use oracle vs 25% of RESOLVED
- open question: does oracle usage CAUSE worse outcomes, or do users reach for oracle BECAUSE they’re already stuck?
- implication: if oracle-early helps, current guidance (“use oracle for planning”) is right. if oracle is just a marker, guidance is misleading.
- test needed: an A/B test of forced oracle usage at thread start vs organic usage (readout sketch below)
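if that A/B ever runs, the readout is a plain two-proportion comparison. a minimal sketch, assuming per-arm resolved counts and totals (all names and numbers are placeholders):

```python
# compare resolution rates between a forced-oracle-at-start arm and an
# organic-usage arm; counts are placeholders to fill from the experiment
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

def compare_arms(resolved_forced, n_forced, resolved_organic, n_organic):
    stat, p_value = proportions_ztest(
        count=[resolved_forced, resolved_organic],
        nobs=[n_forced, n_organic],
    )
    ci_forced = proportion_confint(resolved_forced, n_forced, method="wilson")
    ci_organic = proportion_confint(resolved_organic, n_organic, method="wilson")
    return p_value, ci_forced, ci_organic
```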
2. terse style → success
- finding: concise_commander’s terse style (263 chars) correlates with 60% resolution
- open question: does terse prompting CAUSE success, or do expert users happen to be terse?
- implication: if terse = skill proxy, telling novices to be terse won’t help
- test needed: within-user analysis of terse vs verbose prompts for the same task type (sketch below)
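one way to run the within-user test, as a minimal sketch; the dataframe columns (user, task_type, opener_chars, resolved) and the 300-char cutoff are assumptions, not the real schema:

```python
import pandas as pd

def within_user_terse_effect(threads: pd.DataFrame, cutoff: int = 300) -> pd.DataFrame:
    # threads: one row per thread with user, task_type, opener_chars, resolved (0/1)
    df = threads.copy()
    df["terse"] = df["opener_chars"] < cutoff
    rates = (
        df.groupby(["user", "task_type", "terse"])["resolved"]
          .mean()
          .unstack("terse")
          .rename(columns={True: "terse_rate", False: "verbose_rate"})
          .dropna()  # keep only user/task cells that contain both styles
    )
    # delta near zero within users suggests terseness is mostly a skill proxy
    rates["delta"] = rates["terse_rate"] - rates["verbose_rate"]
    return rates
```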
3. time-of-day effects
- finding: 60% resolution at 2-5am vs 27.5% at 6-9pm
- open question: is this about TIME (cognitive fatigue) or USER COMPOSITION (who works late)?
- implication: current “avoid evening” advice may be confounded
- test needed: per-user time-of-day analysis to control for user effects (sketch below)
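a minimal sketch of that control: put user fixed effects in a logistic model so "who works late" is absorbed, leaving the within-user time signal. column names (user, hour, resolved) are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

def time_of_day_effect(threads: pd.DataFrame):
    # threads: one row per thread with user, hour (0-23), resolved (0/1)
    df = threads.copy()
    df["hour_bucket"] = pd.cut(
        df["hour"], bins=[0, 6, 12, 18, 24], right=False,
        labels=["night", "morning", "afternoon", "evening"],
    )
    # if the hour_bucket coefficients shrink toward zero once C(user) is
    # included, the raw 2-5am vs 6-9pm gap was mostly user composition
    model = smf.logit("resolved ~ C(hour_bucket) + C(user)", data=df)
    return model.fit(disp=False)
```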
4. steering = engagement
- finding: 60% resolution with steering vs 37% without
- open question: does steering CAUSE success (correction mechanism), or is steering a proxy for user engagement/persistence?
- implication: affects whether we should encourage more steering or just view it as noise
SAMPLE SIZE CONCERNS
5. FRUSTRATED sample is tiny (n=14)
- the entire failure autopsy is based on 14 threads
- patterns like TEST_WEAKENING, HACKING_AROUND may be anecdotal
- question: are there more frustrated threads mislabeled as UNKNOWN or HANDOFF?
- question: what’s the false negative rate of the frustration detector?
6. low-activity user patterns are speculation
- users with <50 threads (feature_lead, precision_pilot, patient_pathfinder) have thin data
- “archetype” assignments for these users may be overfitting to noise
- question: how stable are these patterns with more data?
7. skill usage is near-zero for most skills
- digskill: 1 invocation (literally ONE)
- write, document, clean-copy: single digits
- question: is the “skills are underutilized” finding real, or are skills just bad?
METHODOLOGY GAPS
8. outcome labeling is heuristic
- RESOLVED/COMMITTED/FRUSTRATED assigned by keyword detection + turn count heuristics
- no manual validation of labels (no ground truth audit)
- question: what’s the precision/recall of each label? (audit sketch below)
- question: how many RESOLVED threads actually failed after the thread ended?
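a minimal audit sketch, assuming each thread record carries its heuristic label and receives a human label after review (field names are illustrative):

```python
import random

def stratified_sample(threads, per_label=30, seed=0):
    # draw up to per_label threads from each heuristic label for manual review
    random.seed(seed)
    by_label = {}
    for t in threads:
        by_label.setdefault(t["heuristic_label"], []).append(t)
    return [t for ts in by_label.values()
            for t in random.sample(ts, min(per_label, len(ts)))]

def precision_recall(audited, label):
    tp = sum(t["heuristic_label"] == label and t["human_label"] == label for t in audited)
    fp = sum(t["heuristic_label"] == label and t["human_label"] != label for t in audited)
    fn = sum(t["heuristic_label"] != label and t["human_label"] == label for t in audited)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    # caveat: recall from a per-label stratified sample is biased unless
    # reweighted by each label's prevalence in the full corpus
    return precision, recall
```

the same audit sample would also bound the frustration detector's false negative rate from item 5 (FRUSTRATED threads hiding under UNKNOWN or HANDOFF).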
9. “success” definition is thread-bounded
- a thread can be RESOLVED but:
  - the code it produced may have been reverted
  - the solution may have introduced new bugs
  - the user may have re-opened a new thread on the same issue
- question: what % of RESOLVED threads have follow-up threads on the same problem? (detection sketch below)
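a crude way to estimate that percentage, assuming thread records expose user, opener, started_at, and outcome (field names are assumptions); quadratic, but workable at this corpus size:

```python
from datetime import timedelta
from difflib import SequenceMatcher

def likely_reopens(threads, window_days=14, sim_threshold=0.6):
    # flag threads whose opener closely matches an earlier RESOLVED thread
    # by the same user within the window: a likely re-open of the same problem
    resolved = [t for t in threads if t["outcome"] == "RESOLVED"]
    pairs = []
    for t in threads:
        for r in resolved:
            gap = t["started_at"] - r["started_at"]
            if (t["user"] == r["user"] and t["id"] != r["id"]
                    and timedelta(0) < gap <= timedelta(days=window_days)
                    and SequenceMatcher(None, r["opener"], t["opener"]).ratio() >= sim_threshold):
                pairs.append((r["id"], t["id"]))
    return pairs  # share of RESOLVED ids appearing here ~ re-open rate
```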
10. cross-thread continuity not analyzed
- many threads reference prior threads (“continuing from T-xxx”)
- these chains are not reconstructed; a reconstruction sketch follows this item
- question: do multi-thread chains have different success patterns than isolated threads?
- question: is “handoff” actually a failure or a healthy delegation pattern?
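a reconstruction sketch, assuming references appear in openers as "continuing from T-xxx" (the pattern and field names are assumptions about this corpus):

```python
import re

REF = re.compile(r"continuing from (T-\w+)", re.IGNORECASE)

def build_chains(threads):
    # map each thread to the thread it says it continues from
    parent = {}
    for t in threads:
        m = REF.search(t["opener"])
        if m:
            parent[t["id"]] = m.group(1)
    # group every thread under the root of its chain
    chains = {}
    for tid in (t["id"] for t in threads):
        root, seen = tid, set()
        while root in parent and root not in seen:  # walk up, cycle-safe
            seen.add(root)
            root = parent[root]
        chains.setdefault(root, []).append(tid)
    # multi-thread chains vs isolated threads can then be compared on outcomes
    return {root: members for root, members in chains.items() if len(members) > 1}
```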
11. no semantic task clustering
- threads analyzed by surface patterns (length, steering, tools)
- no clustering by TASK TYPE (bug fix vs feature vs refactor vs exploration)
- question: do success patterns differ fundamentally by task type? (clustering sketch below)
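one possible clustering sketch, using sentence-transformers embeddings of openers plus k-means; the model name and k are placeholders, and clusters would still need manual naming (bug fix, feature, refactor, exploration):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_openers(openers, k=6):
    # embed opener texts and group them into k rough task-type clusters
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(openers, normalize_embeddings=True)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return labels  # then break resolution rate out per cluster
```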
12. agent model not controlled for
- data spans may 2025 – jan 2026
- amp’s underlying model likely changed during this period
- question: are improvements in metrics (e.g., verbose_explorer’s learning curve) user learning or model improvement?
UNEXPLORED TERRITORIES
13. code quality not measured
- no static analysis of code produced by threads
- question: do high-steering threads produce BETTER code despite friction?
- question: do fast-resolved threads produce more tech debt?
14. git outcomes not linked
- threads produce commits, but commit outcomes (reverted? CI failed? merged?) are not tracked; a revert-check sketch follows this item
- question: what’s the correlation between thread outcome and CI/merge success?
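one piece of the linkage is sketchable with plain git, assuming the SHAs each thread produced are already extracted; CI and merge status would need the forge API and is not shown:

```python
import subprocess

def was_reverted(sha: str, repo: str) -> bool:
    # default git revert messages contain "This reverts commit <sha>"
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all",
         "--grep", f"This reverts commit {sha}", "--format=%H"],
        capture_output=True, text=True, check=True,
    )
    return bool(out.stdout.strip())
```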
15. external context not captured
- user may have been on-call, in a meeting, multitasking
- question: how much variance is explained by factors outside the thread?
16. user intent not validated
- we infer intent from opener, but don’t validate
- question: do users feel RESOLVED threads actually resolved their problem?
17. multimodal inputs not analyzed
- users attach screenshots, images, PDFs
- question: does attachment usage correlate with success?
- question: are certain attachment types (screenshot vs diagram) more effective? (association-test sketch below)
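a first-pass association check is a 2x2 chi-square on attachment vs outcome counts (the counts are placeholders to fill from the corpus):

```python
from scipy.stats import chi2_contingency

def attachment_vs_outcome(resolved_with, unresolved_with,
                          resolved_without, unresolved_without):
    # rows: with/without attachments; columns: resolved/unresolved
    table = [[resolved_with, unresolved_with],
             [resolved_without, unresolved_without]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value
```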
18. repo/domain context not controlled
- success rates conflate task difficulty with user skill
- question: is concise_commander’s 60% resolution rate because he’s good, or because the query_engine codebase is well-suited to amp?
ACTIONABILITY QUESTIONS
19. intervention effectiveness unknown
- we recommend “pause after 2 consecutive steerings”
- question: has anyone tested if interventions actually help?
- question: would showing users their approval:steering ratio change behavior?
20. generalizability uncertain
- all data is from one team/org using amp
- question: do these patterns hold for different codebases, languages, team sizes?
PRIORITY RANKING
if further analysis time is available, prioritize:
- outcome label audit (manual sample validation) — affects credibility of all findings
- within-user time-of-day — controls for confounds on temporal recommendations
- cross-thread chaining — handoff may not be failure
- git/CI outcome linkage — ground truth for “success”
- task type clustering — bug fix vs feature vs refactor have different dynamics
compiled by clint_sparklespark | 2026-01-09 | corpus: 4,656 threads | 208,799 messages | 20 users | may 2025 – jan 2026