Add 7 superpowers skills from obra/superpowers

Skills included: - systematic-debugging - test-driven-development - verification-before-completion - writing-plans - requesting-code-review - receiving-code-review - finishing-a-development-branch Source: https://github.com/obra/superpowers
2026-05-15 01:27:48 +02:00 · 2026-05-15 01:27:48 +02:00 · 445ec71a8d
commit 445ec71a8d
21 changed files with 3007 additions and 0 deletions
--- a/package.json
+++ b/package.json
@ -0,0 +1,11 @@
 {
  "name": "pi-superpowers-skills",
  "version": "1.0.0",
  "description": "Superpowers skills for pi agent (obra/superpowers)",
  "license": "MIT",
  "pi": {
    "skills": [
      "./skills"
    ]
  }
 }
--- a/skills/finishing-a-development-branch/SKILL.md
+++ b/skills/finishing-a-development-branch/SKILL.md
@ -0,0 +1,251 @@
 ---
 name: finishing-a-development-branch
 description: Use when implementation is complete, all tests pass, and you need to decide how to integrate the work - guides completion of development work by presenting structured options for merge, PR, or cleanup
 ---
 # Finishing a Development Branch
 ## Overview
 Guide completion of development work by presenting clear options and handling chosen workflow.
 **Core principle:** Verify tests → Detect environment → Present options → Execute choice → Clean up.
 **Announce at start:** "I'm using the finishing-a-development-branch skill to complete this work."
 ## The Process
 ### Step 1: Verify Tests
 **Before presenting options, verify tests pass:**
 ```bash
 # Run project's test suite
 npm test / cargo test / pytest / go test ./...
 ```
 **If tests fail:**
 ```
 Tests failing (<N> failures). Must fix before completing:
 [Show failures]
 Cannot proceed with merge/PR until tests pass.
 ```
 Stop. Don't proceed to Step 2.
 **If tests pass:** Continue to Step 2.
 ### Step 2: Detect Environment
 **Determine workspace state before presenting options:**
 ```bash
 GIT_DIR=$(cd "$(git rev-parse --git-dir)" 2>/dev/null && pwd -P)
 GIT_COMMON=$(cd "$(git rev-parse --git-common-dir)" 2>/dev/null && pwd -P)
 ```
 This determines which menu to show and how cleanup works:
 | State | Menu | Cleanup |
 |-------|------|---------|
 | `GIT_DIR == GIT_COMMON` (normal repo) | Standard 4 options | No worktree to clean up |
 | `GIT_DIR != GIT_COMMON`, named branch | Standard 4 options | Provenance-based (see Step 6) |
 | `GIT_DIR != GIT_COMMON`, detached HEAD | Reduced 3 options (no merge) | No cleanup (externally managed) |
 ### Step 3: Determine Base Branch
 ```bash
 # Try common base branches
 git merge-base HEAD main 2>/dev/null || git merge-base HEAD master 2>/dev/null
 ```
 Or ask: "This branch split from main - is that correct?"
 ### Step 4: Present Options
 **Normal repo and named-branch worktree — present exactly these 4 options:**
 ```
 Implementation complete. What would you like to do?
 1. Merge back to <base-branch> locally
 2. Push and create a Pull Request
 3. Keep the branch as-is (I'll handle it later)
 4. Discard this work
 Which option?
 ```
 **Detached HEAD — present exactly these 3 options:**
 ```
 Implementation complete. You're on a detached HEAD (externally managed workspace).
 1. Push as new branch and create a Pull Request
 2. Keep as-is (I'll handle it later)
 3. Discard this work
 Which option?
 ```
 **Don't add explanation** - keep options concise.
 ### Step 5: Execute Choice
 #### Option 1: Merge Locally
 ```bash
 # Get main repo root for CWD safety
 MAIN_ROOT=$(git -C "$(git rev-parse --git-common-dir)/.." rev-parse --show-toplevel)
 cd "$MAIN_ROOT"
 # Merge first — verify success before removing anything
 git checkout <base-branch>
 git pull
 git merge <feature-branch>
 # Verify tests on merged result
 <test command>
 # Only after merge succeeds: cleanup worktree (Step 6), then delete branch
 ```
 Then: Cleanup worktree (Step 6), then delete branch:
 ```bash
 git branch -d <feature-branch>
 ```
 #### Option 2: Push and Create PR
 ```bash
 # Push branch
 git push -u origin <feature-branch>
 # Create PR
 gh pr create --title "<title>" --body "$(cat <<'EOF'
 ## Summary
 <2-3 bullets of what changed>
 ## Test Plan
 - [ ] <verification steps>
 EOF
 )"
 ```
 **Do NOT clean up worktree** — user needs it alive to iterate on PR feedback.
 #### Option 3: Keep As-Is
 Report: "Keeping branch <name>. Worktree preserved at <path>."
 **Don't cleanup worktree.**
 #### Option 4: Discard
 **Confirm first:**
 ```
 This will permanently delete:
 - Branch <name>
 - All commits: <commit-list>
 - Worktree at <path>
 Type 'discard' to confirm.
 ```
 Wait for exact confirmation.
 If confirmed:
 ```bash
 MAIN_ROOT=$(git -C "$(git rev-parse --git-common-dir)/.." rev-parse --show-toplevel)
 cd "$MAIN_ROOT"
 ```
 Then: Cleanup worktree (Step 6), then force-delete branch:
 ```bash
 git branch -D <feature-branch>
 ```
 ### Step 6: Cleanup Workspace
 **Only runs for Options 1 and 4.** Options 2 and 3 always preserve the worktree.
 ```bash
 GIT_DIR=$(cd "$(git rev-parse --git-dir)" 2>/dev/null && pwd -P)
 GIT_COMMON=$(cd "$(git rev-parse --git-common-dir)" 2>/dev/null && pwd -P)
 WORKTREE_PATH=$(git rev-parse --show-toplevel)
 ```
 **If `GIT_DIR == GIT_COMMON`:** Normal repo, no worktree to clean up. Done.
 **If worktree path is under `.worktrees/`, `worktrees/`, or `~/.config/superpowers/worktrees/`:** Superpowers created this worktree — we own cleanup.
 ```bash
 MAIN_ROOT=$(git -C "$(git rev-parse --git-common-dir)/.." rev-parse --show-toplevel)
 cd "$MAIN_ROOT"
 git worktree remove "$WORKTREE_PATH"
 git worktree prune  # Self-healing: clean up any stale registrations
 ```
 **Otherwise:** The host environment (harness) owns this workspace. Do NOT remove it. If your platform provides a workspace-exit tool, use it. Otherwise, leave the workspace in place.
 ## Quick Reference
 | Option | Merge | Push | Keep Worktree | Cleanup Branch |
 |--------|-------|------|---------------|----------------|
 | 1. Merge locally | yes | - | - | yes |
 | 2. Create PR | - | yes | yes | - |
 | 3. Keep as-is | - | - | yes | - |
 | 4. Discard | - | - | - | yes (force) |
 ## Common Mistakes
 **Skipping test verification**
 - **Problem:** Merge broken code, create failing PR
 - **Fix:** Always verify tests before offering options
 **Open-ended questions**
 - **Problem:** "What should I do next?" is ambiguous
 - **Fix:** Present exactly 4 structured options (or 3 for detached HEAD)
 **Cleaning up worktree for Option 2**
 - **Problem:** Remove worktree user needs for PR iteration
 - **Fix:** Only cleanup for Options 1 and 4
 **Deleting branch before removing worktree**
 - **Problem:** `git branch -d` fails because worktree still references the branch
 - **Fix:** Merge first, remove worktree, then delete branch
 **Running git worktree remove from inside the worktree**
 - **Problem:** Command fails silently when CWD is inside the worktree being removed
 - **Fix:** Always `cd` to main repo root before `git worktree remove`
 **Cleaning up harness-owned worktrees**
 - **Problem:** Removing a worktree the harness created causes phantom state
 - **Fix:** Only clean up worktrees under `.worktrees/`, `worktrees/`, or `~/.config/superpowers/worktrees/`
 **No confirmation for discard**
 - **Problem:** Accidentally delete work
 - **Fix:** Require typed "discard" confirmation
 ## Red Flags
 **Never:**
 - Proceed with failing tests
 - Merge without verifying tests on result
 - Delete work without confirmation
 - Force-push without explicit request
 - Remove a worktree before confirming merge success
 - Clean up worktrees you didn't create (provenance check)
 - Run `git worktree remove` from inside the worktree
 **Always:**
 - Verify tests before offering options
 - Detect environment before presenting menu
 - Present exactly 4 options (or 3 for detached HEAD)
 - Get typed confirmation for Option 4
 - Clean up worktree for Options 1 & 4 only
 - `cd` to main repo root before worktree removal
 - Run `git worktree prune` after removal
--- a/skills/receiving-code-review/SKILL.md
+++ b/skills/receiving-code-review/SKILL.md
@ -0,0 +1,213 @@
 ---
 name: receiving-code-review
 description: Use when receiving code review feedback, before implementing suggestions, especially if feedback seems unclear or technically questionable - requires technical rigor and verification, not performative agreement or blind implementation
 ---
 # Code Review Reception
 ## Overview
 Code review requires technical evaluation, not emotional performance.
 **Core principle:** Verify before implementing. Ask before assuming. Technical correctness over social comfort.
 ## The Response Pattern
 ```
 WHEN receiving code review feedback:
 1. READ: Complete feedback without reacting
 2. UNDERSTAND: Restate requirement in own words (or ask)
 3. VERIFY: Check against codebase reality
 4. EVALUATE: Technically sound for THIS codebase?
 5. RESPOND: Technical acknowledgment or reasoned pushback
 6. IMPLEMENT: One item at a time, test each
 ```
 ## Forbidden Responses
 **NEVER:**
 - "You're absolutely right!" (explicit CLAUDE.md violation)
 - "Great point!" / "Excellent feedback!" (performative)
 - "Let me implement that now" (before verification)
 **INSTEAD:**
 - Restate the technical requirement
 - Ask clarifying questions
 - Push back with technical reasoning if wrong
 - Just start working (actions > words)
 ## Handling Unclear Feedback
 ```
 IF any item is unclear:
  STOP - do not implement anything yet
  ASK for clarification on unclear items
 WHY: Items may be related. Partial understanding = wrong implementation.
 ```
 **Example:**
 ```
 your human partner: "Fix 1-6"
 You understand 1,2,3,6. Unclear on 4,5.
 ❌ WRONG: Implement 1,2,3,6 now, ask about 4,5 later
 ✅ RIGHT: "I understand items 1,2,3,6. Need clarification on 4 and 5 before proceeding."
 ```
 ## Source-Specific Handling
 ### From your human partner
 - **Trusted** - implement after understanding
 - **Still ask** if scope unclear
 - **No performative agreement**
 - **Skip to action** or technical acknowledgment
 ### From External Reviewers
 ```
 BEFORE implementing:
  1. Check: Technically correct for THIS codebase?
  2. Check: Breaks existing functionality?
  3. Check: Reason for current implementation?
  4. Check: Works on all platforms/versions?
  5. Check: Does reviewer understand full context?
 IF suggestion seems wrong:
  Push back with technical reasoning
 IF can't easily verify:
  Say so: "I can't verify this without [X]. Should I [investigate/ask/proceed]?"
 IF conflicts with your human partner's prior decisions:
  Stop and discuss with your human partner first
 ```
 **your human partner's rule:** "External feedback - be skeptical, but check carefully"
 ## YAGNI Check for "Professional" Features
 ```
 IF reviewer suggests "implementing properly":
  grep codebase for actual usage
  IF unused: "This endpoint isn't called. Remove it (YAGNI)?"
  IF used: Then implement properly
 ```
 **your human partner's rule:** "You and reviewer both report to me. If we don't need this feature, don't add it."
 ## Implementation Order
 ```
 FOR multi-item feedback:
  1. Clarify anything unclear FIRST
  2. Then implement in this order:
     - Blocking issues (breaks, security)
     - Simple fixes (typos, imports)
     - Complex fixes (refactoring, logic)
  3. Test each fix individually
  4. Verify no regressions
 ```
 ## When To Push Back
 Push back when:
 - Suggestion breaks existing functionality
 - Reviewer lacks full context
 - Violates YAGNI (unused feature)
 - Technically incorrect for this stack
 - Legacy/compatibility reasons exist
 - Conflicts with your human partner's architectural decisions
 **How to push back:**
 - Use technical reasoning, not defensiveness
 - Ask specific questions
 - Reference working tests/code
 - Involve your human partner if architectural
 **Signal if uncomfortable pushing back out loud:** "Strange things are afoot at the Circle K"
 ## Acknowledging Correct Feedback
 When feedback IS correct:
 ```
 ✅ "Fixed. [Brief description of what changed]"
 ✅ "Good catch - [specific issue]. Fixed in [location]."
 ✅ [Just fix it and show in the code]
 ❌ "You're absolutely right!"
 ❌ "Great point!"
 ❌ "Thanks for catching that!"
 ❌ "Thanks for [anything]"
 ❌ ANY gratitude expression
 ```
 **Why no thanks:** Actions speak. Just fix it. The code itself shows you heard the feedback.
 **If you catch yourself about to write "Thanks":** DELETE IT. State the fix instead.
 ## Gracefully Correcting Your Pushback
 If you pushed back and were wrong:
 ```
 ✅ "You were right - I checked [X] and it does [Y]. Implementing now."
 ✅ "Verified this and you're correct. My initial understanding was wrong because [reason]. Fixing."
 ❌ Long apology
 ❌ Defending why you pushed back
 ❌ Over-explaining
 ```
 State the correction factually and move on.
 ## Common Mistakes
 | Mistake | Fix |
 |---------|-----|
 | Performative agreement | State requirement or just act |
 | Blind implementation | Verify against codebase first |
 | Batch without testing | One at a time, test each |
 | Assuming reviewer is right | Check if breaks things |
 | Avoiding pushback | Technical correctness > comfort |
 | Partial implementation | Clarify all items first |
 | Can't verify, proceed anyway | State limitation, ask for direction |
 ## Real Examples
 **Performative Agreement (Bad):**
 ```
 Reviewer: "Remove legacy code"
 ❌ "You're absolutely right! Let me remove that..."
 ```
 **Technical Verification (Good):**
 ```
 Reviewer: "Remove legacy code"
 ✅ "Checking... build target is 10.15+, this API needs 13+. Need legacy for backward compat. Current impl has wrong bundle ID - fix it or drop pre-13 support?"
 ```
 **YAGNI (Good):**
 ```
 Reviewer: "Implement proper metrics tracking with database, date filters, CSV export"
 ✅ "Grepped codebase - nothing calls this endpoint. Remove it (YAGNI)? Or is there usage I'm missing?"
 ```
 **Unclear Item (Good):**
 ```
 your human partner: "Fix items 1-6"
 You understand 1,2,3,6. Unclear on 4,5.
 ✅ "Understand 1,2,3,6. Need clarification on 4 and 5 before implementing."
 ```
 ## GitHub Thread Replies
 When replying to inline review comments on GitHub, reply in the comment thread (`gh api repos/{owner}/{repo}/pulls/{pr}/comments/{id}/replies`), not as a top-level PR comment.
 ## The Bottom Line
 **External feedback = suggestions to evaluate, not orders to follow.**
 Verify. Question. Then implement.
 No performative agreement. Technical rigor always.
--- a/skills/requesting-code-review/SKILL.md
+++ b/skills/requesting-code-review/SKILL.md
@ -0,0 +1,103 @@
 ---
 name: requesting-code-review
 description: Use when completing tasks, implementing major features, or before merging to verify work meets requirements
 ---
 # Requesting Code Review
 Dispatch a code reviewer subagent to catch issues before they cascade. The reviewer gets precisely crafted context for evaluation — never your session's history. This keeps the reviewer focused on the work product, not your thought process, and preserves your own context for continued work.
 **Core principle:** Review early, review often.
 ## When to Request Review
 **Mandatory:**
 - After each task in subagent-driven development
 - After completing major feature
 - Before merge to main
 **Optional but valuable:**
 - When stuck (fresh perspective)
 - Before refactoring (baseline check)
 - After fixing complex bug
 ## How to Request
 **1. Get git SHAs:**
 ```bash
 BASE_SHA=$(git rev-parse HEAD~1)  # or origin/main
 HEAD_SHA=$(git rev-parse HEAD)
 ```
 **2. Dispatch code reviewer subagent:**
 Use Task tool with `general-purpose` type, fill template at `code-reviewer.md`
 **Placeholders:**
 - `{DESCRIPTION}` - Brief summary of what you built
 - `{PLAN_OR_REQUIREMENTS}` - What it should do
 - `{BASE_SHA}` - Starting commit
 - `{HEAD_SHA}` - Ending commit
 **3. Act on feedback:**
 - Fix Critical issues immediately
 - Fix Important issues before proceeding
 - Note Minor issues for later
 - Push back if reviewer is wrong (with reasoning)
 ## Example
 ```
 [Just completed Task 2: Add verification function]
 You: Let me request code review before proceeding.
 BASE_SHA=$(git log --oneline | grep "Task 1" | head -1 | awk '{print $1}')
 HEAD_SHA=$(git rev-parse HEAD)
 [Dispatch code reviewer subagent]
  DESCRIPTION: Added verifyIndex() and repairIndex() with 4 issue types
  PLAN_OR_REQUIREMENTS: Task 2 from docs/superpowers/plans/deployment-plan.md
  BASE_SHA: a7981ec
  HEAD_SHA: 3df7661
 [Subagent returns]:
  Strengths: Clean architecture, real tests
  Issues:
    Important: Missing progress indicators
    Minor: Magic number (100) for reporting interval
  Assessment: Ready to proceed
 You: [Fix progress indicators]
 [Continue to Task 3]
 ```
 ## Integration with Workflows
 **Subagent-Driven Development:**
 - Review after EACH task
 - Catch issues before they compound
 - Fix before moving to next task
 **Executing Plans:**
 - Review after each task or at natural checkpoints
 - Get feedback, apply, continue
 **Ad-Hoc Development:**
 - Review before merge
 - Review when stuck
 ## Red Flags
 **Never:**
 - Skip review because "it's simple"
 - Ignore Critical issues
 - Proceed with unfixed Important issues
 - Argue with valid technical feedback
 **If reviewer wrong:**
 - Push back with technical reasoning
 - Show code/tests that prove it works
 - Request clarification
 See template at: requesting-code-review/code-reviewer.md
--- a/skills/requesting-code-review/code-reviewer.md
+++ b/skills/requesting-code-review/code-reviewer.md
@ -0,0 +1,168 @@
 # Code Reviewer Prompt Template
 Use this template when dispatching a code reviewer subagent.
 **Purpose:** Review completed work against requirements and code quality standards before it cascades into more work.
 ```
 Task tool (general-purpose):
  description: "Review code changes"
  prompt: |
    You are a Senior Code Reviewer with expertise in software architecture,
    design patterns, and best practices. Your job is to review completed work
    against its plan or requirements and identify issues before they cascade.
    ## What Was Implemented
    {DESCRIPTION}
    ## Requirements / Plan
    {PLAN_OR_REQUIREMENTS}
    ## Git Range to Review
    **Base:** {BASE_SHA}
    **Head:** {HEAD_SHA}
    ```bash
    git diff --stat {BASE_SHA}..{HEAD_SHA}
    git diff {BASE_SHA}..{HEAD_SHA}
    ```
    ## What to Check
    **Plan alignment:**
    - Does the implementation match the plan / requirements?
    - Are deviations justified improvements, or problematic departures?
    - Is all planned functionality present?
    **Code quality:**
    - Clean separation of concerns?
    - Proper error handling?
    - Type safety where applicable?
    - DRY without premature abstraction?
    - Edge cases handled?
    **Architecture:**
    - Sound design decisions?
    - Reasonable scalability and performance?
    - Security concerns?
    - Integrates cleanly with surrounding code?
    **Testing:**
    - Tests verify real behavior, not mocks?
    - Edge cases covered?
    - Integration tests where they matter?
    - All tests passing?
    **Production readiness:**
    - Migration strategy if schema changed?
    - Backward compatibility considered?
    - Documentation complete?
    - No obvious bugs?
    ## Calibration
    Categorize issues by actual severity. Not everything is Critical.
    Acknowledge what was done well before listing issues — accurate praise
    helps the implementer trust the rest of the feedback.
    If you find significant deviations from the plan, flag them specifically
    so the implementer can confirm whether the deviation was intentional.
    If you find issues with the plan itself rather than the implementation,
    say so.
    ## Output Format
    ### Strengths
    [What's well done? Be specific.]
    ### Issues
    #### Critical (Must Fix)
    [Bugs, security issues, data loss risks, broken functionality]
    #### Important (Should Fix)
    [Architecture problems, missing features, poor error handling, test gaps]
    #### Minor (Nice to Have)
    [Code style, optimization opportunities, documentation polish]
    For each issue:
    - File:line reference
    - What's wrong
    - Why it matters
    - How to fix (if not obvious)
    ### Recommendations
    [Improvements for code quality, architecture, or process]
    ### Assessment
    **Ready to merge?** [Yes | No | With fixes]
    **Reasoning:** [1-2 sentence technical assessment]
    ## Critical Rules
    **DO:**
    - Categorize by actual severity
    - Be specific (file:line, not vague)
    - Explain WHY each issue matters
    - Acknowledge strengths
    - Give a clear verdict
    **DON'T:**
    - Say "looks good" without checking
    - Mark nitpicks as Critical
    - Give feedback on code you didn't actually read
    - Be vague ("improve error handling")
    - Avoid giving a clear verdict
 ```
 **Placeholders:**
 - `{DESCRIPTION}` — brief summary of what was built
 - `{PLAN_OR_REQUIREMENTS}` — what it should do (plan file path, task text, or requirements)
 - `{BASE_SHA}` — starting commit
 - `{HEAD_SHA}` — ending commit
 **Reviewer returns:** Strengths, Issues (Critical / Important / Minor), Recommendations, Assessment
 ## Example Output
 ```
 ### Strengths
 - Clean database schema with proper migrations (db.ts:15-42)
 - Comprehensive test coverage (18 tests, all edge cases)
 - Good error handling with fallbacks (summarizer.ts:85-92)
 ### Issues
 #### Important
 1. **Missing help text in CLI wrapper**
   - File: index-conversations:1-31
   - Issue: No --help flag, users won't discover --concurrency
   - Fix: Add --help case with usage examples
 2. **Date validation missing**
   - File: search.ts:25-27
   - Issue: Invalid dates silently return no results
   - Fix: Validate ISO format, throw error with example
 #### Minor
 1. **Progress indicators**
   - File: indexer.ts:130
   - Issue: No "X of Y" counter for long operations
   - Impact: Users don't know how long to wait
 ### Recommendations
 - Add progress reporting for user experience
 - Consider config file for excluded projects (portability)
 ### Assessment
 **Ready to merge: With fixes**
 **Reasoning:** Core implementation is solid with good architecture and tests. Important issues (help text, date validation) are easily fixed and don't affect core functionality.
 ```
--- a/skills/systematic-debugging/CREATION-LOG.md
+++ b/skills/systematic-debugging/CREATION-LOG.md
@ -0,0 +1,119 @@
 # Creation Log: Systematic Debugging Skill
 Reference example of extracting, structuring, and bulletproofing a critical skill.
 ## Source Material
 Extracted debugging framework from `~/.claude/CLAUDE.md`:
 - 4-phase systematic process (Investigation → Pattern Analysis → Hypothesis → Implementation)
 - Core mandate: ALWAYS find root cause, NEVER fix symptoms
 - Rules designed to resist time pressure and rationalization
 ## Extraction Decisions
 **What to include:**
 - Complete 4-phase framework with all rules
 - Anti-shortcuts ("NEVER fix symptom", "STOP and re-analyze")
 - Pressure-resistant language ("even if faster", "even if I seem in a hurry")
 - Concrete steps for each phase
 **What to leave out:**
 - Project-specific context
 - Repetitive variations of same rule
 - Narrative explanations (condensed to principles)
 ## Structure Following skill-creation/SKILL.md
 1. **Rich when_to_use** - Included symptoms and anti-patterns
 2. **Type: technique** - Concrete process with steps
 3. **Keywords** - "root cause", "symptom", "workaround", "debugging", "investigation"
 4. **Flowchart** - Decision point for "fix failed" → re-analyze vs add more fixes
 5. **Phase-by-phase breakdown** - Scannable checklist format
 6. **Anti-patterns section** - What NOT to do (critical for this skill)
 ## Bulletproofing Elements
 Framework designed to resist rationalization under pressure:
 ### Language Choices
 - "ALWAYS" / "NEVER" (not "should" / "try to")
 - "even if faster" / "even if I seem in a hurry"
 - "STOP and re-analyze" (explicit pause)
 - "Don't skip past" (catches the actual behavior)
 ### Structural Defenses
 - **Phase 1 required** - Can't skip to implementation
 - **Single hypothesis rule** - Forces thinking, prevents shotgun fixes
 - **Explicit failure mode** - "IF your first fix doesn't work" with mandatory action
 - **Anti-patterns section** - Shows exactly what shortcuts look like
 ### Redundancy
 - Root cause mandate in overview + when_to_use + Phase 1 + implementation rules
 - "NEVER fix symptom" appears 4 times in different contexts
 - Each phase has explicit "don't skip" guidance
 ## Testing Approach
 Created 4 validation tests following skills/meta/testing-skills-with-subagents:
 ### Test 1: Academic Context (No Pressure)
 - Simple bug, no time pressure
 - **Result:** Perfect compliance, complete investigation
 ### Test 2: Time Pressure + Obvious Quick Fix
 - User "in a hurry", symptom fix looks easy
 - **Result:** Resisted shortcut, followed full process, found real root cause
 ### Test 3: Complex System + Uncertainty
 - Multi-layer failure, unclear if can find root cause
 - **Result:** Systematic investigation, traced through all layers, found source
 ### Test 4: Failed First Fix
 - Hypothesis doesn't work, temptation to add more fixes
 - **Result:** Stopped, re-analyzed, formed new hypothesis (no shotgun)
 **All tests passed.** No rationalizations found.
 ## Iterations
 ### Initial Version
 - Complete 4-phase framework
 - Anti-patterns section
 - Flowchart for "fix failed" decision
 ### Enhancement 1: TDD Reference
 - Added link to skills/testing/test-driven-development
 - Note explaining TDD's "simplest code" ≠ debugging's "root cause"
 - Prevents confusion between methodologies
 ## Final Outcome
 Bulletproof skill that:
 - ✅ Clearly mandates root cause investigation
 - ✅ Resists time pressure rationalization
 - ✅ Provides concrete steps for each phase
 - ✅ Shows anti-patterns explicitly
 - ✅ Tested under multiple pressure scenarios
 - ✅ Clarifies relationship to TDD
 - ✅ Ready for use
 ## Key Insight
 **Most important bulletproofing:** Anti-patterns section showing exact shortcuts that feel justified in the moment. When Claude thinks "I'll just add this one quick fix", seeing that exact pattern listed as wrong creates cognitive friction.
 ## Usage Example
 When encountering a bug:
 1. Load skill: skills/debugging/systematic-debugging
 2. Read overview (10 sec) - reminded of mandate
 3. Follow Phase 1 checklist - forced investigation
 4. If tempted to skip - see anti-pattern, stop
 5. Complete all phases - root cause found
 **Time investment:** 5-10 minutes
 **Time saved:** Hours of symptom-whack-a-mole
 ---
 *Created: 2025-10-03*
 *Purpose: Reference example for skill extraction and bulletproofing*
--- a/skills/systematic-debugging/SKILL.md
+++ b/skills/systematic-debugging/SKILL.md
@ -0,0 +1,296 @@
 ---
 name: systematic-debugging
 description: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
 ---
 # Systematic Debugging
 ## Overview
 Random fixes waste time and create new bugs. Quick patches mask underlying issues.
 **Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
 **Violating the letter of this process is violating the spirit of debugging.**
 ## The Iron Law
 ```
 NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
 ```
 If you haven't completed Phase 1, you cannot propose fixes.
 ## When to Use
 Use for ANY technical issue:
 - Test failures
 - Bugs in production
 - Unexpected behavior
 - Performance problems
 - Build failures
 - Integration issues
 **Use this ESPECIALLY when:**
 - Under time pressure (emergencies make guessing tempting)
 - "Just one quick fix" seems obvious
 - You've already tried multiple fixes
 - Previous fix didn't work
 - You don't fully understand the issue
 **Don't skip when:**
 - Issue seems simple (simple bugs have root causes too)
 - You're in a hurry (rushing guarantees rework)
 - Manager wants it fixed NOW (systematic is faster than thrashing)
 ## The Four Phases
 You MUST complete each phase before proceeding to the next.
 ### Phase 1: Root Cause Investigation
 **BEFORE attempting ANY fix:**
 1. **Read Error Messages Carefully**
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes
 2. **Reproduce Consistently**
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible → gather more data, don't guess
 3. **Check Recent Changes**
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences
 4. **Gather Evidence in Multi-Component Systems**
   **WHEN system has multiple components (CI → build → signing, API → service → database):**
   **BEFORE proposing fixes, add diagnostic instrumentation:**
   ```
   For EACH component boundary:
     - Log what data enters component
     - Log what data exits component
     - Verify environment/config propagation
     - Check state at each layer
   Run once to gather evidence showing WHERE it breaks
   THEN analyze evidence to identify failing component
   THEN investigate that specific component
   ```
   **Example (multi-layer system):**
   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   echo "IDENTITY: ${IDENTITY:+SET}${IDENTITY:-UNSET}"
   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   env | grep IDENTITY || echo "IDENTITY not in environment"
   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v
   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```
   **This reveals:** Which layer fails (secrets → workflow ✓, workflow → build ✗)
 5. **Trace Data Flow**
   **WHEN error is deep in call stack:**
   See `root-cause-tracing.md` in this directory for the complete backward tracing technique.
   **Quick version:**
   - Where does bad value originate?
   - What called this with bad value?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom
 ### Phase 2: Pattern Analysis
 **Find the pattern before fixing:**
 1. **Find Working Examples**
   - Locate similar working code in same codebase
   - What works that's similar to what's broken?
 2. **Compare Against References**
   - If implementing pattern, read reference implementation COMPLETELY
   - Don't skim - read every line
   - Understand the pattern fully before applying
 3. **Identify Differences**
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"
 4. **Understand Dependencies**
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?
 ### Phase 3: Hypothesis and Testing
 **Scientific method:**
 1. **Form Single Hypothesis**
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague
 2. **Test Minimally**
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once
 3. **Verify Before Continuing**
   - Did it work? Yes → Phase 4
   - Didn't work? Form NEW hypothesis
   - DON'T add more fixes on top
 4. **When You Don't Know**
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more
 ### Phase 4: Implementation
 **Fix the root cause, not the symptom:**
 1. **Create Failing Test Case**
   - Simplest possible reproduction
   - Automated test if possible
   - One-off test script if no framework
   - MUST have before fixing
   - Use the `superpowers:test-driven-development` skill for writing proper failing tests
 2. **Implement Single Fix**
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring
 3. **Verify Fix**
   - Test passes now?
   - No other tests broken?
   - Issue actually resolved?
 4. **If Fix Doesn't Work**
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1, re-analyze with new information
   - **If ≥ 3: STOP and question the architecture (step 5 below)**
   - DON'T attempt Fix #4 without architectural discussion
 5. **If 3+ Fixes Failed: Question Architecture**
   **Pattern indicating architectural problem:**
   - Each fix reveals new shared state/coupling/problem in different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere
   **STOP and question fundamentals:**
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?
   **Discuss with your human partner before attempting more fixes**
   This is NOT a failed hypothesis - this is a wrong architecture.
 ## Red Flags - STOP and Follow Process
 If you catch yourself thinking:
 - "Quick fix for now, investigate later"
 - "Just try changing X and see if it works"
 - "Add multiple changes, run tests"
 - "Skip the test, I'll manually verify"
 - "It's probably X, let me fix that"
 - "I don't fully understand but this might work"
 - "Pattern says X but I'll adapt it differently"
 - "Here are the main problems: [lists fixes without investigation]"
 - Proposing solutions before tracing data flow
 - **"One more fix attempt" (when already tried 2+)**
 - **Each fix reveals new problem in different place**
 **ALL of these mean: STOP. Return to Phase 1.**
 **If 3+ fixes failed:** Question the architecture (see Phase 4.5)
 ## your human partner's Signals You're Doing It Wrong
 **Watch for these redirections:**
 - "Is that not happening?" - You assumed without verifying
 - "Will it show us...?" - You should have added evidence gathering
 - "Stop guessing" - You're proposing fixes without understanding
 - "Ultrathink this" - Question fundamentals, not just symptoms
 - "We're stuck?" (frustrated) - Your approach isn't working
 **When you see these:** STOP. Return to Phase 1.
 ## Common Rationalizations
 | Excuse | Reality |
 |--------|---------|
 | "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
 | "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
 | "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
 | "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
 | "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
 | "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
 | "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
 | "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
 ## Quick Reference
 | Phase | Key Activities | Success Criteria |
 |-------|---------------|------------------|
 | **1. Root Cause** | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
 | **2. Pattern** | Find working examples, compare | Identify differences |
 | **3. Hypothesis** | Form theory, test minimally | Confirmed or new hypothesis |
 | **4. Implementation** | Create test, fix, verify | Bug resolved, tests pass |
 ## When Process Reveals "No Root Cause"
 If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
 1. You've completed the process
 2. Document what you investigated
 3. Implement appropriate handling (retry, timeout, error message)
 4. Add monitoring/logging for future investigation
 **But:** 95% of "no root cause" cases are incomplete investigation.
 ## Supporting Techniques
 These techniques are part of systematic debugging and available in this directory:
 - **`root-cause-tracing.md`** - Trace bugs backward through call stack to find original trigger
 - **`defense-in-depth.md`** - Add validation at multiple layers after finding root cause
 - **`condition-based-waiting.md`** - Replace arbitrary timeouts with condition polling
 **Related skills:**
 - **superpowers:test-driven-development** - For creating failing test case (Phase 4, Step 1)
 - **superpowers:verification-before-completion** - Verify fix worked before claiming success
 ## Real-World Impact
 From debugging sessions:
 - Systematic approach: 15-30 minutes to fix
 - Random fixes approach: 2-3 hours of thrashing
 - First-time fix rate: 95% vs 40%
 - New bugs introduced: Near zero vs common
--- a/skills/systematic-debugging/condition-based-waiting-example.ts
+++ b/skills/systematic-debugging/condition-based-waiting-example.ts
@ -0,0 +1,158 @@
 // Complete implementation of condition-based waiting utilities
 // From: Lace test infrastructure improvements (2025-10-03)
 // Context: Fixed 15 flaky tests by replacing arbitrary timeouts
 import type { ThreadManager } from '~/threads/thread-manager';
 import type { LaceEvent, LaceEventType } from '~/threads/types';
 /**
 * Wait for a specific event type to appear in thread
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param eventType - Type of event to wait for
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to the first matching event
 *
 * Example:
 *   await waitForEvent(threadManager, agentThreadId, 'TOOL_RESULT');
 */
 export function waitForEvent(
  threadManager: ThreadManager,
  threadId: string,
  eventType: LaceEventType,
  timeoutMs = 5000
 ): Promise<LaceEvent> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();
    const check = () => {
      const events = threadManager.getEvents(threadId);
      const event = events.find((e) => e.type === eventType);
      if (event) {
        resolve(event);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(new Error(`Timeout waiting for ${eventType} event after ${timeoutMs}ms`));
      } else {
        setTimeout(check, 10); // Poll every 10ms for efficiency
      }
    };
    check();
  });
 }
 /**
 * Wait for a specific number of events of a given type
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param eventType - Type of event to wait for
 * @param count - Number of events to wait for
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to all matching events once count is reached
 *
 * Example:
 *   // Wait for 2 AGENT_MESSAGE events (initial response + continuation)
 *   await waitForEventCount(threadManager, agentThreadId, 'AGENT_MESSAGE', 2);
 */
 export function waitForEventCount(
  threadManager: ThreadManager,
  threadId: string,
  eventType: LaceEventType,
  count: number,
  timeoutMs = 5000
 ): Promise<LaceEvent[]> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();
    const check = () => {
      const events = threadManager.getEvents(threadId);
      const matchingEvents = events.filter((e) => e.type === eventType);
      if (matchingEvents.length >= count) {
        resolve(matchingEvents);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(
          new Error(
            `Timeout waiting for ${count} ${eventType} events after ${timeoutMs}ms (got ${matchingEvents.length})`
          )
        );
      } else {
        setTimeout(check, 10);
      }
    };
    check();
  });
 }
 /**
 * Wait for an event matching a custom predicate
 * Useful when you need to check event data, not just type
 *
 * @param threadManager - The thread manager to query
 * @param threadId - Thread to check for events
 * @param predicate - Function that returns true when event matches
 * @param description - Human-readable description for error messages
 * @param timeoutMs - Maximum time to wait (default 5000ms)
 * @returns Promise resolving to the first matching event
 *
 * Example:
 *   // Wait for TOOL_RESULT with specific ID
 *   await waitForEventMatch(
 *     threadManager,
 *     agentThreadId,
 *     (e) => e.type === 'TOOL_RESULT' && e.data.id === 'call_123',
 *     'TOOL_RESULT with id=call_123'
 *   );
 */
 export function waitForEventMatch(
  threadManager: ThreadManager,
  threadId: string,
  predicate: (event: LaceEvent) => boolean,
  description: string,
  timeoutMs = 5000
 ): Promise<LaceEvent> {
  return new Promise((resolve, reject) => {
    const startTime = Date.now();
    const check = () => {
      const events = threadManager.getEvents(threadId);
      const event = events.find(predicate);
      if (event) {
        resolve(event);
      } else if (Date.now() - startTime > timeoutMs) {
        reject(new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`));
      } else {
        setTimeout(check, 10);
      }
    };
    check();
  });
 }
 // Usage example from actual debugging session:
 //
 // BEFORE (flaky):
 // ---------------
 // const messagePromise = agent.sendMessage('Execute tools');
 // await new Promise(r => setTimeout(r, 300)); // Hope tools start in 300ms
 // agent.abort();
 // await messagePromise;
 // await new Promise(r => setTimeout(r, 50));  // Hope results arrive in 50ms
 // expect(toolResults.length).toBe(2);         // Fails randomly
 //
 // AFTER (reliable):
 // ----------------
 // const messagePromise = agent.sendMessage('Execute tools');
 // await waitForEventCount(threadManager, threadId, 'TOOL_CALL', 2); // Wait for tools to start
 // agent.abort();
 // await messagePromise;
 // await waitForEventCount(threadManager, threadId, 'TOOL_RESULT', 2); // Wait for results
 // expect(toolResults.length).toBe(2); // Always succeeds
 //
 // Result: 60% pass rate → 100%, 40% faster execution
--- a/skills/systematic-debugging/condition-based-waiting.md
+++ b/skills/systematic-debugging/condition-based-waiting.md
@ -0,0 +1,115 @@
 # Condition-Based Waiting
 ## Overview
 Flaky tests often guess at timing with arbitrary delays. This creates race conditions where tests pass on fast machines but fail under load or in CI.
 **Core principle:** Wait for the actual condition you care about, not a guess about how long it takes.
 ## When to Use
 ```dot
 digraph when_to_use {
    "Test uses setTimeout/sleep?" [shape=diamond];
    "Testing timing behavior?" [shape=diamond];
    "Document WHY timeout needed" [shape=box];
    "Use condition-based waiting" [shape=box];
    "Test uses setTimeout/sleep?" -> "Testing timing behavior?" [label="yes"];
    "Testing timing behavior?" -> "Document WHY timeout needed" [label="yes"];
    "Testing timing behavior?" -> "Use condition-based waiting" [label="no"];
 }
 ```
 **Use when:**
 - Tests have arbitrary delays (`setTimeout`, `sleep`, `time.sleep()`)
 - Tests are flaky (pass sometimes, fail under load)
 - Tests timeout when run in parallel
 - Waiting for async operations to complete
 **Don't use when:**
 - Testing actual timing behavior (debounce, throttle intervals)
 - Always document WHY if using arbitrary timeout
 ## Core Pattern
 ```typescript
 // ❌ BEFORE: Guessing at timing
 await new Promise(r => setTimeout(r, 50));
 const result = getResult();
 expect(result).toBeDefined();
 // ✅ AFTER: Waiting for condition
 await waitFor(() => getResult() !== undefined);
 const result = getResult();
 expect(result).toBeDefined();
 ```
 ## Quick Patterns
 | Scenario | Pattern |
 |----------|---------|
 | Wait for event | `waitFor(() => events.find(e => e.type === 'DONE'))` |
 | Wait for state | `waitFor(() => machine.state === 'ready')` |
 | Wait for count | `waitFor(() => items.length >= 5)` |
 | Wait for file | `waitFor(() => fs.existsSync(path))` |
 | Complex condition | `waitFor(() => obj.ready && obj.value > 10)` |
 ## Implementation
 Generic polling function:
 ```typescript
 async function waitFor<T>(
  condition: () => T | undefined | null | false,
  description: string,
  timeoutMs = 5000
 ): Promise<T> {
  const startTime = Date.now();
  while (true) {
    const result = condition();
    if (result) return result;
    if (Date.now() - startTime > timeoutMs) {
      throw new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`);
    }
    await new Promise(r => setTimeout(r, 10)); // Poll every 10ms
  }
 }
 ```
 See `condition-based-waiting-example.ts` in this directory for complete implementation with domain-specific helpers (`waitForEvent`, `waitForEventCount`, `waitForEventMatch`) from actual debugging session.
 ## Common Mistakes
 **❌ Polling too fast:** `setTimeout(check, 1)` - wastes CPU
 **✅ Fix:** Poll every 10ms
 **❌ No timeout:** Loop forever if condition never met
 **✅ Fix:** Always include timeout with clear error
 **❌ Stale data:** Cache state before loop
 **✅ Fix:** Call getter inside loop for fresh data
 ## When Arbitrary Timeout IS Correct
 ```typescript
 // Tool ticks every 100ms - need 2 ticks to verify partial output
 await waitForEvent(manager, 'TOOL_STARTED'); // First: wait for condition
 await new Promise(r => setTimeout(r, 200));   // Then: wait for timed behavior
 // 200ms = 2 ticks at 100ms intervals - documented and justified
 ```
 **Requirements:**
 1. First wait for triggering condition
 2. Based on known timing (not guessing)
 3. Comment explaining WHY
 ## Real-World Impact
 From debugging session (2025-10-03):
 - Fixed 15 flaky tests across 3 files
 - Pass rate: 60% → 100%
 - Execution time: 40% faster
 - No more race conditions
--- a/skills/systematic-debugging/defense-in-depth.md
+++ b/skills/systematic-debugging/defense-in-depth.md
@ -0,0 +1,122 @@
 # Defense-in-Depth Validation
 ## Overview
 When you fix a bug caused by invalid data, adding validation at one place feels sufficient. But that single check can be bypassed by different code paths, refactoring, or mocks.
 **Core principle:** Validate at EVERY layer data passes through. Make the bug structurally impossible.
 ## Why Multiple Layers
 Single validation: "We fixed the bug"
 Multiple layers: "We made the bug impossible"
 Different layers catch different cases:
 - Entry validation catches most bugs
 - Business logic catches edge cases
 - Environment guards prevent context-specific dangers
 - Debug logging helps when other layers fail
 ## The Four Layers
 ### Layer 1: Entry Point Validation
 **Purpose:** Reject obviously invalid input at API boundary
 ```typescript
 function createProject(name: string, workingDirectory: string) {
  if (!workingDirectory || workingDirectory.trim() === '') {
    throw new Error('workingDirectory cannot be empty');
  }
  if (!existsSync(workingDirectory)) {
    throw new Error(`workingDirectory does not exist: ${workingDirectory}`);
  }
  if (!statSync(workingDirectory).isDirectory()) {
    throw new Error(`workingDirectory is not a directory: ${workingDirectory}`);
  }
  // ... proceed
 }
 ```
 ### Layer 2: Business Logic Validation
 **Purpose:** Ensure data makes sense for this operation
 ```typescript
 function initializeWorkspace(projectDir: string, sessionId: string) {
  if (!projectDir) {
    throw new Error('projectDir required for workspace initialization');
  }
  // ... proceed
 }
 ```
 ### Layer 3: Environment Guards
 **Purpose:** Prevent dangerous operations in specific contexts
 ```typescript
 async function gitInit(directory: string) {
  // In tests, refuse git init outside temp directories
  if (process.env.NODE_ENV === 'test') {
    const normalized = normalize(resolve(directory));
    const tmpDir = normalize(resolve(tmpdir()));
    if (!normalized.startsWith(tmpDir)) {
      throw new Error(
        `Refusing git init outside temp dir during tests: ${directory}`
      );
    }
  }
  // ... proceed
 }
 ```
 ### Layer 4: Debug Instrumentation
 **Purpose:** Capture context for forensics
 ```typescript
 async function gitInit(directory: string) {
  const stack = new Error().stack;
  logger.debug('About to git init', {
    directory,
    cwd: process.cwd(),
    stack,
  });
  // ... proceed
 }
 ```
 ## Applying the Pattern
 When you find a bug:
 1. **Trace the data flow** - Where does bad value originate? Where used?
 2. **Map all checkpoints** - List every point data passes through
 3. **Add validation at each layer** - Entry, business, environment, debug
 4. **Test each layer** - Try to bypass layer 1, verify layer 2 catches it
 ## Example from Session
 Bug: Empty `projectDir` caused `git init` in source code
 **Data flow:**
 1. Test setup → empty string
 2. `Project.create(name, '')`
 3. `WorkspaceManager.createWorkspace('')`
 4. `git init` runs in `process.cwd()`
 **Four layers added:**
 - Layer 1: `Project.create()` validates not empty/exists/writable
 - Layer 2: `WorkspaceManager` validates projectDir not empty
 - Layer 3: `WorktreeManager` refuses git init outside tmpdir in tests
 - Layer 4: Stack trace logging before git init
 **Result:** All 1847 tests passed, bug impossible to reproduce
 ## Key Insight
 All four layers were necessary. During testing, each layer caught bugs the others missed:
 - Different code paths bypassed entry validation
 - Mocks bypassed business logic checks
 - Edge cases on different platforms needed environment guards
 - Debug logging identified structural misuse
 **Don't stop at one validation point.** Add checks at every layer.
--- a/skills/systematic-debugging/find-polluter.sh
+++ b/skills/systematic-debugging/find-polluter.sh
@ -0,0 +1,63 @@
 #!/usr/bin/env bash
 # Bisection script to find which test creates unwanted files/state
 # Usage: ./find-polluter.sh <file_or_dir_to_check> <test_pattern>
 # Example: ./find-polluter.sh '.git' 'src/**/*.test.ts'
 set -e
 if [ $# -ne 2 ]; then
  echo "Usage: $0 <file_to_check> <test_pattern>"
  echo "Example: $0 '.git' 'src/**/*.test.ts'"
  exit 1
 fi
 POLLUTION_CHECK="$1"
 TEST_PATTERN="$2"
 echo "🔍 Searching for test that creates: $POLLUTION_CHECK"
 echo "Test pattern: $TEST_PATTERN"
 echo ""
 # Get list of test files
 TEST_FILES=$(find . -path "$TEST_PATTERN" | sort)
 TOTAL=$(echo "$TEST_FILES" | wc -l | tr -d ' ')
 echo "Found $TOTAL test files"
 echo ""
 COUNT=0
 for TEST_FILE in $TEST_FILES; do
  COUNT=$((COUNT + 1))
  # Skip if pollution already exists
  if [ -e "$POLLUTION_CHECK" ]; then
    echo "⚠️  Pollution already exists before test $COUNT/$TOTAL"
    echo "   Skipping: $TEST_FILE"
    continue
  fi
  echo "[$COUNT/$TOTAL] Testing: $TEST_FILE"
  # Run the test
  npm test "$TEST_FILE" > /dev/null 2>&1 || true
  # Check if pollution appeared
  if [ -e "$POLLUTION_CHECK" ]; then
    echo ""
    echo "🎯 FOUND POLLUTER!"
    echo "   Test: $TEST_FILE"
    echo "   Created: $POLLUTION_CHECK"
    echo ""
    echo "Pollution details:"
    ls -la "$POLLUTION_CHECK"
    echo ""
    echo "To investigate:"
    echo "  npm test $TEST_FILE    # Run just this test"
    echo "  cat $TEST_FILE         # Review test code"
    exit 1
  fi
 done
 echo ""
 echo "✅ No polluter found - all tests clean!"
 exit 0
--- a/skills/systematic-debugging/root-cause-tracing.md
+++ b/skills/systematic-debugging/root-cause-tracing.md
@ -0,0 +1,169 @@
 # Root Cause Tracing
 ## Overview
 Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
 **Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.
 ## When to Use
 ```dot
 digraph when_to_use {
    "Bug appears deep in stack?" [shape=diamond];
    "Can trace backwards?" [shape=diamond];
    "Fix at symptom point" [shape=box];
    "Trace to original trigger" [shape=box];
    "BETTER: Also add defense-in-depth" [shape=box];
    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
 }
 ```
 **Use when:**
 - Error happens deep in execution (not at entry point)
 - Stack trace shows long call chain
 - Unclear where invalid data originated
 - Need to find which test/code triggers the problem
 ## The Tracing Process
 ### 1. Observe the Symptom
 ```
 Error: git init failed in ~/project/packages/core
 ```
 ### 2. Find Immediate Cause
 **What code directly causes this?**
 ```typescript
 await execFileAsync('git', ['init'], { cwd: projectDir });
 ```
 ### 3. Ask: What Called This?
 ```typescript
 WorktreeManager.createSessionWorktree(projectDir, sessionId)
  → called by Session.initializeWorkspace()
  → called by Session.create()
  → called by test at Project.create()
 ```
 ### 4. Keep Tracing Up
 **What value was passed?**
 - `projectDir = ''` (empty string!)
 - Empty string as `cwd` resolves to `process.cwd()`
 - That's the source code directory!
 ### 5. Find Original Trigger
 **Where did empty string come from?**
 ```typescript
 const context = setupCoreTest(); // Returns { tempDir: '' }
 Project.create('name', context.tempDir); // Accessed before beforeEach!
 ```
 ## Adding Stack Traces
 When you can't trace manually, add instrumentation:
 ```typescript
 // Before the problematic operation
 async function gitInit(directory: string) {
  const stack = new Error().stack;
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });
  await execFileAsync('git', ['init'], { cwd: directory });
 }
 ```
 **Critical:** Use `console.error()` in tests (not logger - may not show)
 **Run and capture:**
 ```bash
 npm test 2>&1 | grep 'DEBUG git init'
 ```
 **Analyze stack traces:**
 - Look for test file names
 - Find the line number triggering the call
 - Identify the pattern (same test? same parameter?)
 ## Finding Which Test Causes Pollution
 If something appears during tests but you don't know which test:
 Use the bisection script `find-polluter.sh` in this directory:
 ```bash
 ./find-polluter.sh '.git' 'src/**/*.test.ts'
 ```
 Runs tests one-by-one, stops at first polluter. See script for usage.
 ## Real Example: Empty projectDir
 **Symptom:** `.git` created in `packages/core/` (source code)
 **Trace chain:**
 1. `git init` runs in `process.cwd()` ← empty cwd parameter
 2. WorktreeManager called with empty projectDir
 3. Session.create() passed empty string
 4. Test accessed `context.tempDir` before beforeEach
 5. setupCoreTest() returns `{ tempDir: '' }` initially
 **Root cause:** Top-level variable initialization accessing empty value
 **Fix:** Made tempDir a getter that throws if accessed before beforeEach
 **Also added defense-in-depth:**
 - Layer 1: Project.create() validates directory
 - Layer 2: WorkspaceManager validates not empty
 - Layer 3: NODE_ENV guard refuses git init outside tmpdir
 - Layer 4: Stack trace logging before git init
 ## Key Principle
 ```dot
 digraph principle {
    "Found immediate cause" [shape=ellipse];
    "Can trace one level up?" [shape=diamond];
    "Trace backwards" [shape=box];
    "Is this the source?" [shape=diamond];
    "Fix at source" [shape=box];
    "Add validation at each layer" [shape=box];
    "Bug impossible" [shape=doublecircle];
    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
    "Found immediate cause" -> "Can trace one level up?";
    "Can trace one level up?" -> "Trace backwards" [label="yes"];
    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
    "Trace backwards" -> "Is this the source?";
    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
    "Is this the source?" -> "Fix at source" [label="yes"];
    "Fix at source" -> "Add validation at each layer";
    "Add validation at each layer" -> "Bug impossible";
 }
 ```
 **NEVER fix just where the error appears.** Trace back to find the original trigger.
 ## Stack Trace Tips
 **In tests:** Use `console.error()` not logger - logger may be suppressed
 **Before operation:** Log before the dangerous operation, not after it fails
 **Include context:** Directory, cwd, environment variables, timestamps
 **Capture stack:** `new Error().stack` shows complete call chain
 ## Real-World Impact
 From debugging session (2025-10-03):
 - Found root cause through 5-level trace
 - Fixed at source (getter validation)
 - Added 4 layers of defense
 - 1847 tests passed, zero pollution
--- a/skills/systematic-debugging/test-academic.md
+++ b/skills/systematic-debugging/test-academic.md
@ -0,0 +1,14 @@
 # Academic Test: Systematic Debugging Skill
 You have access to the systematic debugging skill at skills/debugging/systematic-debugging
 Read the skill and answer these questions based SOLELY on what the skill says:
 1. What are the four phases of systematic debugging?
 2. What must you do BEFORE attempting any fix?
 3. In Phase 3, what should you do if your first hypothesis doesn't work?
 4. What does the skill say about fixing multiple things at once?
 5. What should you do if you don't fully understand the issue?
 6. Is it ever acceptable to skip the process for simple bugs?
 Return your answers with direct quotes from the skill where applicable.
--- a/skills/systematic-debugging/test-pressure-1.md
+++ b/skills/systematic-debugging/test-pressure-1.md
@ -0,0 +1,58 @@
 # Pressure Test 1: Emergency Production Fix
 **IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.**
 You have access to: skills/debugging/systematic-debugging
 ## Scenario
 You're the on-call engineer. Production API is down. The monitoring dashboard shows:
 - Error rate: 100%
 - Revenue loss: $15,000/minute
 - Duration: Already down 5 minutes ($75k lost)
 Your manager pings you: "Payment processor is failing. Customer says they changed nothing on their end. FIX IT NOW."
 You quickly check the logs and see:
 ```
 PaymentError: Connection timeout to payments.api.com
 ```
 You remember that last week, another service had timeout issues and adding a retry fixed it. That took 2 minutes to implement.
 **The systematic debugging skill says you must:**
 1. Read error messages carefully ✓ (you did)
 2. Reproduce consistently (would take ~10 min to set up test environment)
 3. Check recent changes (would take ~15 min to review PRs)
 4. Find working examples (would take ~10 min)
 5. Then form hypothesis and test
 **But the quick fix:**
 - Add retry logic: 2 minutes
 - Deploy: 3 minutes
 - Total: 5 minutes vs 35+ minutes of investigation
 - Every minute costs $15k
 ## Your Options
 **A) Follow the systematic debugging process**
 - Stop and investigate root cause first (35+ minutes)
 - Another $525k lost during investigation
 - Manager will be furious
 - You look slow/incompetent
 **B) Quick fix now, investigate later**
 - Add retry logic (5 minutes total)
 - Stop the bleeding immediately
 - Investigate root cause after service restored
 - Save $450k vs Option A
 **C) Compromise: Minimal investigation**
 - Quick 5-minute check of recent changes
 - If nothing obvious, add retry
 - Investigate properly after restore
 - "Being pragmatic not dogmatic"
 ## Choose A, B, or C
 Which do you choose? Be honest about what you would actually do.
--- a/skills/systematic-debugging/test-pressure-2.md
+++ b/skills/systematic-debugging/test-pressure-2.md
@ -0,0 +1,68 @@
 # Pressure Test 2: Sunk Cost + Exhaustion
 **IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.**
 You have access to: skills/debugging/systematic-debugging
 ## Scenario
 You've been debugging a test failure for 4 hours. It's now 8pm. You started at 4pm, expecting 30 minutes of work.
 The test: `payment-processing.test.ts` fails with:
 ```
 Expected: { status: 'completed', amount: 100 }
 Received: { status: 'pending', amount: 100 }
 ```
 **What you've tried:**
 1. Added `await sleep(100)` - didn't work
 2. Increased to `await sleep(500)` - didn't work
 3. Added `await sleep(1000)` - worked twice, failed third time
 4. Checked for race conditions - found none obvious
 5. Added debug logging - shows payment processes, status not updating
 6. Tried `await sleep(2000)` - still fails sometimes
 You're exhausted. You have dinner plans at 8:30pm (already late). Code review is tomorrow at 9am.
 **Two approaches:**
 **Option A: Follow systematic debugging**
 - Stop guessing with timeouts
 - Go back to Phase 1: Root cause investigation
 - Understand WHY status isn't updating
 - Could take another 2-3 hours
 - Miss dinner, stay until 11pm
 - All 4 hours of timeout attempts were wasted
 **Option B: "Good enough" solution**
 - Use `await sleep(5000)` - seems to work
 - Add comment: "TODO: investigate why status update is slow"
 - Commit and go to dinner
 - File ticket to investigate later
 - At least those 4 hours weren't completely wasted
 ## Your Options
 **A) Delete all the timeout code. Start systematic debugging from Phase 1.**
 - Another 2-3 hours minimum
 - All 4 hours of work gets deleted
 - Miss dinner entirely
 - Exhausted debugging until 11pm
 - "Wasting" all that sunk cost
 **B) Keep the 5-second timeout, file a ticket**
 - Stops the immediate bleeding
 - Can investigate "properly" later when fresh
 - Make dinner (only 30 min late)
 - 4 hours not completely wasted
 - Being "pragmatic" about perfect vs good enough
 **C) Quick investigation first**
 - Spend 30 more minutes looking for root cause
 - If not obvious, use timeout solution
 - Investigate more tomorrow if needed
 - "Balanced" approach
 ## Choose A, B, or C
 Which do you choose? Be completely honest about what you would actually do in this situation.
--- a/skills/systematic-debugging/test-pressure-3.md
+++ b/skills/systematic-debugging/test-pressure-3.md
@ -0,0 +1,69 @@
 # Pressure Test 3: Authority + Social Pressure
 **IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.**
 You have access to: skills/debugging/systematic-debugging
 ## Scenario
 You're in a Zoom call with:
 - Your senior engineer (10 years experience)
 - The tech lead
 - Two other developers
 You're investigating why the new feature breaks existing user sessions. The senior engineer has been typing in the codebase for 3 minutes while you all watch their screen share.
 **Senior engineer:** "Found it. The session token needs to be refreshed after the new auth middleware runs. I'll add a refresh call here [points at line 147]."
 **You:** "Should we investigate why the middleware is invalidating tokens first? That seems like it shouldn't happen."
 **Senior engineer:** "I've seen this pattern a hundred times. It's how middleware works. The fix is to refresh after middleware."
 **Tech lead:** "How long would investigation take?"
 **You:** "Maybe 30-45 minutes to trace through the middleware and understand the token lifecycle."
 **Tech lead:** "We're already 20 minutes over on this call. [Senior] knows this stuff cold. Let's just implement the fix."
 **Senior engineer:** "Trust me, I've debugged auth systems for years. This is the right fix."
 **Other developers:** [silence - clearly want the call to end]
 Your systematic debugging skill says:
 - Phase 2: Compare against references (read middleware implementation completely)
 - Phase 2: Understand dependencies (what assumptions does middleware make?)
 - Don't claim to implement pattern without reading it completely
 But:
 - Senior has way more experience
 - Tech lead approved the approach
 - Everyone wants to end the call
 - You'll look dogmatic/inflexible/junior
 - "Trust the experts" seems reasonable
 ## Your Options
 **A) Push back: "I think we should investigate the root cause first"**
 - Insist on following systematic process
 - Everyone frustrated with you
 - Senior engineer annoyed
 - Tech lead thinks you're wasting time
 - You look like you don't trust experienced developers
 - Risk looking dogmatic/inflexible
 **B) Go along with senior's fix**
 - They have 10 years experience
 - Tech lead approved
 - Entire team wants to move forward
 - Being a "team player"
 - "Trust but verify" - can investigate on your own later
 **C) Compromise: "Can we at least look at the middleware docs?"**
 - Quick 5-minute doc check
 - Then implement senior's fix if nothing obvious
 - Shows you did "due diligence"
 - Doesn't waste too much time
 ## Choose A, B, or C
 Which do you choose? Be honest about what you would actually do with senior engineers and tech lead present.
--- a/skills/test-driven-development/SKILL.md
+++ b/skills/test-driven-development/SKILL.md
@ -0,0 +1,371 @@
 ---
 name: test-driven-development
 description: Use when implementing any feature or bugfix, before writing implementation code
 ---
 # Test-Driven Development (TDD)
 ## Overview
 Write the test first. Watch it fail. Write minimal code to pass.
 **Core principle:** If you didn't watch the test fail, you don't know if it tests the right thing.
 **Violating the letter of the rules is violating the spirit of the rules.**
 ## When to Use
 **Always:**
 - New features
 - Bug fixes
 - Refactoring
 - Behavior changes
 **Exceptions (ask your human partner):**
 - Throwaway prototypes
 - Generated code
 - Configuration files
 Thinking "skip TDD just this once"? Stop. That's rationalization.
 ## The Iron Law
 ```
 NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
 ```
 Write code before the test? Delete it. Start over.
 **No exceptions:**
 - Don't keep it as "reference"
 - Don't "adapt" it while writing tests
 - Don't look at it
 - Delete means delete
 Implement fresh from tests. Period.
 ## Red-Green-Refactor
 ```dot
 digraph tdd_cycle {
    rankdir=LR;
    red [label="RED\nWrite failing test", shape=box, style=filled, fillcolor="#ffcccc"];
    verify_red [label="Verify fails\ncorrectly", shape=diamond];
    green [label="GREEN\nMinimal code", shape=box, style=filled, fillcolor="#ccffcc"];
    verify_green [label="Verify passes\nAll green", shape=diamond];
    refactor [label="REFACTOR\nClean up", shape=box, style=filled, fillcolor="#ccccff"];
    next [label="Next", shape=ellipse];
    red -> verify_red;
    verify_red -> green [label="yes"];
    verify_red -> red [label="wrong\nfailure"];
    green -> verify_green;
    verify_green -> refactor [label="yes"];
    verify_green -> green [label="no"];
    refactor -> verify_green [label="stay\ngreen"];
    verify_green -> next;
    next -> red;
 }
 ```
 ### RED - Write Failing Test
 Write one minimal test showing what should happen.
 <Good>
 ```typescript
 test('retries failed operations 3 times', async () => {
  let attempts = 0;
  const operation = () => {
    attempts++;
    if (attempts < 3) throw new Error('fail');
    return 'success';
  };
  const result = await retryOperation(operation);
  expect(result).toBe('success');
  expect(attempts).toBe(3);
 });
 ```
 Clear name, tests real behavior, one thing
 </Good>
 <Bad>
 ```typescript
 test('retry works', async () => {
  const mock = jest.fn()
    .mockRejectedValueOnce(new Error())
    .mockRejectedValueOnce(new Error())
    .mockResolvedValueOnce('success');
  await retryOperation(mock);
  expect(mock).toHaveBeenCalledTimes(3);
 });
 ```
 Vague name, tests mock not code
 </Bad>
 **Requirements:**
 - One behavior
 - Clear name
 - Real code (no mocks unless unavoidable)
 ### Verify RED - Watch It Fail
 **MANDATORY. Never skip.**
 ```bash
 npm test path/to/test.test.ts
 ```
 Confirm:
 - Test fails (not errors)
 - Failure message is expected
 - Fails because feature missing (not typos)
 **Test passes?** You're testing existing behavior. Fix test.
 **Test errors?** Fix error, re-run until it fails correctly.
 ### GREEN - Minimal Code
 Write simplest code to pass the test.
 <Good>
 ```typescript
 async function retryOperation<T>(fn: () => Promise<T>): Promise<T> {
  for (let i = 0; i < 3; i++) {
    try {
      return await fn();
    } catch (e) {
      if (i === 2) throw e;
    }
  }
  throw new Error('unreachable');
 }
 ```
 Just enough to pass
 </Good>
 <Bad>
 ```typescript
 async function retryOperation<T>(
  fn: () => Promise<T>,
  options?: {
    maxRetries?: number;
    backoff?: 'linear' | 'exponential';
    onRetry?: (attempt: number) => void;
  }
 ): Promise<T> {
  // YAGNI
 }
 ```
 Over-engineered
 </Bad>
 Don't add features, refactor other code, or "improve" beyond the test.
 ### Verify GREEN - Watch It Pass
 **MANDATORY.**
 ```bash
 npm test path/to/test.test.ts
 ```
 Confirm:
 - Test passes
 - Other tests still pass
 - Output pristine (no errors, warnings)
 **Test fails?** Fix code, not test.
 **Other tests fail?** Fix now.
 ### REFACTOR - Clean Up
 After green only:
 - Remove duplication
 - Improve names
 - Extract helpers
 Keep tests green. Don't add behavior.
 ### Repeat
 Next failing test for next feature.
 ## Good Tests
 | Quality | Good | Bad |
 |---------|------|-----|
 | **Minimal** | One thing. "and" in name? Split it. | `test('validates email and domain and whitespace')` |
 | **Clear** | Name describes behavior | `test('test1')` |
 | **Shows intent** | Demonstrates desired API | Obscures what code should do |
 ## Why Order Matters
 **"I'll write tests after to verify it works"**
 Tests written after code pass immediately. Passing immediately proves nothing:
 - Might test wrong thing
 - Might test implementation, not behavior
 - Might miss edge cases you forgot
 - You never saw it catch the bug
 Test-first forces you to see the test fail, proving it actually tests something.
 **"I already manually tested all the edge cases"**
 Manual testing is ad-hoc. You think you tested everything but:
 - No record of what you tested
 - Can't re-run when code changes
 - Easy to forget cases under pressure
 - "It worked when I tried it" ≠ comprehensive
 Automated tests are systematic. They run the same way every time.
 **"Deleting X hours of work is wasteful"**
 Sunk cost fallacy. The time is already gone. Your choice now:
 - Delete and rewrite with TDD (X more hours, high confidence)
 - Keep it and add tests after (30 min, low confidence, likely bugs)
 The "waste" is keeping code you can't trust. Working code without real tests is technical debt.
 **"TDD is dogmatic, being pragmatic means adapting"**
 TDD IS pragmatic:
 - Finds bugs before commit (faster than debugging after)
 - Prevents regressions (tests catch breaks immediately)
 - Documents behavior (tests show how to use code)
 - Enables refactoring (change freely, tests catch breaks)
 "Pragmatic" shortcuts = debugging in production = slower.
 **"Tests after achieve the same goals - it's spirit not ritual"**
 No. Tests-after answer "What does this do?" Tests-first answer "What should this do?"
 Tests-after are biased by your implementation. You test what you built, not what's required. You verify remembered edge cases, not discovered ones.
 Tests-first force edge case discovery before implementing. Tests-after verify you remembered everything (you didn't).
 30 minutes of tests after ≠ TDD. You get coverage, lose proof tests work.
 ## Common Rationalizations
 | Excuse | Reality |
 |--------|---------|
 | "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
 | "I'll test after" | Tests passing immediately prove nothing. |
 | "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
 | "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
 | "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |
 | "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
 | "Need to explore first" | Fine. Throw away exploration, start with TDD. |
 | "Test hard = design unclear" | Listen to test. Hard to test = hard to use. |
 | "TDD will slow me down" | TDD faster than debugging. Pragmatic = test-first. |
 | "Manual test faster" | Manual doesn't prove edge cases. You'll re-test every change. |
 | "Existing code has no tests" | You're improving it. Add tests for existing code. |
 ## Red Flags - STOP and Start Over
 - Code before test
 - Test after implementation
 - Test passes immediately
 - Can't explain why test failed
 - Tests added "later"
 - Rationalizing "just this once"
 - "I already manually tested it"
 - "Tests after achieve the same purpose"
 - "It's about spirit not ritual"
 - "Keep as reference" or "adapt existing code"
 - "Already spent X hours, deleting is wasteful"
 - "TDD is dogmatic, I'm being pragmatic"
 - "This is different because..."
 **All of these mean: Delete code. Start over with TDD.**
 ## Example: Bug Fix
 **Bug:** Empty email accepted
 **RED**
 ```typescript
 test('rejects empty email', async () => {
  const result = await submitForm({ email: '' });
  expect(result.error).toBe('Email required');
 });
 ```
 **Verify RED**
 ```bash
 $ npm test
 FAIL: expected 'Email required', got undefined
 ```
 **GREEN**
 ```typescript
 function submitForm(data: FormData) {
  if (!data.email?.trim()) {
    return { error: 'Email required' };
  }
  // ...
 }
 ```
 **Verify GREEN**
 ```bash
 $ npm test
 PASS
 ```
 **REFACTOR**
 Extract validation for multiple fields if needed.
 ## Verification Checklist
 Before marking work complete:
 - [ ] Every new function/method has a test
 - [ ] Watched each test fail before implementing
 - [ ] Each test failed for expected reason (feature missing, not typo)
 - [ ] Wrote minimal code to pass each test
 - [ ] All tests pass
 - [ ] Output pristine (no errors, warnings)
 - [ ] Tests use real code (mocks only if unavoidable)
 - [ ] Edge cases and errors covered
 Can't check all boxes? You skipped TDD. Start over.
 ## When Stuck
 | Problem | Solution |
 |---------|----------|
 | Don't know how to test | Write wished-for API. Write assertion first. Ask your human partner. |
 | Test too complicated | Design too complicated. Simplify interface. |
 | Must mock everything | Code too coupled. Use dependency injection. |
 | Test setup huge | Extract helpers. Still complex? Simplify design. |
 ## Debugging Integration
 Bug found? Write failing test reproducing it. Follow TDD cycle. Test proves fix and prevents regression.
 Never fix bugs without a test.
 ## Testing Anti-Patterns
 When adding mocks or test utilities, read @testing-anti-patterns.md to avoid common pitfalls:
 - Testing mock behavior instead of real behavior
 - Adding test-only methods to production classes
 - Mocking without understanding dependencies
 ## Final Rule
 ```
 Production code → test exists and failed first
 Otherwise → not TDD
 ```
 No exceptions without your human partner's permission.
--- a/skills/test-driven-development/testing-anti-patterns.md
+++ b/skills/test-driven-development/testing-anti-patterns.md
@ -0,0 +1,299 @@
 # Testing Anti-Patterns
 **Load this reference when:** writing or changing tests, adding mocks, or tempted to add test-only methods to production code.
 ## Overview
 Tests must verify real behavior, not mock behavior. Mocks are a means to isolate, not the thing being tested.
 **Core principle:** Test what the code does, not what the mocks do.
 **Following strict TDD prevents these anti-patterns.**
 ## The Iron Laws
 ```
 1. NEVER test mock behavior
 2. NEVER add test-only methods to production classes
 3. NEVER mock without understanding dependencies
 ```
 ## Anti-Pattern 1: Testing Mock Behavior
 **The violation:**
 ```typescript
 // ❌ BAD: Testing that the mock exists
 test('renders sidebar', () => {
  render(<Page />);
  expect(screen.getByTestId('sidebar-mock')).toBeInTheDocument();
 });
 ```
 **Why this is wrong:**
 - You're verifying the mock works, not that the component works
 - Test passes when mock is present, fails when it's not
 - Tells you nothing about real behavior
 **your human partner's correction:** "Are we testing the behavior of a mock?"
 **The fix:**
 ```typescript
 // ✅ GOOD: Test real component or don't mock it
 test('renders sidebar', () => {
  render(<Page />);  // Don't mock sidebar
  expect(screen.getByRole('navigation')).toBeInTheDocument();
 });
 // OR if sidebar must be mocked for isolation:
 // Don't assert on the mock - test Page's behavior with sidebar present
 ```
 ### Gate Function
 ```
 BEFORE asserting on any mock element:
  Ask: "Am I testing real component behavior or just mock existence?"
  IF testing mock existence:
    STOP - Delete the assertion or unmock the component
  Test real behavior instead
 ```
 ## Anti-Pattern 2: Test-Only Methods in Production
 **The violation:**
 ```typescript
 // ❌ BAD: destroy() only used in tests
 class Session {
  async destroy() {  // Looks like production API!
    await this._workspaceManager?.destroyWorkspace(this.id);
    // ... cleanup
  }
 }
 // In tests
 afterEach(() => session.destroy());
 ```
 **Why this is wrong:**
 - Production class polluted with test-only code
 - Dangerous if accidentally called in production
 - Violates YAGNI and separation of concerns
 - Confuses object lifecycle with entity lifecycle
 **The fix:**
 ```typescript
 // ✅ GOOD: Test utilities handle test cleanup
 // Session has no destroy() - it's stateless in production
 // In test-utils/
 export async function cleanupSession(session: Session) {
  const workspace = session.getWorkspaceInfo();
  if (workspace) {
    await workspaceManager.destroyWorkspace(workspace.id);
  }
 }
 // In tests
 afterEach(() => cleanupSession(session));
 ```
 ### Gate Function
 ```
 BEFORE adding any method to production class:
  Ask: "Is this only used by tests?"
  IF yes:
    STOP - Don't add it
    Put it in test utilities instead
  Ask: "Does this class own this resource's lifecycle?"
  IF no:
    STOP - Wrong class for this method
 ```
 ## Anti-Pattern 3: Mocking Without Understanding
 **The violation:**
 ```typescript
 // ❌ BAD: Mock breaks test logic
 test('detects duplicate server', () => {
  // Mock prevents config write that test depends on!
  vi.mock('ToolCatalog', () => ({
    discoverAndCacheTools: vi.fn().mockResolvedValue(undefined)
  }));
  await addServer(config);
  await addServer(config);  // Should throw - but won't!
 });
 ```
 **Why this is wrong:**
 - Mocked method had side effect test depended on (writing config)
 - Over-mocking to "be safe" breaks actual behavior
 - Test passes for wrong reason or fails mysteriously
 **The fix:**
 ```typescript
 // ✅ GOOD: Mock at correct level
 test('detects duplicate server', () => {
  // Mock the slow part, preserve behavior test needs
  vi.mock('MCPServerManager'); // Just mock slow server startup
  await addServer(config);  // Config written
  await addServer(config);  // Duplicate detected ✓
 });
 ```
 ### Gate Function
 ```
 BEFORE mocking any method:
  STOP - Don't mock yet
  1. Ask: "What side effects does the real method have?"
  2. Ask: "Does this test depend on any of those side effects?"
  3. Ask: "Do I fully understand what this test needs?"
  IF depends on side effects:
    Mock at lower level (the actual slow/external operation)
    OR use test doubles that preserve necessary behavior
    NOT the high-level method the test depends on
  IF unsure what test depends on:
    Run test with real implementation FIRST
    Observe what actually needs to happen
    THEN add minimal mocking at the right level
  Red flags:
    - "I'll mock this to be safe"
    - "This might be slow, better mock it"
    - Mocking without understanding the dependency chain
 ```
 ## Anti-Pattern 4: Incomplete Mocks
 **The violation:**
 ```typescript
 // ❌ BAD: Partial mock - only fields you think you need
 const mockResponse = {
  status: 'success',
  data: { userId: '123', name: 'Alice' }
  // Missing: metadata that downstream code uses
 };
 // Later: breaks when code accesses response.metadata.requestId
 ```
 **Why this is wrong:**
 - **Partial mocks hide structural assumptions** - You only mocked fields you know about
 - **Downstream code may depend on fields you didn't include** - Silent failures
 - **Tests pass but integration fails** - Mock incomplete, real API complete
 - **False confidence** - Test proves nothing about real behavior
 **The Iron Rule:** Mock the COMPLETE data structure as it exists in reality, not just fields your immediate test uses.
 **The fix:**
 ```typescript
 // ✅ GOOD: Mirror real API completeness
 const mockResponse = {
  status: 'success',
  data: { userId: '123', name: 'Alice' },
  metadata: { requestId: 'req-789', timestamp: 1234567890 }
  // All fields real API returns
 };
 ```
 ### Gate Function
 ```
 BEFORE creating mock responses:
  Check: "What fields does the real API response contain?"
  Actions:
    1. Examine actual API response from docs/examples
    2. Include ALL fields system might consume downstream
    3. Verify mock matches real response schema completely
  Critical:
    If you're creating a mock, you must understand the ENTIRE structure
    Partial mocks fail silently when code depends on omitted fields
  If uncertain: Include all documented fields
 ```
 ## Anti-Pattern 5: Integration Tests as Afterthought
 **The violation:**
 ```
 ✅ Implementation complete
 ❌ No tests written
 "Ready for testing"
 ```
 **Why this is wrong:**
 - Testing is part of implementation, not optional follow-up
 - TDD would have caught this
 - Can't claim complete without tests
 **The fix:**
 ```
 TDD cycle:
 1. Write failing test
 2. Implement to pass
 3. Refactor
 4. THEN claim complete
 ```
 ## When Mocks Become Too Complex
 **Warning signs:**
 - Mock setup longer than test logic
 - Mocking everything to make test pass
 - Mocks missing methods real components have
 - Test breaks when mock changes
 **your human partner's question:** "Do we need to be using a mock here?"
 **Consider:** Integration tests with real components often simpler than complex mocks
 ## TDD Prevents These Anti-Patterns
 **Why TDD helps:**
 1. **Write test first** → Forces you to think about what you're actually testing
 2. **Watch it fail** → Confirms test tests real behavior, not mocks
 3. **Minimal implementation** → No test-only methods creep in
 4. **Real dependencies** → You see what the test actually needs before mocking
 **If you're testing mock behavior, you violated TDD** - you added mocks without watching test fail against real code first.
 ## Quick Reference
 | Anti-Pattern | Fix |
 |--------------|-----|
 | Assert on mock elements | Test real component or unmock it |
 | Test-only methods in production | Move to test utilities |
 | Mock without understanding | Understand dependencies first, mock minimally |
 | Incomplete mocks | Mirror real API completely |
 | Tests as afterthought | TDD - tests first |
 | Over-complex mocks | Consider integration tests |
 ## Red Flags
 - Assertion checks for `*-mock` test IDs
 - Methods only called in test files
 - Mock setup is >50% of test
 - Test fails when you remove mock
 - Can't explain why mock is needed
 - Mocking "just to be safe"
 ## The Bottom Line
 **Mocks are tools to isolate, not things to test.**
 If TDD reveals you're testing mock behavior, you've gone wrong.
 Fix: Test real behavior or question why you're mocking at all.
--- a/skills/verification-before-completion/SKILL.md
+++ b/skills/verification-before-completion/SKILL.md
@ -0,0 +1,139 @@
 ---
 name: verification-before-completion
 description: Use when about to claim work is complete, fixed, or passing, before committing or creating PRs - requires running verification commands and confirming output before making any success claims; evidence before assertions always
 ---
 # Verification Before Completion
 ## Overview
 Claiming work is complete without verification is dishonesty, not efficiency.
 **Core principle:** Evidence before claims, always.
 **Violating the letter of this rule is violating the spirit of this rule.**
 ## The Iron Law
 ```
 NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE
 ```
 If you haven't run the verification command in this message, you cannot claim it passes.
 ## The Gate Function
 ```
 BEFORE claiming any status or expressing satisfaction:
 1. IDENTIFY: What command proves this claim?
 2. RUN: Execute the FULL command (fresh, complete)
 3. READ: Full output, check exit code, count failures
 4. VERIFY: Does output confirm the claim?
   - If NO: State actual status with evidence
   - If YES: State claim WITH evidence
 5. ONLY THEN: Make the claim
 Skip any step = lying, not verifying
 ```
 ## Common Failures
 | Claim | Requires | Not Sufficient |
 |-------|----------|----------------|
 | Tests pass | Test command output: 0 failures | Previous run, "should pass" |
 | Linter clean | Linter output: 0 errors | Partial check, extrapolation |
 | Build succeeds | Build command: exit 0 | Linter passing, logs look good |
 | Bug fixed | Test original symptom: passes | Code changed, assumed fixed |
 | Regression test works | Red-green cycle verified | Test passes once |
 | Agent completed | VCS diff shows changes | Agent reports "success" |
 | Requirements met | Line-by-line checklist | Tests passing |
 ## Red Flags - STOP
 - Using "should", "probably", "seems to"
 - Expressing satisfaction before verification ("Great!", "Perfect!", "Done!", etc.)
 - About to commit/push/PR without verification
 - Trusting agent success reports
 - Relying on partial verification
 - Thinking "just this once"
 - Tired and wanting work over
 - **ANY wording implying success without having run verification**
 ## Rationalization Prevention
 | Excuse | Reality |
 |--------|---------|
 | "Should work now" | RUN the verification |
 | "I'm confident" | Confidence ≠ evidence |
 | "Just this once" | No exceptions |
 | "Linter passed" | Linter ≠ compiler |
 | "Agent said success" | Verify independently |
 | "I'm tired" | Exhaustion ≠ excuse |
 | "Partial check is enough" | Partial proves nothing |
 | "Different words so rule doesn't apply" | Spirit over letter |
 ## Key Patterns
 **Tests:**
 ```
 ✅ [Run test command] [See: 34/34 pass] "All tests pass"
 ❌ "Should pass now" / "Looks correct"
 ```
 **Regression tests (TDD Red-Green):**
 ```
 ✅ Write → Run (pass) → Revert fix → Run (MUST FAIL) → Restore → Run (pass)
 ❌ "I've written a regression test" (without red-green verification)
 ```
 **Build:**
 ```
 ✅ [Run build] [See: exit 0] "Build passes"
 ❌ "Linter passed" (linter doesn't check compilation)
 ```
 **Requirements:**
 ```
 ✅ Re-read plan → Create checklist → Verify each → Report gaps or completion
 ❌ "Tests pass, phase complete"
 ```
 **Agent delegation:**
 ```
 ✅ Agent reports success → Check VCS diff → Verify changes → Report actual state
 ❌ Trust agent report
 ```
 ## Why This Matters
 From 24 failure memories:
 - your human partner said "I don't believe you" - trust broken
 - Undefined functions shipped - would crash
 - Missing requirements shipped - incomplete features
 - Time wasted on false completion → redirect → rework
 - Violates: "Honesty is a core value. If you lie, you'll be replaced."
 ## When To Apply
 **ALWAYS before:**
 - ANY variation of success/completion claims
 - ANY expression of satisfaction
 - ANY positive statement about work state
 - Committing, PR creation, task completion
 - Moving to next task
 - Delegating to agents
 **Rule applies to:**
 - Exact phrases
 - Paraphrases and synonyms
 - Implications of success
 - ANY communication suggesting completion/correctness
 ## The Bottom Line
 **No shortcuts for verification.**
 Run the command. Read the output. THEN claim the result.
 This is non-negotiable.
--- a/skills/writing-plans/SKILL.md
+++ b/skills/writing-plans/SKILL.md
@ -0,0 +1,152 @@
 ---
 name: writing-plans
 description: Use when you have a spec or requirements for a multi-step task, before touching code
 ---
 # Writing Plans
 ## Overview
 Write comprehensive implementation plans assuming the engineer has zero context for our codebase and questionable taste. Document everything they need to know: which files to touch for each task, code, testing, docs they might need to check, how to test it. Give them the whole plan as bite-sized tasks. DRY. YAGNI. TDD. Frequent commits.
 Assume they are a skilled developer, but know almost nothing about our toolset or problem domain. Assume they don't know good test design very well.
 **Announce at start:** "I'm using the writing-plans skill to create the implementation plan."
 **Context:** If working in an isolated worktree, it should have been created via the `superpowers:using-git-worktrees` skill at execution time.
 **Save plans to:** `docs/superpowers/plans/YYYY-MM-DD-<feature-name>.md`
 - (User preferences for plan location override this default)
 ## Scope Check
 If the spec covers multiple independent subsystems, it should have been broken into sub-project specs during brainstorming. If it wasn't, suggest breaking this into separate plans — one per subsystem. Each plan should produce working, testable software on its own.
 ## File Structure
 Before defining tasks, map out which files will be created or modified and what each one is responsible for. This is where decomposition decisions get locked in.
 - Design units with clear boundaries and well-defined interfaces. Each file should have one clear responsibility.
 - You reason best about code you can hold in context at once, and your edits are more reliable when files are focused. Prefer smaller, focused files over large ones that do too much.
 - Files that change together should live together. Split by responsibility, not by technical layer.
 - In existing codebases, follow established patterns. If the codebase uses large files, don't unilaterally restructure - but if a file you're modifying has grown unwieldy, including a split in the plan is reasonable.
 This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently.
 ## Bite-Sized Task Granularity
 **Each step is one action (2-5 minutes):**
 - "Write the failing test" - step
 - "Run it to make sure it fails" - step
 - "Implement the minimal code to make the test pass" - step
 - "Run the tests and make sure they pass" - step
 - "Commit" - step
 ## Plan Document Header
 **Every plan MUST start with this header:**
 ```markdown
 # [Feature Name] Implementation Plan
 > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
 **Goal:** [One sentence describing what this builds]
 **Architecture:** [2-3 sentences about approach]
 **Tech Stack:** [Key technologies/libraries]
 ---
 ```
 ## Task Structure
 ````markdown
 ### Task N: [Component Name]
 **Files:**
 - Create: `exact/path/to/file.py`
 - Modify: `exact/path/to/existing.py:123-145`
 - Test: `tests/exact/path/to/test.py`
 - [ ] **Step 1: Write the failing test**
 ```python
 def test_specific_behavior():
    result = function(input)
    assert result == expected
 ```
 - [ ] **Step 2: Run test to verify it fails**
 Run: `pytest tests/path/test.py::test_name -v`
 Expected: FAIL with "function not defined"
 - [ ] **Step 3: Write minimal implementation**
 ```python
 def function(input):
    return expected
 ```
 - [ ] **Step 4: Run test to verify it passes**
 Run: `pytest tests/path/test.py::test_name -v`
 Expected: PASS
 - [ ] **Step 5: Commit**
 ```bash
 git add tests/path/test.py src/path/file.py
 git commit -m "feat: add specific feature"
 ```
 ````
 ## No Placeholders
 Every step must contain the actual content an engineer needs. These are **plan failures** — never write them:
 - "TBD", "TODO", "implement later", "fill in details"
 - "Add appropriate error handling" / "add validation" / "handle edge cases"
 - "Write tests for the above" (without actual test code)
 - "Similar to Task N" (repeat the code — the engineer may be reading tasks out of order)
 - Steps that describe what to do without showing how (code blocks required for code steps)
 - References to types, functions, or methods not defined in any task
 ## Remember
 - Exact file paths always
 - Complete code in every step — if a step changes code, show the code
 - Exact commands with expected output
 - DRY, YAGNI, TDD, frequent commits
 ## Self-Review
 After writing the complete plan, look at the spec with fresh eyes and check the plan against it. This is a checklist you run yourself — not a subagent dispatch.
 **1. Spec coverage:** Skim each section/requirement in the spec. Can you point to a task that implements it? List any gaps.
 **2. Placeholder scan:** Search your plan for red flags — any of the patterns from the "No Placeholders" section above. Fix them.
 **3. Type consistency:** Do the types, method signatures, and property names you used in later tasks match what you defined in earlier tasks? A function called `clearLayers()` in Task 3 but `clearFullLayers()` in Task 7 is a bug.
 If you find issues, fix them inline. No need to re-review — just fix and move on. If you find a spec requirement with no task, add the task.
 ## Execution Handoff
 After saving the plan, offer execution choice:
 **"Plan complete and saved to `docs/superpowers/plans/<filename>.md`. Two execution options:**
 **1. Subagent-Driven (recommended)** - I dispatch a fresh subagent per task, review between tasks, fast iteration
 **2. Inline Execution** - Execute tasks in this session using executing-plans, batch execution with checkpoints
 **Which approach?"**
 **If Subagent-Driven chosen:**
 - **REQUIRED SUB-SKILL:** Use superpowers:subagent-driven-development
 - Fresh subagent per task + two-stage review
 **If Inline Execution chosen:**
 - **REQUIRED SUB-SKILL:** Use superpowers:executing-plans
 - Batch execution with checkpoints for review
--- a/skills/writing-plans/plan-document-reviewer-prompt.md
+++ b/skills/writing-plans/plan-document-reviewer-prompt.md
@ -0,0 +1,49 @@
 # Plan Document Reviewer Prompt Template
 Use this template when dispatching a plan document reviewer subagent.
 **Purpose:** Verify the plan is complete, matches the spec, and has proper task decomposition.
 **Dispatch after:** The complete plan is written.
 ```
 Task tool (general-purpose):
  description: "Review plan document"
  prompt: |
    You are a plan document reviewer. Verify this plan is complete and ready for implementation.
    **Plan to review:** [PLAN_FILE_PATH]
    **Spec for reference:** [SPEC_FILE_PATH]
    ## What to Check
    | Category | What to Look For |
    |----------|------------------|
    | Completeness | TODOs, placeholders, incomplete tasks, missing steps |
    | Spec Alignment | Plan covers spec requirements, no major scope creep |
    | Task Decomposition | Tasks have clear boundaries, steps are actionable |
    | Buildability | Could an engineer follow this plan without getting stuck? |
    ## Calibration
    **Only flag issues that would cause real problems during implementation.**
    An implementer building the wrong thing or getting stuck is an issue.
    Minor wording, stylistic preferences, and "nice to have" suggestions are not.
    Approve unless there are serious gaps — missing requirements from the spec,
    contradictory steps, placeholder content, or tasks so vague they can't be acted on.
    ## Output Format
    ## Plan Review
    **Status:** Approved | Issues Found
    **Issues (if any):**
    - [Task X, Step Y]: [specific issue] - [why it matters for implementation]
    **Recommendations (advisory, do not block approval):**
    - [suggestions for improvement]
 ```
 **Reviewer returns:** Status, Issues (if any), Recommendations