MC-4683

Pool slot release leaves git worktree on stale branch → claim() wedges → board dispatch outage
2026-06-13 08:50:02 SAST
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 9d ago
Ticket is done; runtime is closed. · cwd /home/lucienne/workspace/state/control-room-worktrees/mc-4683-pool-slot-release-leaves-git-worktree-on-3a26e2 · uptime 9d 0h · last activity 9d 0h ago

Description

OUTAGE 2026-06-03 ~21:00-21:25: ticket-pickup timed out (180s) every cycle and dispatched nothing → board-wide dispatch outage (CRITICAL fired repeatedly).

ROOT CAUSE: worktree-pool slots can desync between their JSON state and the git worktree. .pool-state/slot-N.json said status=free for all 3 slots, but the git worktrees pool-1/pool-2 were still checked out on stale slot branches (slot1/mc-MC-4634, slot2/mc-MC-4635) from prior claims; only pool-0 was in the correct clean detached-HEAD state. worktree_pool.claim() accepts a slot only when git matches the clean/free state, so it REJECTED pool-1/pool-2, found effectively 1 free slot for 5 todo tickets, and blocked in its claim wait (default timeout=600s, MC-4141 single-flight). That wait is far past the scheduler's 180s task cap, so the task was killed with 0 tickets dispatched. This recurred every cycle.

IMMEDIATE RECOVERY (done, manually): cleared 3 stale .luci-janitor.lock files (10-day-old, empty, from luci_ticket_auditor.py) and [command not preserved; the ticket text contains stray git status output at this point] on pool-1/pool-2 to match pool-0.
Dispatch resumed immediately (MC-4642→in_progress, MC-4633 claimed pool-0).

FIXES NEEDED (council):
1. release()/reset path MUST leave the worktree in clean detached-HEAD on origin/master (same state pool-0 was in), NOT on the stale slot branch. The desync is the release path not fully resetting git. (worktree_pool.py release ~L794 / _reset_slot ~L46.)
2. claim() should self-heal a state=free-but-git-dirty slot by resetting it, instead of rejecting it and blocking for the full 600s.
3. claim() blocking timeout (currently 600s) must be shorter than the scheduler task timeout (180s), so a genuinely full pool fails fast and retries next cycle instead of being SIGKILLed mid-claim (SIGKILL risks leaving locks/half-claims).
4. pickup should dispatch what fits the free slots and leave the rest for the next cycle (non-blocking), not block trying to place more tickets than slots in one 180s run.
5. Consider bumping the tasks/ticket-pickup.md timeout 180→300 as an interim measure, but #1-#4 are the real fix.

Belongs with the dispatch-hardening cluster (MC-4534/4668/4681/4631). This is the most severe of them: it takes the whole board down, not just one ticket.
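
Fixes 1 and 2 could look roughly like the sketch below. It is illustrative only, not the actual worktree_pool.py code: the helper names (is_clean_detached, slot_matches_free_state, reset_slot, heal_if_desynced) are hypothetical, and only the git subcommands are real. It assumes "clean/free" means a detached HEAD with no pending changes, and it does not verify that HEAD actually points at origin/master.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def _git(worktree: Path, *args: str) -> subprocess.CompletedProcess:
    # Run git inside the given worktree and capture its output.
    return subprocess.run(
        ["git", "-C", str(worktree), *args],
        capture_output=True,
        text=True,
    )


def is_clean_detached(symbolic_ref_rc: int, porcelain: str) -> bool:
    # `git symbolic-ref -q HEAD` exits non-zero when HEAD is detached;
    # `git status --porcelain` prints nothing when the tree is clean.
    return symbolic_ref_rc != 0 and porcelain.strip() == ""


def slot_matches_free_state(worktree: Path) -> bool:
    ref = _git(worktree, "symbolic-ref", "-q", "HEAD")
    status = _git(worktree, "status", "--porcelain")
    # A failing `git status` (e.g. not a worktree) never counts as free.
    return status.returncode == 0 and is_clean_detached(
        ref.returncode, status.stdout
    )


def reset_slot(worktree: Path, base: str = "origin/master") -> None:
    # Fix 1 (assumed shape): detach onto the base ref and drop leftovers.
    # WARNING: this discards uncommitted work in the slot.
    for args in (("checkout", "--force", "--detach", base), ("clean", "-fd")):
        result = _git(worktree, *args)
        if result.returncode != 0:
            raise RuntimeError(
                f"git {' '.join(args)} failed: {result.stderr.strip()}"
            )


def heal_if_desynced(json_status: str, worktree: Path) -> bool:
    # Fix 2 (assumed shape): a slot whose JSON says "free" but whose git
    # state disagrees is reset instead of rejected. Returns True if the
    # slot is usable after this call.
    if json_status != "free":
        return False
    if not slot_matches_free_state(worktree):
        reset_slot(worktree)
    return slot_matches_free_state(worktree)
```

With something like this in place, a slot in the state pool-1/pool-2 were found in (JSON free, git on a stale slot branch) would be reset and claimed rather than skipped.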
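
Fixes 3 and 4 are mostly arithmetic and bookkeeping, sketched below under stated assumptions. The function names (claim_timeout, plan_dispatch), the 30s safety margin, and the ticket labels in the test are all hypothetical; the only figures taken from this ticket are the 180s scheduler cap and the "1 free slot for 5 todo tickets" scenario.

```python
from __future__ import annotations


def claim_timeout(task_cap_s: float, margin_s: float = 30.0) -> float:
    # Fix 3 (assumed shape): keep the claim wait strictly inside the
    # scheduler's task cap, leaving a margin for cleanup before the kill.
    return max(0.0, task_cap_s - margin_s)


def plan_dispatch(
    todo: list[str], free_slots: list[str]
) -> tuple[list[tuple[str, str]], list[str]]:
    # Fix 4 (assumed shape): pair tickets with free slots in order and
    # defer everything that does not fit to the next cycle, never blocking.
    placed = list(zip(todo, free_slots))
    deferred = todo[len(placed):]
    return placed, deferred
```

Under these assumptions a 180s cap gives a 150s claim wait, and a run with five todo tickets and one free slot places one ticket and defers four, instead of waiting for slots that will not free up within the cycle.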
