Ticket is done; runtime is closed. · cwd /home/lucienne/workspace/state/control-room-worktrees/mc-4683-pool-slot-release-leaves-git-worktree-on-3a26e2 · uptime 9d 0h · last activity 9d 0h ago
Description
MC-4683
OUTAGE 2026-06-03 ~21:00-21:25: ticket-pickup timed out (180s) every cycle, dispatching nothing → board-wide dispatch outage (CRITICAL fired repeatedly).
ROOT CAUSE: worktree-pool slots can desync between JSON state and git worktree. The .pool-state/slot-N.json said status=free for all 3 slots, but the git worktrees pool-1/pool-2 were still checked out on stale slot branches (slot1/mc-MC-4634, slot2/mc-MC-4635) from prior claims — only pool-0 was in the correct clean detached-HEAD state. worktree_pool.claim() accepts a slot only when git matches the clean/free state, so it REJECTED pool-1/pool-2, found effectively 1 free slot for 5 todo tickets, and blocked in its claim wait (default timeout=600s, MC-4141 single-flight) — far past the scheduler's 180s task cap → killed → 0 dispatched. Recurred every cycle.
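The mismatch described in the root cause can be illustrated with a small check. This is a minimal sketch, not the code in worktree_pool.py: `classify_slot` and its three inputs (the JSON status, the branch name git reports or None for detached HEAD, and the `git status --porcelain` output) are hypothetical names chosen for illustration, and the empty porcelain values for pool-1/pool-2 are an assumption.

```python
from typing import Optional


def classify_slot(json_status: str, branch: Optional[str], porcelain: str) -> str:
    """Compare a slot's JSON state with what git reports for its worktree.

    branch is None when the worktree is in detached-HEAD state.
    porcelain is the raw output of `git status --porcelain` (empty means clean).
    """
    if json_status != "free":
        return "busy"
    git_is_clean_detached = branch is None and porcelain.strip() == ""
    return "claimable" if git_is_clean_detached else "desynced"


# The three slots as described in the outage: all marked free in JSON,
# but only pool-0 was actually clean and detached.
slots = {
    "pool-0": classify_slot("free", None, ""),
    "pool-1": classify_slot("free", "slot1/mc-MC-4634", ""),
    "pool-2": classify_slot("free", "slot2/mc-MC-4635", ""),
}
claimable = [name for name, verdict in slots.items() if verdict == "claimable"]
```

Under these assumptions only pool-0 ends up claimable, which matches the "effectively 1 free slot for 5 todo tickets" situation above.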
IMMEDIATE RECOVERY (done, manually): cleared 3 stale .luci-janitor.lock files (10-day-old, empty, from luci_ticket_auditor.py) and reset the pool-1/pool-2 worktrees to clean detached origin/master to match pool-0. Dispatch resumed immediately (MC-4642→in_progress, MC-4633 claimed pool-0).
FIXES NEEDED (council):
1. release()/reset path MUST leave the worktree in clean detached-HEAD on origin/master (same state pool-0 was in) — NOT on the stale slot branch. The desync is the release path not fully resetting git. (worktree_pool.py release ~L794 / _reset_slot ~L46.)
2. claim() should self-heal a state=free-but-git-dirty slot by resetting it, instead of rejecting + blocking the full 600s.
3. claim() blocking timeout (currently 600s) must be reduced below the scheduler task timeout (180s) so a genuinely-full pool fails fast + retries next cycle instead of being SIGKILLed mid-claim (SIGKILL risks leaving locks/half-claims).
4. pickup should dispatch what fits the free slots and leave the rest for the next cycle (non-blocking), not block trying to place more tickets than slots in one 180s run.
5. Consider bumping tasks/ticket-pickup.md timeout 180→300 as interim, but #1-#4 are the real fix.
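Fix #4 (dispatch what fits, leave the rest) could look roughly like the sketch below. It is an illustration only: `plan_dispatch` and the `T-n` ticket ids are hypothetical and do not come from mc_pickup.py. The point is that the split is computed up front, so nothing waits on a slot that is not free.

```python
from typing import List, Sequence, Tuple


def plan_dispatch(todo: Sequence[str], free_slots: int) -> Tuple[List[str], List[str]]:
    """Split the todo queue into tickets to dispatch now and tickets to defer.

    Never blocks: tickets beyond the free-slot count simply wait for the
    next scheduler cycle.
    """
    if free_slots < 0:
        raise ValueError("free_slots must be non-negative")
    dispatch_now = list(todo[:free_slots])
    deferred = list(todo[free_slots:])
    return dispatch_now, deferred


# Outage shape: 5 todo tickets, effectively 1 claimable slot (hypothetical ids).
todo = ["T-1", "T-2", "T-3", "T-4", "T-5"]
dispatch_now, deferred = plan_dispatch(todo, free_slots=1)
```

With one claimable slot, one ticket is dispatched in this run and the other four are left for later cycles rather than holding the run open until the 180s cap.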
Belongs with the dispatch-hardening cluster (MC-4534/4668/4681/4631). This is the most severe — it takes the whole board down, not just one ticket.
Details: Done · Critical · Luci
State: Done, Closed
Timing / Details: api (human) · Mission Control · 9d ago · 9d ago
Operator console: direct_worker_done_recovered. MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here. Raw console: luci-controller · claude-code
Activity
Clean record (prior description was mangled by shell backticks in the CLI arg — ironically the exact bug MC-4655 fixes).
IMMEDIATE RECOVERY applied (board now dispatching): removed 3 stale empty .luci-janitor.lock files in the pool dirs, and reset pool-1 + pool-2 worktrees to clean detached origin/master (matching pool-0). Dispatch resumed: MC-4642 went in_progress, MC-4633 claimed pool-0.
DEEPER FINDINGS beyond the pool desync:
1. tasks/ticket-pickup.md is enabled:false (disabled 2026-05-30 to stop duplicate workers vs the Control Room watcher) YET the scheduler is still executing it and it timed out repeatedly. A disabled task is running — scheduler enabled-state bug OR a second invocation path.
2. FIVE task files invoke mc_pickup.py (ticket-pickup, needs-input-pickup [60s], pickup-watchdog, pool-health, mc-auto-review) — overlapping dispatchers all hitting the same pool/claim.
3. claim() default timeout=600s >> task timeout=180s, so a full pool blocks claim and the run is SIGKILLed mid-claim. That SIGKILL is what leaves the pool in the desynced state described in the root cause (the pool desync), making this a self-reinforcing loop.
4. The MC-4640 double-rebuild (two cr/ branches) is the duplicate-dispatch this disable was meant to prevent — evidence the tangle is actively causing dupes.
5. Control Room watcher only ALERTS, does not dispatch (MC-4631) — so the legacy pickup is the de-facto dispatcher despite being disabled.
This is a dispatch-layer cluster: MC-4683 (pool desync/outage) + MC-4631 (watcher only-alerts) + the disabled-task-still-running + overlapping pickup tasks. Needs ONE focused consolidation pass (single owner for dispatch, pool release leaves slots clean, claim timeout < task timeout, non-blocking dispatch-what-fits). NOT safe to keep patching live at night. Recommend daytime focused sprint with the MC-4534/4668/4681 hardening tickets.
api · 9d ago
Recurred this morning (pool-0 desynced again: state=free but worktree dirty on slot0/mc-MC-4633 leftover scratch from the now-closed ticket). Applied immediate relief: reset all 3 slots to clean detached origin/master. Board dispatching again. Manual resets will NOT hold (release-doesnt-clean recurs every cycle), so dispatched the durable fix as a controlled isolated-worktree build: release leaves slot clean-detached, claim self-heals desynced free slots, and claim timeout made strictly less than the 180s task timeout so pickup is never SIGKILLed mid-claim (the SIGKILL is what re-desyncs the slot — self-reinforcing loop). Build will push feat/mc-4683-pool-desync-fix for controller merge+restart.
api · 9d ago
DIAGNOSIS COMPLETE (build subagent), fix fully specified but NOT yet applied — the isolation:worktree harness is broken for code work (Write tool overlay is on a different filesystem than Bash, so edit-then-commit fails at every path; 8+ attempts, stopped per loop circuit-breaker; no live infra touched, feat/ branch empty).
ROOT-CAUSE TRIGGER confirmed: MC_POOL_CLAIM_TIMEOUT=30 is set in /etc/systemd/system/luci-dashboard.service.d/pool.conf and documented in wiki, but NOTHING READS IT. worktree_pool.claim() hardcodes timeout=600.0 (line 640) and mc_pickup.py passes 600.0 (line 5583). 600 > task timeout 180 => SIGKILL mid-claim => half-reset slot => desync.
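The gap described here (the variable is set, but nothing reads it) could be closed along the following lines. This is a hedged sketch rather than the actual worktree_pool.py change: `resolve_claim_timeout`, `DEFAULT_CLAIM_TIMEOUT`, and `TASK_TIMEOUT` are hypothetical names, and the 120s fallback is taken from the fix proposed in this ticket.

```python
import os
from typing import Mapping, Optional

DEFAULT_CLAIM_TIMEOUT = 120.0  # fallback proposed in this ticket
TASK_TIMEOUT = 180.0           # scheduler task cap cited in this ticket


def resolve_claim_timeout(
    explicit: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Pick the claim timeout: explicit argument, else env var, else default."""
    if explicit is not None:
        return float(explicit)
    source = os.environ if env is None else env
    return float(source.get("MC_POOL_CLAIM_TIMEOUT", str(DEFAULT_CLAIM_TIMEOUT)))


# With the value from pool.conf the claim gives up well inside the task cap;
# the old hardcoded 600 does not.
from_env = resolve_claim_timeout(env={"MC_POOL_CLAIM_TIMEOUT": "30"})
fallback = resolve_claim_timeout(env={})
hardcoded = resolve_claim_timeout(explicit=600.0)
```

Keeping the explicit argument as the highest-priority source means a caller that still passes 600.0 would override the environment, which is why the call site has to stop passing it as well.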
EXACT FIX (4 parts):
F3 (trigger): claim() default timeout=None, then read float(os.environ.get("MC_POOL_CLAIM_TIMEOUT", 120)); mc_pickup.py drops the hardcoded 600 so the env value applies.
F1 (release-clean): add _force_clean_detached(slot_path, base) helper (abort any rebase/cherry-pick/merge/revert/am/bisect, reset --hard HEAD, checkout --detach base, reset --hard base, clean -fdx with CLEAN_EXCLUDES) and call it in BOTH _force_free paths — the light path (worker_pid is None, ~line 987) currently SKIPS the reset, which is exactly how a status=free slot is left dirty/on-branch.
F2 (claim self-heal): in claim() after selecting a free slot, before _reset_slot (~line 684), call _force_clean_detached so a free slot is always claimable.
F4: tests/test_worktree_pool_desync_mc4683.py — (a) release leaves detached-HEAD at base, porcelain empty, no slotN branch; (b) dirty+leftover-branch free slot, claim succeeds; (c) all slots active + MC_POOL_CLAIM_TIMEOUT=2 => claim raises PoolTimeout in <task-timeout. Temp/mock pool only.
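The F1 sequence can be written out as an ordered list of git invocations. The sketch below only builds the argument lists and does not run them. `force_clean_detached_commands` and `EXAMPLE_CLEAN_EXCLUDES` are hypothetical: the real `_force_clean_detached` helper and the actual contents of CLEAN_EXCLUDES are not shown in this ticket.

```python
from typing import List, Sequence

# Hypothetical example value; the real CLEAN_EXCLUDES is not shown in this ticket.
EXAMPLE_CLEAN_EXCLUDES = (".pool-keep",)

# One abort per in-progress operation named in F1. These typically exit
# non-zero when no such operation is in progress, so a caller would run
# them without treating failure as an error.
ABORT_COMMANDS = [
    ["git", "rebase", "--abort"],
    ["git", "cherry-pick", "--abort"],
    ["git", "merge", "--abort"],
    ["git", "revert", "--abort"],
    ["git", "am", "--abort"],
    ["git", "bisect", "reset"],
]


def force_clean_detached_commands(
    base: str, clean_excludes: Sequence[str]
) -> List[List[str]]:
    """Return the git commands, in order, that leave a slot clean and detached at base."""
    clean = ["git", "clean", "-fdx"]
    for pattern in clean_excludes:
        clean += ["-e", pattern]  # -e keeps paths matching the pattern
    return [list(cmd) for cmd in ABORT_COMMANDS] + [
        ["git", "reset", "--hard", "HEAD"],     # drop tracked changes first
        ["git", "checkout", "--detach", base],  # leave the stale slot branch
        ["git", "reset", "--hard", base],       # match base exactly
        clean,                                  # remove untracked and ignored files
    ]


commands = force_clean_detached_commands("origin/master", EXAMPLE_CLEAN_EXCLUDES)
```

Each list would be run inside the slot's worktree directory, for example with `subprocess.run(cmd, cwd=slot_path)`. The order matters: the first hard reset clears local changes that would otherwise block the detaching checkout.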
Re-dispatching to apply (non-isolation, real worktree). Files: worktree_pool.py, mc_pickup.py, new test, plan doc.
luci-followup-sweep · 9d ago
[follow-up] 0 min past expected check-in (2026-06-04T06:32:27.656609+00:00 UTC).
api · 9d ago
Controller deploy + sign-off: LANDED, live, CLOSED. Merged feat/mc-4683-pool-desync-fix (7308d3e6) to master, restarted luci-dashboard. 54 tests pass. F3 confirmed live (worktree_pool.py:650 reads MC_POOL_CLAIM_TIMEOUT). Recurring ticket-pickup CRITICAL: RESOLVED — last timeout 08:35:50, none since. The fix: release/reclaim now leave slots clean-detached (F1, both _force_free paths incl the light path that skipped reset), claim() self-heals desynced free slots (F2), claim honors the env timeout so it fails fast under the 180s task cap instead of being SIGKILLed mid-claim (F3), and a council-caught concurrency fix split _force_free so git I/O runs unlocked (Q3 — prevents the fix re-wedging under the global lock). Note: close was briefly blocked by MC-4681 guard checking an abandoned cr/ attempt branch; deleted that stale ref and the guard correctly passed (work is on master). Two guard refinements noted for follow-up.
Controller decision
luci-controller · 9d ago
[control-room-recover] MC-4683: cleared false manual_safe_dispatch_required blocker caused by controller pool-claim / unsafe-main-checkout failure (reason: 'unsafe_main_checkout_runtime: pool claim timeout for MC-4683; refusing unsafe runtime cwd /home/lucienne/workspace/mission-control'). Requeued to todo; Control Room pickup now owns retry/dispatch. No human reply was pending.
luci-controller · 9d ago
[control-room-dispatch] Control Room dispatched MC-4683 to a Claude Code worker.
Worktree: /home/lucienne/workspace/state/control-room-worktrees/mc-4683-pool-slot-release-leaves-git-worktree-on-3a26e2
Branch: cr/mc-4683-pool-slot-release-leaves-git-worktree-on-3a26e2
tmux: cr-MC-4683
Expected check-in: 2026-06-04T06:32:27.656609+00:00
luci-controller · 9d ago
[controller-gate] Controller gate closed: MC-4683 pool/claim outage fix is live on origin/master (merge 91100688; fix commit 7308d3e6 ancestor of origin/master). Legacy pickup disabled; no live legacy mc_pickup remains.