Bound aggregate self-heal retry chain for ANY task (not just wiki-compile)
Forward hardening surfaced during MC-4603. The 5326s overrun root cause was the self-heal CHAIN: attempt-1 fail → claude diagnose → rerun, all timed from one start; each execute...
StateDoneNext ActionClosedOwnerLuciRuntimeClosedAge11d ago
Ticket is done; runtime is closed.·profile claude_opus_1m_medium
Description
MC-4606
Forward hardening surfaced during MC-4603. The 5326s overrun root cause was the self-heal CHAIN: attempt-1 fail → claude diagnose → rerun, all timed from one start; each execute_command is bounded by its 3600s timeout but the AGGREGATE retry/heal chain is not. MC-4603 fixed this for wiki-project-compile specifically (self_heal:false + detach). But any OTHER task with self_heal:true + a long timeout can chain-overrun the same way and (if foreground) starve the serial scheduler.
Fix: in scheduler.py, bound the total self-heal chain wall-clock per task (e.g. chain budget = timeout * (1 + max_heal_attempts), or a single absolute deadline shared across attempt+diagnose+rerun). Add a regression that simulates a 3-attempt heal chain and asserts it's killed at the aggregate budget, not 3x the per-call timeout.
Not urgent — wiki-compile (the only task that hit this) is already detached. This generalizes the guard.
Activity
done
INTERACTIVE
Luci is working...
Details —
Done
· Low
· Luci
▼
SState
Done
Closed
PPeople
TTiming / Details▼
api (human)
infra
11d ago
6d ago
Advanced / Operator evidence
RRouting owner
ROperator console
Ticket is done; runtime is closed.MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
WWorkflow
Start Dev Review + QA ▾
Select phases to include:
Agents
Review Gates
Decision
WAT routing: choose an agent, review gate, or decision. Buttons use the live runtime when one is attached.
MC-4606 implementation evidence:
- Repo: /home/lucienne/workspace (owning scheduler repo)
- Branch: kanban/mc-4606-self-heal-budget (pushed to origin)
- Commit: 375ac735 MC-4606 bound scheduler self-heal budget
- Changed files: scheduler.py; tests/test_scheduler_regressions.py
- Behavior: foreground scheduled task runs now get one aggregate monotonic deadline from task timeout; execute_command and self_heal consume the remaining budget, and run_task checks the deadline before retry/heal/rerun chain steps so the chain stops at the aggregate budget instead of multiplying per-call timeouts.
- Regression: SchedulerRuntimeFailureTests.test_self_heal_retry_chain_respects_aggregate_timeout simulates initial fail + retry fail + heal attempt crossing a 10s budget and asserts the rerun/second heal are skipped and the run records timeout.
- Tests: PYTHONPATH=. python3 -m unittest tests.test_scheduler_regressions -v -> OK (22 tests).
- No cron changes.
luci-board-manager6d ago
Controller action: routed this Luci-owned technical follow-up through Hermes Kanban mc-internal as t_d949d2b3 (codexbuilder). MC remains visibility-only; no MC runtime/pickup was started. The Kanban worker must append commit/test evidence or a blocker here before this MC ticket is closed.
luci-board-manager6d ago
Controller recovery: the first Kanban dispatch for t_d949d2b3 spawned in the live mission-control checkout, so I reclaimed it before accepting work and retargeted the same Kanban card to isolated worktree /home/lucienne/workspace/state/kanban-worktrees/mc-4606-self-heal-budget. Redispatch will run from that worktree; MC remains visibility-only.
luci-board-manager6d ago
Controller gate complete (visibility-only): landed MC-4606 without using MC pickup/runtime. Evidence: worker branch kanban/mc-4606-self-heal-budget commit 375ac735 independently reviewed PASS; because the live top-level workspace had unrelated dirty state, I landed via a clean detached worktree from origin/master and pushed cherry-pick 97ba732151c6b406b271b5fad13b78d07be4df74 to conrelma/luci-workspace master. Verification in the clean landing worktree: `PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. python3 -m unittest tests.test_scheduler_regressions -v` -> 22 tests OK. The scheduler self-heal/retry chain now consumes one aggregate task timeout budget instead of multiplying per-command/heal timeouts. No cron jobs were changed.