wiki-project-compile overran hard timeout: ran 5326s vs timeout 3600s
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 11d ago
Ticket is done; runtime is closed. · profile claude_opus_1m_medium · cwd /home/lucienne/workspace/.claude/worktrees/pool-0 · uptime 10d 21h · last activity 10d 19h ago
Description
MC-4603
On 2026-06-02, wiki-project-compile (weekly, Tue 11:00) started at 11:02:56 and finished at 12:31:42, a 5326s run. The frontmatter sets timeout: 3600 and the script has a 55-min soft-budget bail. BOTH failed to bound it. The scheduler runs tasks serially under a non-blocking flock, so this 88-min run starved every task queued behind it from 11:02 to 12:31 (systemd-watchdog and scheduler-watchdog were delayed to catch-up). That produced the false missed-fire ticket MC-4601, now fixed at the watchdog layer via _scheduler_blocked_across in scripts/scheduler_watchdog.py.
ROOT cause to fix here:
(1) why communicate(timeout=3600) in scheduler.execute_command did NOT kill the run at 3600s (it was logged as 'completed', not 'timeout');
(2) why the 55-min soft budget in wiki_project_compile.sh did not bail;
(3) consider running long weekly tasks detached so they don't serially starve hourly tasks.
See MC-4601.
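The run-length figures in the description can be checked with a few lines of arithmetic. This is only a sanity check on the timestamps quoted above; the variable names are illustrative and not taken from the scheduler.

```python
from datetime import datetime

# Timestamps and hard timeout copied from the ticket description.
started = datetime(2026, 6, 2, 11, 2, 56)
finished = datetime(2026, 6, 2, 12, 31, 42)
hard_timeout_s = 3600

elapsed_s = int((finished - started).total_seconds())
overrun_s = elapsed_s - hard_timeout_s

print(elapsed_s)        # 5326
print(elapsed_s // 60)  # 88 (whole minutes)
print(overrun_s)        # 1726 seconds past the hard timeout
```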
Details: Done · Medium · Luci (luci) · infra · 11d ago
Activity
Root-caused the 5326s overrun to self-heal CHAIN stacking in the serial foreground tick: attempt 1 failed → claude diagnose → rerun, all timed from one start. Each execute_command call was bounded by 3600s, but the aggregate retry/heal chain was not, which caused the 88-min head-of-line (HOL) starvation. Fixed 3 ways: (1) wiki-project-compile.md now sets background:true + self_heal:false (detached, no retry/heal stacking); (2) scheduler.kill_stale group-kills background locks via killpg SIGTERM→SIGKILL, so the stale-lock reaper bounds the whole process tree at timeout+60s (no orphaned claude; MC-3640 fix); (3) in wiki_project_compile.sh each per-page claude -p call is hard-capped with `timeout` (the soft budget is only re-checked between pages). Smoke test: kill_stale reaps a real detached process group; yaml/ast/load validations pass. Deployed: 380f6138 fast-forwarded to workspace/master, 78c4ecb pushed to conrelma/claude.
luci · 11d ago
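The chain stacking described in this comment comes down to per-call timeouts that are each honored while their sum is unbounded. A minimal sketch of one way to bound the whole chain is to clip every call to a shared deadline. `run_chain`, `per_call_timeout_s` and `total_budget_s` are hypothetical names for illustration, not the scheduler's actual API, and the deployed fix took a different route (detaching the task and disabling self-heal).

```python
import subprocess
import sys
import time


def run_chain(steps, per_call_timeout_s, total_budget_s):
    """Run commands in order, clipping each timeout to one shared deadline."""
    deadline = time.monotonic() + total_budget_s
    results = []
    for argv in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The aggregate budget is spent: do not start another attempt.
            results.append("skipped")
            continue
        try:
            # subprocess.run kills the child itself when the timeout expires.
            subprocess.run(argv, timeout=min(per_call_timeout_s, remaining))
            results.append("completed")
        except subprocess.TimeoutExpired:
            results.append("timeout")
    return results


# Stand-in for attempt -> diagnose -> rerun: three commands that each run long.
sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
results = run_chain([sleeper, sleeper, sleeper],
                    per_call_timeout_s=0.3, total_budget_s=0.5)
print(results)  # the third step is skipped once the 0.5 s budget is used up
```

Without the shared deadline, the same three calls could each use their full per-call timeout, which is the pattern that let three 3600s-bounded calls add up to 5326s.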
Workflow action `review_ready` applied. Work ready for review.
api · 11d ago
Controller sign-off: VERIFIED done. Reviewer verdict 'fail (2 of 3)' was a cross-repo blind spot — it saw only the workspace diff (380f6138) and missed the .claude-side commit (78c4ecb, conrelma/claude). All three root causes ARE fixed and live:
1. detach: tasks/wiki-project-compile.md background:true + self_heal:false (on master) — kills serial HOL starvation + retry-chain stacking. This is the correct reframe of ticket root cause #1: the 5326s was 3 sequential 3600s-bounded self-heal calls timed from one start, NOT a single communicate() ignoring its timeout. No self-heal chain → no aggregate overrun.
2. group-kill: scheduler.py kill_stale os.killpg SIGTERM→SIGKILL (lines 1950/1957) bounds the detached process tree at timeout+60s — no orphaned claude.
3. per-page hard cap (ticket root cause #2): wiki_project_compile.sh PER_PAGE_TIMEOUT_S=600 + 'timeout' wrapper (line 171) on each claude -p; soft budget re-checked between pages. On conrelma/claude master (78c4ecb).
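The group-kill in item 2 can be illustrated with a small POSIX-only sketch: signal the whole process group with SIGTERM, wait a grace period, then escalate to SIGKILL. `kill_group` and `grace_s` are hypothetical names. This is not the code in scheduler.py, and the stale-lock detection at timeout+60s is not reproduced here.

```python
import os
import signal
import subprocess
import sys
import time


def kill_group(proc, grace_s=5.0):
    """SIGTERM a detached child's process group, then SIGKILL any survivors."""
    pgid = proc.pid  # start_new_session=True makes the child its group leader
    deadline = time.monotonic() + grace_s
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.poll()  # group already gone
    while time.monotonic() < deadline:
        proc.poll()  # reap the leader so a zombie does not keep the group alive
        try:
            os.killpg(pgid, 0)  # signal 0 only probes whether the group exists
        except ProcessLookupError:
            break  # every member exited after SIGTERM
        time.sleep(0.05)
    else:
        # Grace period ran out with members still present: escalate.
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    return proc.wait()


# Demo: a detached long-running child, standing in for a background task.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
demo_returncode = kill_group(child, grace_s=2.0)
print(demo_returncode)  # negative signal number of whichever signal ended it
```

Signalling the group rather than the single PID is what reaches grandchildren such as a spawned `claude` process. Killing only the leader would leave them orphaned.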
Local master fast-forwarded to origin (380f6138). The 88-min HOL incident cannot recur. Closing.
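The interaction between the per-page hard cap and the soft budget in item 3 can be sketched as follows. The real fix lives in wiki_project_compile.sh and uses the `timeout` command; this is a Python analogue for illustration only. `compile_pages` and `build_argv` are hypothetical, and the two constants simply reuse the values quoted in the ticket.

```python
import subprocess
import sys
import time

PER_PAGE_TIMEOUT_S = 600  # per-page cap quoted in the ticket
SOFT_BUDGET_S = 55 * 60   # 55-minute soft budget quoted in the ticket


def compile_pages(pages, build_argv,
                  per_page_timeout_s=PER_PAGE_TIMEOUT_S,
                  soft_budget_s=SOFT_BUDGET_S):
    """Build pages one by one: hard cap per page, soft budget between pages."""
    start = time.monotonic()
    outcome = {}
    for page in pages:
        # The soft budget is only consulted here, between pages. It cannot
        # interrupt a page that is already running.
        if time.monotonic() - start >= soft_budget_s:
            outcome[page] = "skipped"
            continue
        try:
            done = subprocess.run(build_argv(page), timeout=per_page_timeout_s)
            outcome[page] = "ok" if done.returncode == 0 else f"exit {done.returncode}"
        except subprocess.TimeoutExpired:
            # The hard cap is what stops a single stuck page.
            outcome[page] = "timeout"
    return outcome


def demo_argv(page):
    """Hypothetical stand-in for the per-page command: 'slow' hangs."""
    code = "import time; time.sleep(30)" if page == "slow" else "pass"
    return [sys.executable, "-c", code]


outcome = compile_pages(["fast", "slow", "late"], demo_argv,
                        per_page_timeout_s=2.0, soft_budget_s=1.5)
print(outcome)
```

With the ticket's values, a page that starts just before the soft budget expires could still run for its full cap, so the worst case under this scheme is roughly 3300 + 600 = 3900s. An outer bound such as the group-kill reaper therefore still has a role.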