runtime not set · chat:591:session · failed
Luci
flock -n on `/tmp/scheduler.lock` — if a tick holds lock long, every subsequent minute skips silently. 1h41m gap = something held lock. Also cgroup has orphan procs + 14G peak mem/4.5G swap = memory pressure. Find what ran long.
Luci
**Root cause found.** `wiki-project-compile` ran 11:02:56 → 12:31:42 = **5326s (1h28m)**. Scheduler runs tasks serially in one tick; tick wrapped in `flock -n /tmp/scheduler.lock`, so every minute-tick during that window couldn't get lock and skipped silently. systemd-watchdog (and all tasks after wiki-project-compile in queue) blocked. Watchdog "miss" = symptom, not cause. Check wiki-project-compile config — does it have a timeout?
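A minimal sketch of the skip behaviour described above, assuming the tick wrapper is equivalent to a non-blocking exclusive `flock`. The names `try_tick` and `demo` are hypothetical, and the lock file lives in a temp dir rather than `/tmp/scheduler.lock`. It only illustrates why a long-running tick makes every later tick return without doing anything.

```python
import fcntl
import os
import tempfile


def try_tick(lock_path, run_tasks):
    """Run one scheduler tick if the lock is free; return False if skipped."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Same semantics as `flock -n`: fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another tick holds the lock: skip silently
        run_tasks()  # tasks run serially while the lock is held
        return True
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def demo():
    lock_path = os.path.join(tempfile.mkdtemp(), "scheduler.lock")
    results = []

    def long_task():
        # While this tick still holds the lock, the "next minute" tick fires.
        results.append(try_tick(lock_path, lambda: None))

    results.append(try_tick(lock_path, long_task))
    results.append(try_tick(lock_path, lambda: None))  # lock is free again
    return results
```

Here `demo()` records the overlapping tick first (skipped), then the long tick, then a tick after release. Nothing in the skip path logs anything, which matches the "skips silently" observation.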
Luci
wiki-project-compile: schedule `0 11 * * 2` (Tuesday 11:00, today). timeout 3600s but ran **5326s**, overran by 28min. Two problems:

1. timeout 3600 NOT enforced (ran 88min).
2. Scheduler serial: one long task blocks whole tick + flock -n blocks next ticks → mass starvation behind it.

Check scheduler.py timeout enforcement + serial model.
Luci
Timeout IS wired (`communicate(timeout=task["timeout"])`, killpg on expiry). But run completed at 5326s > 3600s without timeout firing. Either it didn't take this path, or timeout value wasn't 3600. Check the actual run log + how shell tasks dispatch.
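For reference, the pattern named here (`communicate(timeout=...)` plus `killpg` on expiry) usually looks like the sketch below. This is not the code from scheduler.py, which the log does not show; `run_with_timeout` and its return shape are hypothetical. One general fact worth keeping in mind while checking the run log: `communicate(timeout=None)` waits indefinitely, so a task dict that yields `None` on some dispatch path would never raise.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv in its own process group; kill the whole group on timeout."""
    proc = subprocess.Popen(
        argv,
        start_new_session=True,  # child becomes leader of a new process group
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return {"timed_out": False, "returncode": proc.returncode, "output": out}
    except subprocess.TimeoutExpired:
        try:
            # pgid == pid because of start_new_session, so this also
            # reaches grandchildren that stayed in the group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already gone
        out, _ = proc.communicate()  # reap the child and drain the pipe
        return {"timed_out": True, "returncode": proc.returncode, "output": out}


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, 0.5)["timed_out"])
```

Killing the group rather than just the child matters for shell tasks, where the real work often runs in a grandchild process.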
Luci
execute_command uses `communicate(timeout=task["timeout"])` = 3600 for this task. But it ran 5326s and completed (no timeout raised). So either timeout wasn't enforced or soft-budget failed. Secondary though. Core issue: head-of-line blocking. Find the watchdog that generated this ticket.
Luci
Missed-fire detector (lines 315-348) flags a task when no task_run exists since prev_fire and the 15min grace has passed. It does NOT account for the scheduler being blocked by a long in-flight run (HOL blocking). That's the false-positive source. Verify: check sibling missed tickets + when this one was created.
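A rough sketch of the check as described, under the assumption that "missed" means the grace period after `prev_fire` has elapsed and no run has started since `prev_fire`. The function name, the argument names, the calendar date, and the 11:00 previous run are all placeholders; only the 12:00 fire, the 12:43:53 check, and the 12:43:56 run come from the log.

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=15)


def missed_fire(prev_fire, last_run_started, now, grace=GRACE):
    """True if a fire was due at prev_fire and no run has started since."""
    if now < prev_fire + grace:
        return False  # still inside the grace window
    return last_run_started is None or last_run_started < prev_fire


if __name__ == "__main__":
    day = datetime(2024, 1, 2)  # arbitrary Tuesday, not the real incident date
    fire = day.replace(hour=12)
    check = day.replace(hour=12, minute=43, second=53)
    last_run = day.replace(hour=11)  # hypothetical earlier run
    print(missed_fire(fire, last_run, check))
```

With these inputs the check flags the task even though it is about to run 3 seconds later, because nothing in the predicate knows the scheduler was blocked.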
Luci
Confirmed false-positive + self-recovery race: scheduler-watchdog itself was blocked (ran 12:43:53 catch-up), checked systemd-watchdog at 12:43:53, but systemd-watchdog ran 12:43:56, **3 seconds later** in same recovery tick. wiki-project-compile (11:02→12:31) straddled the 12:00 fire. Only 1 false ticket (4601) created.

Root cause chain:

1. wiki-project-compile (weekly Tue 11:00) overran 3600s timeout → ran 5326s (88min).
2. Scheduler is serial + `flock -n` → that one task held the lock, starving all tasks behind it 11:02→12:31.
3. Missed-fire watchdog has no straddle/recovery guard → flagged systemd-watchdog as "didn't fire" 3s before it actually ran in the catch-up tick.

Fix: add straddle/recovery guard to missed-fire check (mirrors existing recovery guard for execution failures). This is a code change → dev-loop.
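One way to express the straddle condition, as a sketch only: a run straddles a fire time if it started at or before the fire and was still going at that moment. The helper names and the `(started, finished)` tuple shape are assumptions, not the actual scheduler_watchdog.py code, and the date is a placeholder.

```python
from datetime import datetime


def straddles(run_started, run_finished, fire_time, now):
    """True if the run began at or before fire_time and was still going then."""
    # A run with no finish time is treated as still in flight as of `now`.
    end = run_finished if run_finished is not None else now
    return run_started <= fire_time < end


def blocked_by_long_run(runs, fire_time, now):
    """runs: iterable of (started, finished_or_None) across all tasks."""
    return any(straddles(s, f, fire_time, now) for s, f in runs)


if __name__ == "__main__":
    day = datetime(2024, 1, 2)  # arbitrary date for illustration
    wiki = (day.replace(hour=11, minute=2, second=56),
            day.replace(hour=12, minute=31, second=42))
    fire = day.replace(hour=12)
    now = day.replace(hour=12, minute=43, second=53)
    print(blocked_by_long_run([wiki], fire, now))
```

If the guard returns true for a missed fire, the watchdog would suppress the ticket, since the miss is explained by head-of-line blocking rather than by the task itself.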
Luci
Tier 1: single-file fix (scheduler_watchdog.py) — add straddle/recovery guard to missed-fire check + smoke test. I'm in worktree pool-0 on branch slot0/mc-MC-4601. Read the file head for constants/helpers.
Luci
Now add the straddle guard. Helper + call in the loop. Insert helper after `_task_added_ts`:
Luci
Now a smoke test exercising the guard (straddle → suppress, no-straddle → flag):
Luci
Run smoke test + syntax check + real-data validation against mc.db:
Luci
All pass; real mc.db confirms `wiki-project-compile` straddled 12:00 — guard would have suppressed ticket 4601. One hardening: an ancient zombie run with `finished_at NULL` (codebase notes stale 'running' rows persist for weeks) could suppress forever. Bound the straddle lookback to 6h (covers any legit long task, excludes zombies):
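The zombie bound can be sketched as one extra condition on the straddle test: ignore any run that started more than 6h before the fire time. Names and the constant are illustrative; the real helper is not shown in this log.

```python
from datetime import datetime, timedelta

MAX_STRADDLE_LOOKBACK = timedelta(hours=6)


def straddles_bounded(run_started, run_finished, fire_time, now,
                      lookback=MAX_STRADDLE_LOOKBACK):
    """Straddle test that ignores runs older than `lookback` before the fire."""
    if run_started < fire_time - lookback:
        # Too old to be a live blocker; likely a stale 'running' row.
        return False
    end = run_finished if run_finished is not None else now
    return run_started <= fire_time < end


if __name__ == "__main__":
    day = datetime(2024, 1, 2)  # arbitrary date for illustration
    fire = day.replace(hour=12)
    now = day.replace(hour=12, minute=43, second=53)
    zombie_start = fire - timedelta(weeks=3)
    print(straddles_bounded(zombie_start, None, fire, now))
```

Without the bound, a row with `finished_at NULL` from weeks ago would count as "still running" at every fire time and suppress all missed-fire tickets indefinitely.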
Luci
Add zombie-bound test + re-run all:
Luci
6/6 pass, real-data still detects blocker. Commit + push.
Luci
git add failed — my edits landed in main workspace (`/home/lucienne/workspace/scripts/...`), but the slot branch is checked out at pool-0 (cwd). Check both trees.
Luci
Edits landed in main workspace (master, with unrelated dirty state — not mine to commit). Correct path: apply the 2 files into pool-0 slot branch. Copy them over, commit + push from pool-0.
Luci
Pushed `fee1babd` on slot0/mc-MC-4601. Now revert my stray edits in main workspace (master) so they don't double-apply — the fix lives on the slot branch.
Luci
Main restored clean. Now open a follow-up ticket for the underlying defect (wiki-project-compile overran its 3600s timeout → 5326s, the actual starvation cause) — kept out of this fix's scope per surgical discipline, but shouldn't drop the thread.
Luci
Done. Report completion via mc-coord.