Luci
Task config fine — enabled, schedule correct. Problem is scheduler-side. Gather state.
Luci
Found it. Not a cron-watchdog problem — **scheduler-wide stall**. Whole scheduler dead ~07:01→08:51 (1h50m gap). Every task missed in that window, cron-watchdog included. Scheduler resumed 08:51, healthy now. Need root cause of the stall.
Luci
Gap confirmed: zero task_runs 07:01:30→08:51:00. Journal access blocked (user not in `adm` group). Check reboot/OOM/load + real log format.
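A gap check like the one above can be sketched as follows. This is a minimal illustration, assuming run start times are available as ISO-8601 strings; `find_gaps`, the 5-minute threshold, and the sample rows are hypothetical, not the real `task_runs` schema or data.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(minutes=5)):
    """Return (start, end, duration) for consecutive runs further apart than threshold."""
    ordered = sorted(datetime.fromisoformat(t) for t in timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps

# Illustrative rows only: a per-minute cadence with one hole in it.
runs = [
    "2026-05-30T06:59:30",
    "2026-05-30T07:00:30",
    "2026-05-30T07:01:30",
    "2026-05-30T08:51:00",
    "2026-05-30T08:52:00",
]
gaps = find_gaps(runs)
for start, end, duration in gaps:
    print(f"{start.time()} -> {end.time()} ({duration})")
```

With a per-minute scheduler, any gap well above one minute is worth a look; the threshold is a judgment call.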
Luci
Root cause nailed. Full chain:
- **No reboot** (up 8 days). System-level **OOM killer** fired — multiple python3 procs at ~14-15GB RSS each killed (`global_oom`). Memory exhaustion this morning (the vault embedding jobs + heavy tasks).
- OOM left **zombie/hung scheduler tick python3 procs** (PIDs 1170849, 1199823, 1214182) in the service cgroup holding `/tmp/scheduler.lock`.
- `ExecStart` = `flock -n /tmp/scheduler.lock python3 scheduler.py tick`. With lock held by a wedged proc, every per-minute `systemctl start` → `flock -n` fails → `status=1/FAILURE`. Repeated 07:01→08:51.
- Around 08:51 leftover procs cleared → scheduler resumed, ran full tick 08:52, healthy now (NRestarts=0, Result=success).
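The lock behaviour in the chain above can be reproduced in a few lines. This is a sketch of non-blocking `flock` semantics using Python's `fcntl` module (Linux/Unix only), not the scheduler's own code; `try_lock` and the temp lock path are stand-ins.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Mimic `flock -n`: return an fd holding the lock, or None if the lock is busy."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

# Stand-in for /tmp/scheduler.lock so the demo does not touch the real lock.
lock_path = os.path.join(tempfile.mkdtemp(), "scheduler.lock")

holder = try_lock(lock_path)    # the tick that never exits
blocked = try_lock(lock_path)   # next minute's tick: lock busy, gives up at once
os.close(holder)                # holder goes away, lock is released
after = try_lock(lock_path)     # ticks can acquire again

print(holder is not None, blocked is None, after is not None)
```

The key property: the non-blocking acquirer never waits, so as long as one holder stays alive, every later attempt fails immediately.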
cron-watchdog NOT broken — victim of scheduler-wide stall. Enabled ✓, schedule ✓. No lock/prereq specific to it.
Verify current state + check if zombie procs still linger.
Luci
Current state healthy: zombie PIDs all **DEAD**, lock held only by the live healthy tick (1302646/1302647), scheduler ticking. Mem tight (166Mi free, 8.5GB avail, 4.3GB swap used) but functional. Self-recovered.
The durable fragility: `ExecStart=flock -n /tmp/scheduler.lock python3 scheduler.py tick` has **no timeout**. A wedged tick holds the flock forever → blocks every future tick until OOM/manual kill. Exactly what gave the ~110min stall. Need to know task exec model before picking a safe timeout ceiling.
Luci
Major correction. Tasks run **INLINE** (`proc.communicate(timeout=task["timeout"])`) — one tick holds the flock for the entire task duration. Longest runs: **padel-tournament 7886s, investment-weekly-digest 6587s (~1h50m)**. Max task timeout 7200s.
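For reference, a minimal sketch of what an inline run with `proc.communicate(timeout=...)` implies. `run_inline` is a hypothetical helper, not the scheduler's function; the point is only that the caller blocks until the child exits or the timeout fires.

```python
import subprocess
import sys

def run_inline(cmd, timeout):
    """Run cmd and block until it exits or the timeout elapses."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, _err = proc.communicate(timeout=timeout)
        return "finished", proc.returncode, out
    except subprocess.TimeoutExpired:
        proc.kill()          # communicate() does not kill the child on timeout
        proc.communicate()   # reap the killed child
        return "timeout", None, ""

quick = run_inline([sys.executable, "-c", "print('done')"], timeout=10)
slow = run_inline([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
print(quick[0], slow[0])
```

With `timeout=7200`, the caller (and any lock it holds) can be tied up for up to two hours per task.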
The gap was 07:01→08:51 = **~110min ≈ investment-weekly-digest's 6587s run**. And context shows investment-weekly ran this morning (~07:03, NotebookLM 45min audio). Hypothesis flip: **a single long inline task held the global scheduler flock for ~1h50m, blocking every other task** including cron-watchdog. OOM was collateral from the heavy NotebookLM/embedding load. There's a `background` flag (line 1697). Verify.
Luci
**CONFIRMED — smoking gun:**
```
investment-weekly-digest | 2026-05-30T07:01:15 → 08:51:02 | 6587s (1h49m47s)
```
Exactly matches the gap (07:01→08:51). Mechanism proven:
- investment-weekly-digest = `background: false` (not in background list), `timeout: 7200`, runs **inline** via `proc.communicate(timeout=7200)`.
- Tick that picked it up at 07:01:15 held the **global `/tmp/scheduler.lock`** for the whole 1h50m NotebookLM render.
- Every per-minute `flock -n` start → lock busy → exit 1 → no other task ran.
- 08:51:02 it finished → lock released → ticks resumed 08:52.
- OOM = collateral from the same heavy task (NotebookLM/Chrome + embedding spikes), not the blocker.
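Arithmetic check on the run record above, as a small sketch. It assumes the end time `08:51:02` falls on the same date as the start.

```python
from datetime import datetime

start = datetime.fromisoformat("2026-05-30T07:01:15")
end = datetime.fromisoformat("2026-05-30T08:51:02")  # same-day end time assumed

seconds = int((end - start).total_seconds())
hours, rem = divmod(seconds, 3600)
minutes, secs = divmod(rem, 60)

print(seconds, f"{hours}h{minutes:02d}m{secs:02d}s")  # 6587 1h49m47s
```

The computed duration agrees with the logged 6587s, so the run window and the stall window line up.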
**Root cause = architectural:** a long inline task holds the single global flock and starves ALL scheduled tasks for its full duration. The two longest tasks (investment-weekly 6587s, padel-tournament 7886s) are NOT marked `background:true`, unlike the other 6 long tasks. Fix = mark them background so the tick detaches and releases the lock. Must verify the background code path actually detaches before applying.
Luci
Architecture fully clear. The global serialization is systemd's `flock -n /tmp/scheduler.lock` wrapping the **whole tick process**, which runs all due tasks **sequentially inline**. A non-background task that blocks for 1h50m (investment-weekly's `proc.communicate(timeout=7200)`) keeps the tick process alive → every next-minute `flock -n` fails → total starvation.
The `background: true` path is the existing, purpose-built fix: detaches via `launch_background_task()`, returns immediately, tick exits in seconds, global flock released. Per-task lock + `has_live_background_run` guard prevent double-launch. Proven culprit = **investment-weekly-digest** (not background, 7200s timeout, 6587s actual). Fix = mark it background. Verify `launch_background_task` handles `run_as: shell` first.
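What "detaches and returns immediately" could look like, as a hedged sketch. This is not the real `launch_background_task()`, whose body is not shown here; `launch_detached` is a hypothetical stand-in using only standard `subprocess` options.

```python
import subprocess
import sys
import time

def launch_detached(cmd):
    """Start cmd in its own session and return without waiting for it."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child gets its own session, not tied to the caller's
    )

t0 = time.monotonic()
child = launch_detached([sys.executable, "-c", "import time; time.sleep(2)"])
launch_seconds = time.monotonic() - t0
still_running = child.poll() is None  # child is alive while the caller moves on

print(f"launch took {launch_seconds:.3f}s, child still running: {still_running}")

child.terminate()  # demo cleanup only; a real background task would be left to finish
child.wait()
```

`Popen` defaults to `close_fds=True`, so the child does not inherit extra descriptors such as a lock fd. Whether the caller's lock is actually released still depends on the caller exiting, which is why the real code path needs verifying.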
Luci
`launch_background_task` runs the command via bash — handles `run_as: shell` fine (existing bg tasks radio-daily-brief/ceo-audio-briefing are shell too). investment-weekly already has `retry: false` so background path's "no retry" loses nothing; `notify_on: failure` preserved via reconcile. Fix confirmed correct + surgical (one-line frontmatter). Invoking dev-loop per mandatory gate.
Luci
Tier 1 confirmed. Apply one-line frontmatter change. Read full task file first.
Luci
Change applied, no wiki contradiction. Smoke test: confirm scheduler's `load_tasks()` now reads `background: true` for the task.
Luci
Smoke passes — `background=True` parsed, joins the other 6 bg tasks. Commit + push (pool-0 worktree, must persist before DONE).
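A smoke check of this kind might look roughly like the sketch below. It assumes flat `key: value` frontmatter between `---` lines; `read_frontmatter`, `is_background`, and the `name` field are hypothetical, and the real `load_tasks()` may parse differently.

```python
def read_frontmatter(text):
    """Parse flat `key: value` pairs between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def is_background(meta):
    """Treat a missing flag as inline (not background)."""
    return meta.get("background", "false").lower() == "true"

# Illustrative task file; field names other than those quoted in the log are assumed.
sample = """---
name: investment-weekly-digest
timeout: 7200
retry: false
background: true
---
task body
"""
meta = read_frontmatter(sample)
print(is_background(meta), meta["timeout"])
```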
Luci
`nothing to commit` — my Edit hit the **main checkout** (`/home/lucienne/workspace/tasks/...`, absolute path) not the pool-0 worktree (CWD). Good news: that's the **live** file the scheduler reads — fix is active now. But it's uncommitted in the main tree. Check + commit there.
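A guard against this kind of mix-up could be sketched as below. Both paths are hypothetical placeholders, not the real checkout or worktree locations, and `Path.is_relative_to` needs Python 3.9+.

```python
from pathlib import PurePosixPath

def inside_worktree(target, worktree):
    """True if target lies under worktree (pure path comparison, no filesystem access)."""
    return PurePosixPath(target).is_relative_to(PurePosixPath(worktree))

# Hypothetical locations for illustration only.
worktree = "/repo/worktrees/pool-0"
edited_main = "/repo/main/tasks/example.md"
edited_pool = "/repo/worktrees/pool-0/tasks/example.md"

print(inside_worktree(edited_main, worktree))  # False: edit landed outside the worktree
print(inside_worktree(edited_pool, worktree))  # True
```

Running such a check before an edit makes it obvious which tree a later `git commit` has to happen in.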
Luci
Main checkout on master, one-line change present and **live** (scheduler reads this file). Commit + push on master so it persists and syncs.
Luci
Committed on master **6e3b42c2**, pushed. Fix live + persisted.
Let me load the report tool and close out.