Scheduler Overview

The scheduler (~/workspace/scheduler.py) is Luci's task execution engine. It runs every minute via cron, evaluates which tasks are due based on cron expressions, and executes them with locking, retry, self-healing, and failure escalation.

How It Works

Cron drives the tick. The system crontab calls python3 scheduler.py tick every minute. This is one of only two entries allowed in Luci's crontab (the other is the heartbeat).
Task definitions live in ~/workspace/tasks/ as markdown files with YAML frontmatter.
Each tick loads all task files, checks which are due (comparing each task's cron schedule against its last successful run in mc.db), and runs the due tasks sequentially.
All run history is recorded in mc.db (task_runs table) -- started_at, finished_at, status, output, duration.
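
The due check relies on the most recent successful run recorded in task_runs. A minimal sketch of that lookup follows. It assumes a task_id column, a status value of 'success', and ISO-8601 text timestamps. None of these are confirmed on this page, so the real schema in mc.db may differ.

```python
import sqlite3

def last_successful_run(conn, task_id):
    """Return started_at of the newest successful run, or None if there is none."""
    # Assumption: column name task_id and status value 'success' are illustrative.
    row = conn.execute(
        "SELECT MAX(started_at) FROM task_runs "
        "WHERE task_id = ? AND status = 'success'",
        (task_id,),
    ).fetchone()
    return row[0]

# Demo against an in-memory table shaped like the fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_runs (task_id TEXT, started_at TEXT, finished_at TEXT,"
    " status TEXT, output TEXT, duration REAL)"
)
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("backup", "2024-01-01T06:00:00", "2024-01-01T06:00:05", "success", "", 5.0),
        ("backup", "2024-01-02T06:00:00", "2024-01-02T06:00:09", "failed", "boom", 9.0),
    ],
)
```

ISO-8601 strings sort chronologically, which is why MAX() on a text column is enough here.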

Tick Cycle

Load all .md files from ~/workspace/tasks/
Parse YAML frontmatter; reject files with missing id or schedule
Duplicate ID check -- if two files share an ID, the scheduler exits immediately and sends a Telegram alert
For each enabled task:
Skip if locked (another instance still running)
Kill stale locks (lock age > timeout + 60s)
Check if due (cron expression vs last completed run)
If due, acquire lock and run
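
The "check if due" step can be sketched with the standard library alone. This is an illustration, not the scheduler's actual code. It treats a task as due when some scheduled minute falls after the last run and at or before now, it ANDs day-of-month with day-of-week (classic cron ORs them when both are restricted), and it uses naive datetimes rather than SAST-aware ones. The names cron_matches, is_due, and max_lookback are hypothetical.

```python
from datetime import datetime, timedelta

def _field_matches(field, value, lo, hi):
    """Match one cron field: '*', 'a', 'a-b', comma lists, and '/step'."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if "/" in part else start
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False

def cron_matches(expr, dt):
    """True if the five-field cron expression matches the minute of dt."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        _field_matches(minute, dt.minute, 0, 59)
        and _field_matches(hour, dt.hour, 0, 23)
        and _field_matches(dom, dt.day, 1, 31)
        and _field_matches(month, dt.month, 1, 12)
        and _field_matches(dow, cron_dow, 0, 6)
    )

def is_due(expr, last_run, now, max_lookback=timedelta(days=1)):
    """True if a scheduled minute lies in (last_run, now], within max_lookback."""
    now = now.replace(second=0, microsecond=0)
    start = now - max_lookback  # assumption: bound the catch-up window
    if last_run is not None:
        start = max(start, last_run.replace(second=0, microsecond=0))
    t = start + timedelta(minutes=1)
    while t <= now:
        if cron_matches(expr, t):
            return True
        t += timedelta(minutes=1)
    return False
```

With the schedule from the example task, `is_due("0 6 * * 1-5", last_run, now)` is true on a weekday at 06:00 if the previous run was before that minute.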

YAML Task Format

Every task file starts with a YAML frontmatter block:

---
id: example-task            # Unique identifier (duplicate = hard crash)
title: Human-readable name
schedule: "0 6 * * 1-5"    # Cron expression (evaluated in SAST)
timeout: 300                # Max seconds before kill (default: 600)
retry: true                 # Simple retry on first failure (default: false)
enabled: true               # false = skip entirely (default: true)
disabled_reason: by_choice  # why disabled: auto_suspended | retired | paused | by_choice (set when enabled: false)
self_heal: true             # Allow Claude to diagnose and fix (default: true)
notify_on: failure          # failure | success | always | never (default: failure)
notify_to: home             # notify.py destination key: dm|home|work|mc|life-manager|general (optional; injected as LUCI_NOTIFY_DEST env)
run_as: shell               # shell | claude | script
command: "python3 foo.py"   # Shell command to execute
tags: [infra, backup]       # Categorization tags
---

Markdown body with human-readable description of what the task does.
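
As a rough sketch of how such a file might be loaded: split the frontmatter from the body, reject definitions without id or schedule, apply the documented defaults, and detect duplicate IDs. The function names are hypothetical. The YAML text itself would still be passed to a YAML parser (for example PyYAML's yaml.safe_load), which is left out so the example runs with the standard library only.

```python
# Defaults taken from the comments in the frontmatter example above.
DEFAULTS = {
    "timeout": 600,
    "retry": False,
    "enabled": True,
    "self_heal": True,
    "notify_on": "failure",
}
REQUIRED = ("id", "schedule")

def split_frontmatter(text):
    """Split a task file into (yaml_text, body); raise ValueError if malformed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:]).strip()
    raise ValueError("missing closing '---'")

def normalize(meta):
    """Reject definitions without id or schedule, then fill in defaults."""
    missing = [key for key in REQUIRED if not meta.get(key)]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return {**DEFAULTS, **meta}

def find_duplicate_ids(tasks):
    """Return the sorted list of IDs that appear in more than one definition."""
    seen, dupes = set(), set()
    for task in tasks:
        if task["id"] in seen:
            dupes.add(task["id"])
        seen.add(task["id"])
    return sorted(dupes)
```

A non-empty result from find_duplicate_ids corresponds to the case where the scheduler exits immediately and alerts.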

Execution Model

Commands run via subprocess.Popen() with shell=True, /bin/bash as the shell, and start_new_session=True. The child gets its own process group, so a timeout kills the whole tree (bash + claude + grandchildren) with os.killpg, not just the direct bash child.
Environment includes an allow-listed refresh from ~/.claude/env/api_keys.env and ~/.bashrc on each run, plus ~/.npm-global/bin on PATH. The refresh is non-interactive (bash --noprofile --norc) so interactive aliases such as the Telegram-enabled claude alias cannot leak into scheduler jobs.
Default cwd is ~/workspace, unless a task sets explicit cwd or cwd_policy
Lock files at /tmp/luci-task-{id}.lock contain PID and start time (JSON)
Lock is acquired atomically with O_CREAT | O_EXCL to prevent races
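
The locking and timeout behaviour described above can be sketched as follows. This is a simplified illustration under assumptions, not the code in scheduler.py: the JSON keys pid and started_at, the use of SIGKILL, and the function names are all hypothetical.

```python
import json
import os
import signal
import subprocess
import tempfile
import time

def acquire_lock(path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        # Assumption: key names are illustrative.
        json.dump({"pid": os.getpid(), "started_at": time.time()}, fh)
    return True

def lock_is_stale(path, timeout, now=None):
    """A lock is stale once its age exceeds timeout + 60 seconds."""
    with open(path) as fh:
        started = json.load(fh)["started_at"]
    current = time.time() if now is None else now
    return current - started > timeout + 60

def run_with_timeout(command, timeout, cwd=None):
    """Run command under bash in its own session; kill the whole group on timeout."""
    proc = subprocess.Popen(
        command,
        shell=True,
        executable="/bin/bash",
        start_new_session=True,  # new session, hence a new process group
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just before the kill
        output, _ = proc.communicate()
        return None, output  # None marks a timeout in this sketch

# Demo: the second acquire fails because the file already exists.
lock_path = os.path.join(tempfile.mkdtemp(), "luci-task-example-task.lock")
first = acquire_lock(lock_path)
second = acquire_lock(lock_path)
code, output = run_with_timeout("echo hello", timeout=5)
timed_out_code, _ = run_with_timeout("sleep 30", timeout=1)
```

O_EXCL makes the existence check and the creation a single operation, so two ticks racing for the same task cannot both succeed.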

Claude Command Isolation

The persistent Luci/Telegram session is the only process allowed to use the Telegram-enabled Claude configuration. Scheduler-owned Claude calls are guarded automatically:

Bare claude, ${CLAUDE}, /usr/bin/env claude, and the standard ~/.local/bin/claude path are wrapped so they run with --settings ~/.claude/settings-worker.json.
TELEGRAM_BOT_TOKEN is cleared for those Claude task commands so they cannot start a second Telegram poller.
Scheduler provider selection follows only the scheduler provider state file. It does not silently inherit the persistent Luci provider.

If a task intentionally needs a different Claude configuration, make that explicit in the task definition and document why. Avoid sudo claude, remote ssh ... claude, or sh -c 'claude ...' in scheduler commands because those can bypass the bash function wrapper.
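
To illustrate two of the points above, the sketch below builds an environment without TELEGRAM_BOT_TOKEN and flags the command shapes this page warns about. Both helpers are hypothetical. The real guard is a bash function wrapper, and these regular expressions are only a rough approximation of "commands that bypass it".

```python
import os
import re

def claude_task_env(base=None):
    """Copy the environment and drop TELEGRAM_BOT_TOKEN for a Claude task."""
    env = dict(os.environ if base is None else base)
    env.pop("TELEGRAM_BOT_TOKEN", None)
    return env

# Command shapes named above as able to bypass the wrapper (illustrative patterns).
RISKY_PATTERNS = [
    re.compile(r"\bsudo\s+claude\b"),
    re.compile(r"\bssh\b.*\bclaude\b"),
    re.compile(r"\bsh\s+-c\s+['\"]\s*claude\b"),
]

def bypasses_wrapper(command):
    """True if the command matches one of the discouraged shapes."""
    return any(pattern.search(command) for pattern in RISKY_PATTERNS)
```

A check like this could run when task files are loaded, so that a risky command is reported before it ever executes.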

Working Directory Policy

The scheduler is machine-level infrastructure, so it defaults task commands to ~/workspace. Task definitions may override this in two ways:

Task setting	Result
`cwd: /some/path`	Run exactly from that path
`cwd_policy: pka` or `pka_repo`	Run from `~/workspace/PKA`
`cwd_policy: mission-control`, `mission_control`, or `mc`	Run from `~/workspace/mission-control`
No cwd setting	Run from `~/workspace`
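
The table above maps naturally onto a small resolver. In this sketch an explicit cwd wins over cwd_policy and an unknown policy raises an error. Both choices are assumptions, since this page does not say how the scheduler handles those cases.

```python
from pathlib import Path

WORKSPACE = Path.home() / "workspace"

# Policy aliases as listed in the table above.
CWD_POLICIES = {
    "pka": WORKSPACE / "PKA",
    "pka_repo": WORKSPACE / "PKA",
    "mission-control": WORKSPACE / "mission-control",
    "mission_control": WORKSPACE / "mission-control",
    "mc": WORKSPACE / "mission-control",
}

def resolve_cwd(task):
    """Return the working directory for a task definition (a dict)."""
    if task.get("cwd"):
        return Path(task["cwd"]).expanduser()  # assumption: explicit cwd wins
    policy = task.get("cwd_policy")
    if policy is None:
        return WORKSPACE
    if policy not in CWD_POLICIES:
        raise ValueError(f"unknown cwd_policy: {policy!r}")  # assumption
    return CWD_POLICIES[policy]
```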

Many legacy task commands still begin with an explicit cd ...; that remains valid and should be treated as the command's own local override. Ticket workers use a related project-based resolver documented in 02-mission-control/worker-system.

Comment-Driven Control

Before running a task, the scheduler checks Mission Control (02-mission-control/overview) for unread human comments on the task. Claude interprets the comments and returns one of:

SKIP -- do not run this cycle
MODIFY -- reload the task definition (Claude may have edited it)
INVESTIGATE -- run with extra attention
RUN -- proceed normally (default)

This allows Elmar to pause or adjust tasks by commenting on them in the MC dashboard.
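
One way to turn Claude's reply into a decision is to look for the first known keyword and fall back to RUN. The reply format is not specified on this page, so the parsing below is an assumption and the names are hypothetical.

```python
from enum import Enum

class Decision(Enum):
    SKIP = "SKIP"
    MODIFY = "MODIFY"
    INVESTIGATE = "INVESTIGATE"
    RUN = "RUN"

def parse_decision(reply):
    """Return the first decision keyword found in reply; default to RUN."""
    for token in reply.upper().split():
        word = token.strip(".,:;!\"'()")
        if word in Decision.__members__:
            return Decision[word]
    return Decision.RUN
```

Defaulting to RUN means an empty or unparseable reply never blocks a task by accident.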

Self-Healing

When a task fails, the scheduler follows an escalation ladder:

Attempt 1: Run the command
Attempt 2: Simple retry (if retry: true)
Attempt 3: Self-heal -- Claude diagnoses the error and edits the script/task
Attempt 4: Second self-heal attempt with updated error context
Suspension: Task is disabled (enabled: false), Telegram alert sent, MC ticket created
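
The ladder can be expressed as an ordered list of attempts derived from the task settings. This is a sketch with hypothetical step names. The attempt callback stands in for "run the command" and "let Claude heal, then re-run". Tasks with self_heal: false end in record_failure rather than suspend, because their suspension is governed by the consecutive-failure counter.

```python
def escalation_steps(task):
    """Return the ordered attempts for a task definition (a dict)."""
    steps = ["run"]
    if task.get("retry", False):        # default: false
        steps.append("retry")
    if task.get("self_heal", True):     # default: true
        steps += ["self_heal_1", "self_heal_2"]
    return steps

def run_ladder(task, attempt):
    """Call attempt(step) for each step until one succeeds.

    Returns the name of the step that succeeded, 'suspend' when every
    attempt failed on a self-healing task, or 'record_failure' otherwise.
    """
    for step in escalation_steps(task):
        if attempt(step):
            return step
    return "suspend" if task.get("self_heal", True) else "record_failure"
```

For example, `run_ladder({"retry": True}, attempt)` tries run, retry, self_heal_1, and self_heal_2 in that order before suspending.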

Self-Heal Guard Rails

Diff validation: Before re-running after a heal, the git diff is checked:
Only safe file extensions allowed (.py, .md, .yaml, .html, .json, etc.)
Max 150 lines changed
No dangerous patterns (subprocess, exec, eval, os.remove, etc.)
Probation: After a successful self-heal, the task enters probation. If it fails again while on probation, self-heal is skipped and Elmar is alerted immediately. Probation clears after 3 consecutive clean runs.
Per-task opt-out: Tasks can set self_heal: false to disable healing entirely
Audit log: Every heal attempt is logged to ~/workspace/logs/self-heal-audit.log
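
A minimal sketch of the diff validation, working on unified diff text. The extension set covers only the extensions named above (the real list is longer), and the pattern list is likewise partial. Counting both added and removed lines toward the limit, and scanning only added lines for dangerous patterns, are assumptions.

```python
import re
from pathlib import PurePosixPath

SAFE_EXTENSIONS = {".py", ".md", ".yaml", ".html", ".json"}  # partial list
MAX_CHANGED_LINES = 150
DANGEROUS = re.compile(r"\b(subprocess|exec|eval|os\.remove)\b")  # partial list

def validate_diff(diff_text):
    """Return a list of problems; an empty list means the diff passes."""
    problems = []
    unsafe_paths = set()
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++ ", "--- ")):
            # File header. A removed line whose content begins with "-- "
            # would be misread here; acceptable for a sketch.
            path = line[4:].strip()
            if path != "/dev/null" and PurePosixPath(path).suffix not in SAFE_EXTENSIONS:
                unsafe_paths.add(path)
        elif line.startswith(("+", "-")):
            changed += 1
            if line.startswith("+") and DANGEROUS.search(line[1:]):
                problems.append("dangerous pattern: " + line[1:].strip())
    problems += ["unsafe file type: " + p for p in sorted(unsafe_paths)]
    if changed > MAX_CHANGED_LINES:
        problems.append(f"too many changed lines: {changed}")
    return problems
```

If validate_diff returns anything, the heal would be rejected before the task is re-run.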

Consecutive Failure Handling

For tasks with self_heal: false, the scheduler tracks consecutive failures:

After MAX_CONSECUTIVE_FAILURES (3) consecutive failures, the task is suspended
suspend_task stamps disabled_reason: auto_suspended and disabled_at into the task frontmatter, so the tasks page can distinguish a failure-suspended task from one disabled on purpose
A Telegram alert is sent and an MC ticket is created
The failure count resets on any successful run
Disabling a task from the MC tasks page stamps disabled_reason: by_choice; re-enabling strips both keys
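
The counter logic can be sketched with one small file per task, mirroring the {id}.count naming used for the failure counters. Storing a plain integer and deleting the file on success are assumptions about the file format, and record_result is a hypothetical name.

```python
import tempfile
from pathlib import Path

MAX_CONSECUTIVE_FAILURES = 3

def record_result(count_dir, task_id, succeeded):
    """Update the task's counter file; return True when it should be suspended."""
    path = Path(count_dir) / f"{task_id}.count"
    if succeeded:
        path.unlink(missing_ok=True)  # any success resets the count
        return False
    count = int(path.read_text()) + 1 if path.exists() else 1
    path.write_text(str(count))
    return count >= MAX_CONSECUTIVE_FAILURES

# Demo: two failures, a success that resets the count, then three failures.
demo_dir = tempfile.mkdtemp()
outcomes = (False, False, True, False, False, False)
results = [record_result(demo_dir, "example-task", ok) for ok in outcomes]
```

Only the final call in the demo reports that the task should be suspended, because the success in the middle clears the earlier failures.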

Error Escalation

Event	Action
Task fails once	Retry (if enabled)
Retry fails	Self-heal attempt 1
Heal 1 fails	Self-heal attempt 2
Heal 2 fails	Suspend task, Telegram alert (force, bypasses quiet hours), MC ticket
Task timeout	Log as timeout, Telegram alert, MC ticket
Scheduler crash	Telegram alert (force), MC ticket, "all tasks paused" warning
Duplicate task ID	Hard exit with Telegram alert

Key Commands

python3 scheduler.py tick       # Cron calls this every minute
python3 scheduler.py run <id>   # Force-run a specific task (ignores schedule)
python3 scheduler.py list       # Show all tasks with schedule, enabled, last/next run
python3 scheduler.py history    # Show last 20 task runs from mc.db

File Locations

Path	Purpose
`~/workspace/scheduler.py`	Main scheduler code
`~/workspace/tasks/*.md`	Task definitions
`~/workspace/mission-control/mc.db`	Run history (task_runs table)
`/tmp/luci-task-{id}.lock`	Per-task lock files
`~/workspace/logs/self-heal-audit.log`	Heal attempt audit trail
`~/workspace/.heal-state.json`	Probation tracking state
`~/workspace/logs/fail-counts/{id}.count`	Consecutive failure counters
`~/workspace/prompts/self-heal.txt`	Prompt template for Claude self-heal
`~/workspace/prompts/check-comments.txt`	Prompt template for comment interpretation

Key Takeaways

The scheduler runs every minute via cron and evaluates task cron expressions against mc.db run history
Task definitions are YAML-frontmatter markdown files in ~/workspace/tasks/
Duplicate task IDs cause a hard crash by design -- this prevents silent conflicts
Self-healing uses Claude to diagnose and fix failures, with strict diff validation and a probation system
After all retry/heal attempts fail, tasks are auto-suspended and Elmar is alerted via Telegram (bypassing quiet hours)
Human comments on tasks in Mission Control can skip, modify, or investigate a task run
