Luci's health monitoring, backup strategy, and operational maintenance systems.
Script: /home/lucienne/workspace/heartbeat.py
Schedule: Every 5 minutes via cron (one of only two allowed crontab entries)
Output: Writes /home/lucienne/workspace/PKA/luci-status.json for the [[02-mission-control/overview|PKA dashboard]]
- claude --version responds within 5 seconds. Failure = "degraded"
- localhost:3001/api/v1/tickets?limit=1 with bearer token. Failure = "critical"
- task_runs table in mc.db. 3+ failures in the last hour = "critical"
- /tmp/luci-task-*.lock files older than 30 minutes (prevents scheduler deadlocks)
- /home/lucienne/workspace/data/oauth-health-status.json (written by the hourly OAuth health check). Expired tokens = "degraded"

| State | Meaning | Action |
|---|---|---|
| healthy | All checks pass | None |
| degraded | Non-critical issue (disk low, CLI down, OAuth expired) | Logged, visible on dashboard |
| critical | Core service down or multiple failures | Telegram alert sent immediately |
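
The state table can be read as a severity ordering. The following is a minimal sketch, assuming each check reports one of the three states and the overall state is simply the worst of them. The real heartbeat.py may treat "multiple failures" differently, and the JSON keys `state` and `checks` are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable

# Severity order implied by the state table: healthy < degraded < critical.
SEVERITY = {"healthy": 0, "degraded": 1, "critical": 2}


def overall_state(check_states: Iterable[str]) -> str:
    """Return the worst state among the individual check results."""
    worst = "healthy"
    for state in check_states:
        if SEVERITY[state] > SEVERITY[worst]:
            worst = state
    return worst


def write_status(path: str, checks: Dict[str, str]) -> dict:
    """Write the status file atomically so a reader never sees a partial file."""
    # "state" and "checks" are hypothetical key names, not the real schema.
    status = {"state": overall_state(checks.values()), "checks": checks}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(status, handle, indent=2)
    os.replace(tmp_path, path)  # atomic rename within one filesystem
    return status
```

Writing to a temporary file and renaming it means a dashboard polling the file reads either the old or the new version, never a half-written one.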
- heartbeats table in mc.db (pruned after 7 days)
- luci-status.json to render the Luci health panel

Script: /home/lucienne/workspace/scripts/oauth_health_check.py
Schedule: Hourly via [[03-scheduler/overview|scheduler]]
Checks two token sources:
- Google Workspace (GWS) -- tests via a gws drive about get API call. Detects invalid_grant and invalid_client
- Microsoft 365 (M365) -- validates the ~/.graph-api-token.json refresh token, then attempts a token refresh via graph_api.py
On failure with needs_reauth: true, sends a Telegram alert with a link to the re-auth portal at http://100.118.207.3:8788. Scheduler-created MC auth tickets are assigned to Elmar in needs_input with the portal link in the description. Luci should not try to self-heal revoked OAuth refresh tokens.
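
The distinction that matters here is between failures a token refresh can fix and failures that need a human to re-authenticate. The following is a minimal sketch, assuming the check has the raw error text from the provider call. Only `needs_reauth` and the two error codes come from the description above; the other keys and the function name are hypothetical.

```python
# Error codes named above as detected for Google Workspace.
REAUTH_ERRORS = ("invalid_grant", "invalid_client")


def classify_oauth_failure(provider: str, error_text: str) -> dict:
    """Build a status entry; needs_reauth is set only for revoked or invalid credentials."""
    lowered = error_text.lower()
    matched = next((code for code in REAUTH_ERRORS if code in lowered), None)
    return {
        "provider": provider,        # hypothetical key
        "ok": False,                 # hypothetical key
        "error": matched or "unknown",
        "needs_reauth": matched is not None,
    }
```

A transient failure such as a timeout yields `needs_reauth: False`, so it would not trigger the portal alert in this sketch.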
Task: workspace-backup (daily at 02:00 UTC)
- Commits all workspace changes to github.com/conrelma/luci-workspace
- Covers: scripts/, tasks/, price-watch/, projects/smart-money, projects/padel-tournament, scheduler.py, heartbeat.py, luci-manifest.md, Vault/
- Excludes: .env files, databases, logs, repos with their own remotes (cowork, PKA, mission-control, f1-predictor)
Task: mc-db-backup (daily at 02:00 SAST)
- Uses SQLite online backup API for safe concurrent backup
- Keeps 7 daily rotating backups
- mc.db uses WAL mode for concurrent read/write safety
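
The online backup plus seven-day rotation can be sketched with Python's built-in `sqlite3.Connection.backup`. This is an illustration under assumptions, not the actual mc-db-backup task: the `mc-YYYY-MM-DD.db` file naming and the function name are hypothetical.

```python
import sqlite3
from datetime import date
from pathlib import Path
from typing import Optional


def backup_database(source: str, backup_dir: str, keep: int = 7,
                    today: Optional[date] = None) -> Path:
    """Copy a live SQLite database via the online backup API, then prune old copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    target = target_dir / f"mc-{stamp}.db"  # hypothetical naming scheme

    src = sqlite3.connect(source)
    dst = sqlite3.connect(str(target))
    try:
        # Produces a consistent snapshot even while other connections write.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # ISO dates sort lexicographically, so the oldest backups come first.
    backups = sorted(target_dir.glob("mc-*.db"))
    for stale in backups[:-keep]:
        stale.unlink()
    return target
```

Copying the file with `cp` while a writer is active can capture a torn database; the backup API avoids that, which is why it pairs well with WAL mode.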
Task: git-sync (every 15 minutes, currently disabled)
- Pulls latest from GitHub, commits local changes, pushes
- Excludes: vault.db, mc.db, .bak files, luci-status.json, .claude/worktrees/
- Uses dedicated SSH key (id_ed25519_pka)
- vault.db is owned by Lucienne -- Luci reads it via pull but never writes
Task: skills-sync -- pushes skill changes from ~/.claude-repo/skills/ to GitHub (conrelma/claude)
Scripts: ~/workspace/mission-control/luci_janitor.py + janitor_classifier.py
Schedule: Hourly via [[03-scheduler/overview|scheduler]] (luci-janitor task)
Since: 2026-05-02 — replaces stuck-ticket-detector (which only did timeout-based resets)
Classifies ALL non-terminal MC tickets and auto-recovers where possible. Uses janitor_classifier.py for pure classification + action planning (fully unit-tested).
| Classifier Verdict | Action |
|---|---|
| stale_in_progress | Kill worker process (SIGTERM→SIGKILL), reset to todo |
| stale_needs_input | Reset to todo if >24h with no reply |
| orphan_worker | Reset to todo, alert via Telegram |
| auto_close | Close with resolution comment (e.g. source email actioned, MC-2700) |
Learnings are logged to ~/workspace/janitor_learnings.jsonl for pattern analysis.
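
Keeping classification pure (no database or process access) is what makes it easy to unit-test. The following is a sketch of that shape, not the real janitor_classifier.py: the `Ticket` fields, the 2-hour in-progress threshold, and the definition of an orphan as "in progress with no live worker" are all assumptions. Only the 24-hour needs_input window and the verdict names come from the table. The auto_close verdict is omitted because its trigger depends on external context.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

NEEDS_INPUT_TIMEOUT = timedelta(hours=24)  # from the table above
IN_PROGRESS_TIMEOUT = timedelta(hours=2)   # hypothetical threshold


@dataclass(frozen=True)
class Ticket:
    status: str               # e.g. "in_progress", "needs_input", "todo"
    last_activity: datetime   # last worker heartbeat or user reply
    worker_alive: bool        # whether a worker process is still attached


def classify(ticket: Ticket, now: datetime) -> Optional[str]:
    """Pure function: the verdict depends only on the arguments, with no I/O."""
    idle = now - ticket.last_activity
    if ticket.status == "in_progress":
        if not ticket.worker_alive:
            return "orphan_worker"
        if idle > IN_PROGRESS_TIMEOUT:
            return "stale_in_progress"
    elif ticket.status == "needs_input" and idle > NEEDS_INPUT_TIMEOUT:
        return "stale_needs_input"
    return None


# Action plan per verdict, mirroring the table rows.
ACTIONS = {
    "stale_in_progress": ["kill_worker", "reset_to_todo"],
    "stale_needs_input": ["reset_to_todo"],
    "orphan_worker": ["reset_to_todo", "telegram_alert"],
}


def plan(verdict: Optional[str]) -> List[str]:
    """Return the ordered actions for a verdict, or nothing for healthy tickets."""
    return ACTIONS.get(verdict, [])
```

Because `now` is passed in rather than read from the clock, tests can pin time and assert exact verdicts.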
- janitor-morning-brief.md
- janitor-weekly-digest.md

Script: /home/lucienne/workspace/scripts/pickup_watchdog.py
Schedule: Periodic via scheduler
Two functions:
1. Re-enable suspended pickup tasks -- if ticket-pickup.md or needs-input-pickup.md have enabled: false, flips them back to enabled: true and sends a Telegram alert
2. Validate Claude auth -- runs claude -p "Say ok" to verify the CLI can authenticate. Sends a CRITICAL Telegram alert if auth is broken.
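
The first function above amounts to a small text rewrite on each task file. The following is a minimal sketch, assuming the flag appears on its own line as `enabled: false`; the function names are hypothetical, and the Telegram alert and the Claude auth check are left out.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Matches a line that is exactly "enabled: false" (spaces/tabs tolerated).
DISABLED = re.compile(r"^enabled:[ \t]*false[ \t]*$", re.MULTILINE)


def reenable(text: str) -> Tuple[str, bool]:
    """Flip the first 'enabled: false' line to 'enabled: true'."""
    new_text, count = DISABLED.subn("enabled: true", text, count=1)
    return new_text, count > 0


def reenable_files(paths: Iterable[str]) -> List[str]:
    """Rewrite any suspended task files and return the names that were changed."""
    changed = []
    for path in map(Path, paths):
        new_text, flipped = reenable(path.read_text())
        if flipped:
            path.write_text(new_text)
            changed.append(path.name)
    return changed
```

The returned list of changed file names is what a caller would include in the alert message.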
Script: /home/lucienne/workspace/scripts/api_error_tracker.py
Log: ~/workspace/logs/api-errors.jsonl
Tracks 529 (overloaded) errors and rate limits across all Claude worker sessions:
- Workers call api_error_tracker.py log when they detect 529/overload/rate-limit in output
- api_error_tracker.py summary returns counts for last 1h and 24h, grouped by error type
- Used by heartbeat/dashboard for visibility into API pressure
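
The summary step can be sketched as a windowed count over the JSONL log. This assumes each log line is a JSON object with an ISO 8601 timestamp under `ts` and an error label under `type`; both field names are hypothetical, as is the function name.

```python
import json
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable


def summarize(lines: Iterable[str], now: datetime) -> Dict[str, Dict[str, int]]:
    """Count logged errors per type for the last hour and the last 24 hours."""
    windows = {"1h": timedelta(hours=1), "24h": timedelta(hours=24)}
    counts = {label: Counter() for label in windows}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        age = now - datetime.fromisoformat(entry["ts"])
        for label, window in windows.items():
            if timedelta(0) <= age <= window:
                counts[label][entry["type"]] += 1
    return {label: dict(counter) for label, counter in counts.items()}
```

Timestamps in the log and the `now` argument must both be timezone-aware (or both naive), otherwise the subtraction raises a `TypeError`.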
Script: /home/lucienne/workspace/scripts/queue_reaper.py
Schedule: Every 15 minutes via [[03-scheduler/overview|scheduler]]
Prevents stuck queued_messages — user replies that workers failed to consume:
| Action | Trigger | Result |
|---|---|---|
| Expire | Message >30 min unclaimed | Marked expired, Telegram alert |
| Retry | Failed message, <3 attempts | Re-queued for pickup |
| Dead-letter | 3+ failed attempts | Marked permanently failed |
| Alert | Any stuck/expired messages | Telegram notification to Elmar |
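
The table's rules reduce to a small decision function. The following is a sketch under assumptions: the status names `queued` and `failed`, the returned action labels, and the choice to alert on expired and dead-lettered messages are hypothetical. The 30-minute and 3-attempt limits come from the table.

```python
from datetime import datetime, timedelta
from typing import Optional

EXPIRE_AFTER = timedelta(minutes=30)  # unclaimed messages expire after this
MAX_ATTEMPTS = 3                      # failed messages are dead-lettered at this count


def reap_action(status: str, created_at: datetime, attempts: int,
                now: datetime) -> Optional[str]:
    """Decide what the reaper should do with one queued message."""
    if status == "failed":
        return "retry" if attempts < MAX_ATTEMPTS else "dead_letter"
    if status == "queued" and now - created_at > EXPIRE_AFTER:
        return "expire"
    return None  # message is still healthy, leave it alone


def needs_alert(action: Optional[str]) -> bool:
    """Assumption: only expired and dead-lettered messages trigger a notification."""
    return action in ("expire", "dead_letter")
```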
Before MC-462, messages like "yes, go ahead" could sit indefinitely unprocessed with no visibility.
Built into the ticket worker loop:
- Heartbeat interval: every 30 minutes, workers touch the MC ticket to signal liveness
- Max worker runtime: hard kill after 60 minutes -- prevents infinite heartbeat spam from stuck processes
- When elapsed time exceeds MAX_WORKER_RUNTIME, the heartbeat thread kills the worker process and comments on the ticket
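
The heartbeat and hard-kill behaviour can be sketched as a supervisor loop running in a background thread. This is an illustration, not the worker's actual code: `touch_ticket` and `comment` are hypothetical callbacks standing in for the MC API calls, and the polling interval is an assumption.

```python
import subprocess
import threading
import time
from typing import Callable

HEARTBEAT_INTERVAL = 30 * 60   # seconds, from the list above
MAX_WORKER_RUNTIME = 60 * 60   # seconds, from the list above


def supervise(process: subprocess.Popen,
              touch_ticket: Callable[[], None],
              comment: Callable[[str], None],
              heartbeat_interval: float = HEARTBEAT_INTERVAL,
              max_runtime: float = MAX_WORKER_RUNTIME,
              poll: float = 1.0) -> str:
    """Signal liveness while the worker runs and kill it once max_runtime is exceeded."""
    started = time.monotonic()
    next_beat = started + heartbeat_interval
    while process.poll() is None:  # None means the worker is still running
        now = time.monotonic()
        if now - started > max_runtime:
            process.kill()
            process.wait()  # reap the process so it does not linger as a zombie
            comment(f"Worker killed after exceeding {max_runtime}s runtime")
            return "killed"
        if now >= next_beat:
            touch_ticket()
            next_beat += heartbeat_interval
        time.sleep(poll)
    return "finished"


def start_supervisor(process: subprocess.Popen,
                     touch_ticket: Callable[[], None],
                     comment: Callable[[str], None]) -> threading.Thread:
    """Run the supervisor in a daemon thread alongside the worker."""
    thread = threading.Thread(
        target=supervise, args=(process, touch_ticket, comment), daemon=True
    )
    thread.start()
    return thread
```

Using `time.monotonic()` rather than wall-clock time keeps the runtime limit correct across clock adjustments.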