Health and Operations Overview

Luci's health monitoring, backup strategy, and operational maintenance systems.

Heartbeat System

Script: /home/lucienne/workspace/heartbeat.py
Schedule: Every 5 minutes via cron (one of only two allowed crontab entries)
Output: Writes /home/lucienne/workspace/PKA/luci-status.json for the PKA dashboard (02-mission-control/overview)

What It Checks

Claude Code CLI -- verifies claude --version responds within 5 seconds. Failure = "degraded"
Mission Control API -- hits localhost:3001/api/v1/tickets?limit=1 with bearer token. Failure = "critical"
Disk space -- checks free disk percentage. Below 20% = "degraded", below 10% = "critical"
Failed tasks (1h) -- queries task_runs table in mc.db. 3+ failures in the last hour = "critical"
Stale lock cleanup -- removes /tmp/luci-task-*.lock files older than 30 minutes (prevents scheduler deadlocks)
OAuth token health -- reads status from /home/lucienne/workspace/data/oauth-health-status.json (written by the hourly OAuth health check). Expired tokens = "degraded"
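Two of the checks above, disk space and stale lock cleanup, can be sketched as below. This is a minimal illustration, not the actual heartbeat.py code: the function names are hypothetical, and only the thresholds (20%, 10%, 30 minutes) and the lock file pattern come from the list above.

```python
import glob
import os
import shutil
import time

# Thresholds are taken from the check list; all names here are hypothetical.
DISK_DEGRADED_BELOW_PCT = 20.0
DISK_CRITICAL_BELOW_PCT = 10.0
LOCK_MAX_AGE_SECONDS = 30 * 60


def free_disk_pct(path="/"):
    """Return free space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def classify_disk(free_pct):
    """Map a free-disk percentage to a health state."""
    if free_pct < DISK_CRITICAL_BELOW_PCT:
        return "critical"
    if free_pct < DISK_DEGRADED_BELOW_PCT:
        return "degraded"
    return "healthy"


def remove_stale_locks(pattern="/tmp/luci-task-*.lock",
                       max_age=LOCK_MAX_AGE_SECONDS, now=None):
    """Delete lock files whose mtime is older than `max_age` seconds."""
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(pattern):
        try:
            if now - os.path.getmtime(path) > max_age:
                os.remove(path)
                removed.append(path)
        except FileNotFoundError:
            # Another process removed the lock between glob and stat.
            continue
    return removed
```

Passing `pattern` and `now` as parameters keeps the cleanup testable against a scratch directory instead of /tmp.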

Health States

State	Meaning	Action
healthy	All checks pass	None
degraded	Non-critical issue (disk low, CLI down, OAuth expired)	Logged, visible on dashboard
critical	Core service down or multiple failures	Telegram alert sent immediately
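One plausible way to combine per-check results into a single state is to report the worst one. The sketch below assumes that rule; the source does not say how heartbeat.py aggregates, and the names `overall_state` and `should_alert` are hypothetical.

```python
# Severity order for the three states in the table above.
SEVERITY = {"healthy": 0, "degraded": 1, "critical": 2}


def overall_state(check_results):
    """Return the worst state among per-check results (assumed rule)."""
    return max(check_results.values(), key=SEVERITY.__getitem__, default="healthy")


def should_alert(state):
    """Only critical triggers an immediate Telegram alert, per the table."""
    return state == "critical"
```

For example, `overall_state({"cli": "degraded", "api": "healthy"})` yields "degraded", which is logged and shown on the dashboard but does not alert.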

Data Flow

Heartbeat results are logged to heartbeats table in mc.db (pruned after 7 days)
Status JSON includes: scheduled task list with last/next run times, recent task runs, 24h failure list
Dashboard reads luci-status.json to render the Luci health panel
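Because the dashboard reads luci-status.json while the heartbeat rewrites it every 5 minutes, an atomic write avoids the reader seeing a half-written file. The sketch below shows one common approach (temp file plus `os.replace`); whether heartbeat.py does this is not stated in the source, and `write_status` is a hypothetical name.

```python
import json
import os
import tempfile


def write_status(path, status):
    """Write `status` as JSON to `path` atomically (assumed approach)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(status, fh, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```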

OAuth Health Check

Script: /home/lucienne/workspace/scripts/oauth_health_check.py
Schedule: Hourly via the scheduler (03-scheduler/overview)

Checks two token sources:
- Google Workspace (GWS) -- tests via gws drive about get API call. Detects invalid_grant, invalid_client
- Microsoft 365 (M365) -- validates ~/.graph-api-token.json refresh token, attempts token refresh via graph_api.py

On failure with needs_reauth: true, the check sends a Telegram alert with a link to the re-auth portal at http://100.118.207.3:8788. Scheduler-created MC auth tickets are assigned to Elmar in needs_input status, with the portal link in the description; Luci should not try to self-heal revoked OAuth refresh tokens.
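The re-auth decision can be sketched as a string check on the provider's error output. Only the error codes, the needs_reauth flag, and the portal URL come from the text above; the function names and the other status fields are hypothetical, and the real script may detect these errors differently.

```python
# Error codes that mean the refresh token is unusable and a human must re-auth.
REAUTH_ERRORS = ("invalid_grant", "invalid_client")
PORTAL_URL = "http://100.118.207.3:8788"


def needs_reauth(error_output):
    """True if the error text contains one of the re-auth error codes."""
    return any(code in error_output for code in REAUTH_ERRORS)


def build_status(provider, ok, error_output=""):
    """Build a status record; field names other than needs_reauth are assumed."""
    return {
        "provider": provider,
        "ok": ok,
        "needs_reauth": (not ok) and needs_reauth(error_output),
    }


def alert_text(status):
    """Return a Telegram message for re-auth failures, else None."""
    if not status["needs_reauth"]:
        return None
    return f"{status['provider']} OAuth needs re-auth: {PORTAL_URL}"
```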

Backup Strategy

Workspace Backup

Task: workspace-backup (daily at 02:00 UTC)
- Commits all workspace changes to github.com/conrelma/luci-workspace
- Covers: scripts/, tasks/, price-watch/, projects/smart-money, projects/padel-tournament, scheduler.py, heartbeat.py, luci-manifest.md, Vault/
- Excludes: .env files, databases, logs, repos with their own remotes (cowork, PKA, mission-control, f1-predictor)

Mission Control DB Backup

Task: mc-db-backup (daily at 02:00 SAST)
- Uses SQLite online backup API for safe concurrent backup
- Keeps 7 daily rotating backups
- mc.db uses WAL mode for concurrent read/write safety
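A minimal sketch of this kind of backup, using Python's `sqlite3.Connection.backup` (which wraps the SQLite online backup API) and a 7-file rotation. The backup file naming (`mc-YYYYMMDD.db`), the directory layout, and the function name are assumptions for illustration, not the actual mc-db-backup task.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_db(src_path, backup_dir, keep=7, stamp=None):
    """Copy a live SQLite database and keep only the newest `keep` backups."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = stamp or datetime.now(timezone.utc).strftime("%Y%m%d")
    dest = backup_dir / f"mc-{stamp}.db"  # hypothetical naming scheme

    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dest)
    try:
        # Online backup: safe while other connections read and write the source.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # Date-stamped names sort chronologically, so the oldest come first.
    backups = sorted(backup_dir.glob("mc-*.db"))
    for old in backups[:max(0, len(backups) - keep)]:
        old.unlink()
    return dest
```

Copying the file with `cp` instead would risk a torn copy when the database is in WAL mode and being written, which is the reason to prefer the backup API here.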

Git Sync (PKA Repo)

Task: git-sync (every 15 minutes, currently disabled)
- Pulls latest from GitHub, commits local changes, pushes
- Excludes: vault.db, mc.db, .bak files, luci-status.json, .claude/worktrees/
- Uses dedicated SSH key (id_ed25519_pka)
- vault.db is owned by Lucienne -- Luci reads it via pull but never writes

Skills Sync

Task: skills-sync -- pushes skill changes from ~/.claude-repo/skills/ to GitHub (conrelma/claude)

Luci Janitor (MC-2716)

Scripts: ~/workspace/mission-control/luci_janitor.py + janitor_classifier.py
Schedule: Hourly via the scheduler (03-scheduler/overview), as the luci-janitor task
Since: 2026-05-02 -- replaces stuck-ticket-detector (which only did timeout-based resets)

Classifies ALL non-terminal MC tickets and auto-recovers where possible. Uses janitor_classifier.py for pure classification + action planning (fully unit-tested).

Classifier Verdict	Action
stale_in_progress	Kill worker process (SIGTERM→SIGKILL), reset to `todo`
stale_needs_input	Reset to `todo` if >24h with no reply
orphan_worker	Reset to `todo`, alert via Telegram
auto_close	Close with resolution comment (e.g. source email actioned, MC-2700)

Learnings are logged to ~/workspace/janitor_learnings.jsonl for pattern analysis.
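The split between pure classification and action planning might look like the sketch below. The verdict names, the actions, and the 24-hour threshold come from the table above; everything else is an assumption made for illustration, including the `Ticket` fields, the 1-hour in-progress threshold, and the reading of orphan_worker as "in progress with no live worker". It is not the contents of janitor_classifier.py.

```python
from dataclasses import dataclass
from typing import List, Optional

# Verdict -> planned actions, following the table above (action names are hypothetical).
ACTIONS = {
    "stale_in_progress": ["kill_worker", "reset_to_todo"],
    "stale_needs_input": ["reset_to_todo"],
    "orphan_worker": ["reset_to_todo", "telegram_alert"],
    "auto_close": ["close_with_comment"],
}


@dataclass(frozen=True)
class Ticket:
    status: str
    hours_since_update: float
    worker_alive: bool = False
    source_actioned: bool = False  # e.g. the source email was already handled


def classify(ticket: Ticket, stale_in_progress_hours: float = 1.0) -> Optional[str]:
    """Return a verdict for a non-terminal ticket, or None if it looks fine."""
    if ticket.source_actioned:
        return "auto_close"
    if ticket.status == "needs_input" and ticket.hours_since_update > 24:
        return "stale_needs_input"
    if ticket.status == "in_progress":
        if not ticket.worker_alive:
            return "orphan_worker"  # assumed meaning: no live worker process
        if ticket.hours_since_update > stale_in_progress_hours:
            return "stale_in_progress"
    return None


def plan(ticket: Ticket) -> List[str]:
    """Pure action planning: no I/O, so it is straightforward to unit-test."""
    return ACTIONS.get(classify(ticket), [])
```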

Janitor Morning Brief (`janitor-morning-brief.md`)

Daily at 06:30 UTC — posts digest to standing brief ticket + Elmar Inbox
Summarizes overnight janitor activity and patterns

Janitor Weekly Digest (`janitor-weekly-digest.md`)

Weekly Monday 07:00 UTC — pattern aggregation + root-cause fix-ticket suggestions
Flags recurring issues that need code fixes

Pickup Watchdog

Script: /home/lucienne/workspace/scripts/pickup_watchdog.py
Schedule: Periodic via scheduler

Two functions:
1. Re-enable suspended pickup tasks -- if ticket-pickup.md or needs-input-pickup.md has enabled: false, flips it back to enabled: true and sends a Telegram alert
2. Validate Claude auth -- runs claude -p "Say ok" to verify the CLI can authenticate. Sends a CRITICAL Telegram alert if auth is broken.
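The re-enable step amounts to a text edit on the task file. The sketch below assumes the flag sits on its own line as `enabled: false`; the function name is hypothetical, and the real watchdog may parse the file differently.

```python
import re

# Matches a line consisting of "enabled: false" (assumed task-file layout).
_DISABLED = re.compile(r"^(enabled:[ \t]*)false[ \t]*$", re.MULTILINE)


def reenable(task_text):
    """Return (new_text, changed) with enabled: false flipped to enabled: true."""
    new_text, count = _DISABLED.subn(r"\g<1>true", task_text)
    return new_text, count > 0
```

The `changed` flag tells the caller whether to write the file back and send the Telegram alert.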

API Error Tracking

Script: /home/lucienne/workspace/scripts/api_error_tracker.py
Log: ~/workspace/logs/api-errors.jsonl

Tracks 529 (overloaded) errors and rate limits across all Claude worker sessions:
- Workers call api_error_tracker.py log when they detect 529/overload/rate-limit in output
- api_error_tracker.py summary returns counts for last 1h and 24h, grouped by error type
- Used by heartbeat/dashboard for visibility into API pressure
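The summary step can be sketched as a windowed count over JSONL records. The record shape used here (`ts` as epoch seconds, `type` as the error kind) is an assumption; the actual fields in api-errors.jsonl are not documented above.

```python
import json
import time
from collections import Counter


def summarize(lines, now=None):
    """Count errors by type for the last 1h and 24h from JSONL lines."""
    now = time.time() if now is None else now
    windows = {"1h": Counter(), "24h": Counter()}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # assumed shape: {"ts": <epoch>, "type": <str>}
        age = now - record["ts"]
        if age <= 3600:
            windows["1h"][record["type"]] += 1
        if age <= 86400:
            windows["24h"][record["type"]] += 1
    return {name: dict(counts) for name, counts in windows.items()}
```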

Queue Reaper (MC-462)

Script: /home/lucienne/workspace/scripts/queue_reaper.py
Schedule: Every 15 minutes via the scheduler (03-scheduler/overview)

Prevents stuck queued_messages — user replies that workers failed to consume:

Action	Trigger	Result
Expire	Message >30 min unclaimed	Marked expired, Telegram alert
Retry	Failed message, <3 attempts	Re-queued for pickup
Dead-letter	3+ failed attempts	Marked permanently failed
Alert	Any stuck/expired messages	Telegram notification to Elmar

Before MC-462, messages like "yes, go ahead" could sit indefinitely unprocessed with no visibility.
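The reaper's decision table can be expressed as a small pure function. The thresholds (30 minutes, 3 attempts) come from the table above; the status values "queued" and "failed" and the function name are hypothetical, since the queued_messages schema is not shown.

```python
EXPIRE_AFTER_MINUTES = 30
MAX_ATTEMPTS = 3


def reap_action(status, age_minutes, attempts):
    """Return 'expire', 'retry', 'dead_letter', or None for one queued message."""
    if status == "failed":
        return "retry" if attempts < MAX_ATTEMPTS else "dead_letter"
    if status == "queued" and age_minutes > EXPIRE_AFTER_MINUTES:
        return "expire"
    return None
```

A caller would send the Telegram notification whenever any message in the batch gets a non-None action.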

Worker Watchdog (in mc_pickup.py)

Built into the ticket worker loop:
- Heartbeat interval: every 30 minutes, workers touch the MC ticket to signal liveness
- Max worker runtime: 60 minutes hard kill -- prevents infinite heartbeat spam from stuck processes
- When elapsed time exceeds MAX_WORKER_RUNTIME, the heartbeat thread kills the worker process and comments on the ticket
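A heartbeat thread with a hard runtime cap could be structured as below. This is a sketch under assumptions, not the code in mc_pickup.py: `start_watchdog`, `touch_ticket`, and `comment` are hypothetical names, and the two callbacks stand in for the MC API calls.

```python
import subprocess
import threading
import time

HEARTBEAT_INTERVAL = 30 * 60  # seconds between ticket touches
MAX_WORKER_RUNTIME = 60 * 60  # seconds before the worker is hard-killed


def start_watchdog(proc, touch_ticket, comment,
                   interval=HEARTBEAT_INTERVAL, max_runtime=MAX_WORKER_RUNTIME):
    """Touch the ticket while `proc` runs; kill it once max_runtime is exceeded."""
    started = time.monotonic()

    def loop():
        while proc.poll() is None:
            elapsed = time.monotonic() - started
            if elapsed >= max_runtime:
                proc.kill()
                comment("Worker exceeded MAX_WORKER_RUNTIME and was killed")
                return
            touch_ticket()
            try:
                # Sleep until the next heartbeat or the deadline, waking early on exit.
                proc.wait(timeout=min(interval, max_runtime - elapsed))
            except subprocess.TimeoutExpired:
                pass

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Here `proc` is a `subprocess.Popen` handle for the worker. Checking the deadline before each touch is what stops a stuck process from heartbeating forever.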

Key Takeaways

Heartbeat runs every 5 minutes, checks 6 health dimensions, alerts on critical via Telegram
Three health states: healthy, degraded, critical -- with automatic escalation
Backups: workspace to GitHub daily, mc.db daily with 7-day rotation, PKA repo git-sync every 15 minutes (currently disabled)
Luci janitor (MC-2716) runs hourly, classifies non-terminal tickets, auto-recovers where possible; replaced stuck-ticket-detector on 2026-05-02
Pickup watchdog prevents the system from getting permanently stuck by re-enabling disabled pickup tasks
Queue reaper runs every 15 min to expire stuck messages (>30 min), retry failed ones, and alert via Telegram
API error tracker provides visibility into 529/rate-limit pressure across all workers
