10-health-and-ops/overview
2026-06-13 08:46:36 SAST
Health and Operations Overview

This page covers Luci's health monitoring, backup strategy, and operational maintenance systems.

Heartbeat System

Script: /home/lucienne/workspace/heartbeat.py
Schedule: Every 5 minutes via cron (one of only two allowed crontab entries)
Output: Writes /home/lucienne/workspace/PKA/luci-status.json for the PKA dashboard (02-mission-control/overview)

What It Checks

  1. Claude Code CLI -- verifies claude --version responds within 5 seconds. Failure = "degraded"
  2. Mission Control API -- hits localhost:3001/api/v1/tickets?limit=1 with bearer token. Failure = "critical"
  3. Disk space -- checks free disk percentage. Below 20% = "degraded", below 10% = "critical"
  4. Failed tasks (1h) -- queries task_runs table in mc.db. 3+ failures in the last hour = "critical"
  5. Stale lock cleanup -- removes /tmp/luci-task-*.lock files older than 30 minutes (prevents scheduler deadlocks)
  6. OAuth token health -- reads status from /home/lucienne/workspace/data/oauth-health-status.json (written by the hourly OAuth health check). Expired tokens = "degraded"

Health States

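State    | Meaning                                                | Action
---------|--------------------------------------------------------|--------------------------------
healthy  | All checks pass                                        | None
degraded | Non-critical issue (disk low, CLI down, OAuth expired) | Logged, visible on dashboard
critical | Core service down or multiple failures                 | Telegram alert sent immediately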
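
One plausible way to combine per-check results into a single state is to take the most severe one. The sketch below does that and writes the result atomically so the dashboard never reads a half-written file. The "worst state wins" rule, the JSON layout, and the function names are assumptions; the actual structure of luci-status.json is not described on this page.

```python
import json
import os
import tempfile

SEVERITY = {"healthy": 0, "degraded": 1, "critical": 2}


def overall_state(check_results: dict[str, str]) -> str:
    """Return the most severe state among all checks (assumed rule)."""
    return max(check_results.values(), key=SEVERITY.__getitem__, default="healthy")


def write_status(path: str, check_results: dict[str, str]) -> dict:
    """Write a status file atomically: temp file first, then rename."""
    payload = {"status": overall_state(check_results), "checks": check_results}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh, indent=2)
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    return payload
```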

Data Flow

OAuth Health Check

Script: /home/lucienne/workspace/scripts/oauth_health_check.py
Schedule: Hourly via the scheduler (03-scheduler/overview)

Checks two token sources:
  - Google Workspace (GWS) -- tests via a gws drive about get API call. Detects invalid_grant and invalid_client
  - Microsoft 365 (M365) -- validates the ~/.graph-api-token.json refresh token and attempts a token refresh via graph_api.py

On failure with needs_reauth: true, the check sends a Telegram alert with a link to the re-auth portal at http://100.118.207.3:8788. Scheduler-created MC auth tickets are assigned to Elmar in the needs_input state, with the portal link in the description; Luci should not try to self-heal revoked OAuth refresh tokens.
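
The sketch below shows how a check result could be turned into a status record with a needs_reauth flag. It assumes that invalid_grant and invalid_client are the errors that require manual re-authentication; the function name, status labels, and record layout are hypothetical and may not match oauth-health-status.json.

```python
# Error codes the page says are detected; treating them as "needs re-auth"
# is an assumption made for this sketch.
REAUTH_ERRORS = ("invalid_grant", "invalid_client")


def classify_oauth_result(provider: str, ok: bool, error_text: str = "") -> dict:
    """Build a status record for one token source."""
    needs_reauth = (not ok) and any(code in error_text for code in REAUTH_ERRORS)
    if ok:
        status = "healthy"
    elif needs_reauth:
        status = "expired"   # a human must re-authorize via the portal
    else:
        status = "error"     # e.g. a network failure; may succeed next hour
    return {"provider": provider, "status": status, "needs_reauth": needs_reauth}
```

Separating transient errors from revoked tokens reflects the rule above that revoked refresh tokens are escalated rather than retried.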

Backup Strategy

Workspace Backup

Task: workspace-backup (daily at 02:00 UTC)
  - Commits all workspace changes to github.com/conrelma/luci-workspace
  - Covers: scripts/, tasks/, price-watch/, projects/smart-money, projects/padel-tournament, scheduler.py, heartbeat.py, luci-manifest.md, Vault/
  - Excludes: .env files, databases, logs, repos with their own remotes (cowork, PKA, mission-control, f1-predictor)

Mission Control DB Backup

Task: mc-db-backup (daily at 02:00 SAST)
  - Uses the SQLite online backup API for safe concurrent backup
  - Keeps 7 daily rotating backups
  - mc.db uses WAL mode for concurrent read/write safety

Git Sync (PKA Repo)

Task: git-sync (every 15 minutes, currently disabled)
  - Pulls latest from GitHub, commits local changes, pushes
  - Excludes: vault.db, mc.db, .bak files, luci-status.json, .claude/worktrees/
  - Uses a dedicated SSH key (id_ed25519_pka)
  - vault.db is owned by Lucienne -- Luci reads it via pull but never writes

Skills Sync

Task: skills-sync -- pushes skill changes from ~/.claude-repo/skills/ to GitHub (conrelma/claude)

Luci Janitor (MC-2716)

Scripts: ~/workspace/mission-control/luci_janitor.py + janitor_classifier.py
Schedule: Hourly via the scheduler (03-scheduler/overview), as the luci-janitor task
Since: 2026-05-02; replaces stuck-ticket-detector (which only did timeout-based resets)

Classifies ALL non-terminal MC tickets and auto-recovers where possible. Uses janitor_classifier.py for pure classification + action planning (fully unit-tested).

Classifier Verdict | Action
-------------------|-------------------------------------------------------------------
stale_in_progress  | Kill worker process (SIGTERM→SIGKILL), reset to todo
stale_needs_input  | Reset to todo if >24h with no reply
orphan_worker      | Reset to todo, alert via Telegram
auto_close         | Close with resolution comment (e.g. source email actioned, MC-2700)
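
The SIGTERM→SIGKILL escalation used for stale_in_progress tickets can be sketched as follows. This is a generic pattern built on subprocess.Popen, not the janitor's own code; the function name and the 5-second grace period are assumptions.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask a process to exit, then force-kill it if it ignores the request."""
    if proc.poll() is not None:
        return "already_exited"
    proc.terminate()              # SIGTERM on POSIX: lets the worker clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()               # SIGKILL on POSIX: cannot be caught or ignored
        proc.wait()               # reap the process so it does not linger as a zombie
        return "killed"
```

Trying SIGTERM first gives a worker the chance to release locks and flush output before it is forcibly stopped.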

Learnings are logged to ~/workspace/janitor_learnings.jsonl for pattern analysis.

Janitor Morning Brief (janitor-morning-brief.md)

Janitor Weekly Digest (janitor-weekly-digest.md)

Pickup Watchdog

Script: /home/lucienne/workspace/scripts/pickup_watchdog.py
Schedule: Periodic via scheduler

Two functions:
  1. Re-enable suspended pickup tasks -- if ticket-pickup.md or needs-input-pickup.md has enabled: false, flips it back to enabled: true and sends a Telegram alert
  2. Validate Claude auth -- runs claude -p "Say ok" to verify the CLI can authenticate. Sends a CRITICAL Telegram alert if auth is broken
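
The re-enable step can be illustrated with a small text transformation. The sketch assumes each task file carries a line of the form enabled: false; the function name and the sample file layout are hypothetical.

```python
import re

# Match a whole line such as "enabled: false"; [ \t] keeps the match on one line.
DISABLED = re.compile(r"^([ \t]*enabled:[ \t]*)false[ \t]*$", re.MULTILINE)


def reenable(task_text: str) -> tuple[str, bool]:
    """Return the task text with enabled flipped to true, and whether it changed."""
    new_text, count = DISABLED.subn(r"\g<1>true", task_text, count=1)
    return new_text, count > 0
```

For example, `reenable("name: ticket-pickup\nenabled: false\n")` returns the text with `enabled: true` and `True`; the boolean tells the caller whether a Telegram alert is warranted.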

API Error Tracking

Script: /home/lucienne/workspace/scripts/api_error_tracker.py
Log: ~/workspace/logs/api-errors.jsonl

Tracks 529 (overloaded) errors and rate limits across all Claude worker sessions:
  - Workers call api_error_tracker.py log when they detect 529/overload/rate-limit in output
  - api_error_tracker.py summary returns counts for the last 1h and 24h, grouped by error type
  - Used by heartbeat/dashboard for visibility into API pressure

Queue Reaper (MC-462)

Script: /home/lucienne/workspace/scripts/queue_reaper.py
Schedule: Every 15 minutes via the scheduler (03-scheduler/overview)

Prevents stuck queued_messages, i.e. user replies that workers failed to consume:

Action      | Trigger                    | Result
------------|----------------------------|--------------------------------
Expire      | Message >30 min unclaimed  | Marked expired, Telegram alert
Retry       | Failed message, <3 attempts | Re-queued for pickup
Dead-letter | 3+ failed attempts         | Marked permanently failed
Alert       | Any stuck/expired messages | Telegram notification to Elmar

Before MC-462, messages like "yes, go ahead" could sit indefinitely unprocessed with no visibility.
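
The triage rules in the table above can be sketched as a single decision function. The status labels ("pending", "failed") and the function name are assumptions; only the 30-minute and 3-attempt thresholds come from this page.

```python
MAX_ATTEMPTS = 3
UNCLAIMED_EXPIRY_MINUTES = 30


def reap_action(status: str, age_minutes: float, attempts: int) -> str:
    """Decide what the reaper should do with one queued message."""
    if status == "failed":
        # Fewer than 3 attempts: try again; otherwise give up permanently.
        return "retry" if attempts < MAX_ATTEMPTS else "dead_letter"
    if status == "pending" and age_minutes > UNCLAIMED_EXPIRY_MINUTES:
        return "expire"
    return "none"
```

Any result other than "none" would also feed the Telegram alert, so a message never stays stuck without someone being told.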

Worker Watchdog (in mc_pickup.py)

Built into the ticket worker loop:
  - Heartbeat interval: every 30 minutes, workers touch the MC ticket to signal liveness
  - Max worker runtime: 60 minutes hard kill -- prevents infinite heartbeat spam from stuck processes
  - When elapsed time exceeds MAX_WORKER_RUNTIME, the heartbeat thread kills the worker process and comments on the ticket
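
A heartbeat loop with a hard runtime cap might look like the sketch below. It is not the code in mc_pickup.py: HEARTBEAT_INTERVAL, the callback parameters, and the return values are assumptions, and only MAX_WORKER_RUNTIME is a name taken from this page.

```python
import threading
import time

HEARTBEAT_INTERVAL = 30 * 60   # seconds; hypothetical constant name
MAX_WORKER_RUNTIME = 60 * 60   # seconds


def heartbeat_loop(stop: threading.Event, touch_ticket, kill_worker,
                   interval: float = HEARTBEAT_INTERVAL,
                   max_runtime: float = MAX_WORKER_RUNTIME) -> str:
    """Touch the ticket every interval until stopped or the runtime cap is hit."""
    started = time.monotonic()
    # Event.wait returns True as soon as stop is set, ending the loop promptly.
    while not stop.wait(interval):
        if time.monotonic() - started >= max_runtime:
            kill_worker()      # expected to kill the worker and comment on the ticket
            return "killed"
        touch_ticket()
    return "stopped"
```

The loop would typically run in a daemon thread next to the worker, with the worker setting `stop` when it finishes normally. Checking the cap before each heartbeat means a stuck worker stops signalling liveness once its time is up.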

Related

Key Takeaways
