10-health-and-ops/troubleshooting
2026-06-13 07:28:51 SAST

Troubleshooting Guide

Common failure modes on Luci and how to resolve them.

Stuck Workers

Symptom: Tickets stay in_progress indefinitely; no progress comments appear on the ticket.

Causes:
- Claude subprocess hung waiting for input or stuck in a loop
- API rate limit caused the worker to stall without clean exit
- Worker crashed but lock file was not cleaned up

Automated recovery:
- stuck_ticket_detector.py runs hourly: kills worker process groups (SIGTERM, then SIGKILL after 5s) and resets tickets to todo
- mc_pickup.py internal watchdog: kills workers after 60 minutes (MAX_WORKER_RUNTIME = 3600)
- Heartbeat cleans stale /tmp/luci-task-*.lock files older than 30 minutes
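The SIGTERM-then-SIGKILL escalation described above can be sketched as follows. This is a minimal illustration, not the actual code in stuck_ticket_detector.py; the function name terminate_group and its return values are hypothetical, and only the 5-second grace period comes from this page.

```python
import os
import signal
import time


def terminate_group(pid: int, grace_seconds: float = 5.0) -> str:
    """Send SIGTERM to pid's process group, then SIGKILL if it survives.

    Returns "gone", "terminated", or "killed" (hypothetical labels).
    """
    try:
        pgid = os.getpgid(pid)
    except ProcessLookupError:
        return "gone"
    if pgid == os.getpgrp():
        # Safety guard: never signal the caller's own process group.
        raise ValueError("refusing to signal our own process group")
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            # Signal 0 sends nothing; it only tests whether the group exists.
            # Note: an unreaped zombie still counts as existing.
            os.killpg(pgid, 0)
        except ProcessLookupError:
            return "terminated"
        time.sleep(0.1)

    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"
    return "killed"
```

Signalling the whole group rather than a single PID matters because a worker may have spawned children of its own; this only works if the worker was started in its own session or process group.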

Manual recovery:
1. Check for orphaned processes: ps aux | grep claude
2. Check lock files: ls /tmp/mc-worker-*.lock /tmp/luci-task-*.lock
3. Kill the process group: kill -TERM -$(ps -o pgid= -p <PID> | tr -d ' ')
4. Remove the lock file manually
5. Reset the ticket via MC API: curl -X PATCH localhost:3001/api/v1/tickets/<id> -H "Authorization: Bearer $MC_TOKEN_LUCI" -H "Content-Type: application/json" -d '{"status":"todo"}'
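Before removing lock files by hand, it can help to see which ones are actually stale. The sketch below only lists candidates and deletes nothing. The function name stale_locks is hypothetical, and applying the 30-minute threshold to /tmp/mc-worker-*.lock is an assumption: this page only states it for /tmp/luci-task-*.lock.

```python
import glob
import os
import time

# Lock file patterns named in this guide.
LOCK_PATTERNS = ("/tmp/mc-worker-*.lock", "/tmp/luci-task-*.lock")


def stale_locks(patterns=LOCK_PATTERNS, max_age_seconds=30 * 60, now=None):
    """Return lock files whose modification time is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                age = now - os.path.getmtime(path)
            except FileNotFoundError:
                # The lock was released between glob() and getmtime().
                continue
            if age > max_age_seconds:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    for path in stale_locks():
        print(path)
```

Check that no live worker owns a listed lock (step 1) before deleting it.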

API Rate Limits (529 Errors)

Symptom: Workers fail with "overloaded" or "529" in output. Multiple workers may fail simultaneously.

Causes:
- Anthropic API overloaded (usually during peak hours)
- Too many concurrent workers hitting the API simultaneously
- Council reviews (second-opinion skill) making parallel API calls

Automated tracking:
- mc_pickup.py detects "529" or "rate limit" in worker output and logs to api_error_tracker.py
- Error log at ~/workspace/logs/api-errors.jsonl
- Summary available via python3 scripts/api_error_tracker.py summary
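The detection step can be approximated as a pattern match over worker output followed by one JSON line per hit. This is a sketch only: the function names, the record fields (ts, ticket_id, line), and the inclusion of "overloaded" in the pattern are assumptions, not the real format used by mc_pickup.py or api_error_tracker.py.

```python
import json
import re
from datetime import datetime, timezone
from typing import Optional

# "529" and "rate limit" come from this guide; "overloaded" is added here
# because the symptom text mentions it (an assumption about real matching).
API_ERROR_PATTERN = re.compile(r"\b529\b|rate limit|overloaded", re.IGNORECASE)


def detect_api_error(worker_output: str) -> Optional[str]:
    """Return the first output line that looks like an API overload, if any."""
    for line in worker_output.splitlines():
        if API_ERROR_PATTERN.search(line):
            return line.strip()
    return None


def to_log_record(ticket_id: str, line: str, now: Optional[datetime] = None) -> str:
    """Build one JSONL record; the field names are hypothetical."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({"ts": now.isoformat(), "ticket_id": ticket_id, "line": line})
```

The word boundaries around 529 keep unrelated numbers such as 5290 from triggering a false positive.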

Resolution:
- Usually transient -- workers retry on next pickup cycle
- If persistent, reduce concurrent worker count (configured in mc_pickup.py)
- Check python3 scripts/api_error_tracker.py recent --limit 20 to see the pattern
- During severe outages, temporarily disable pickup tasks to avoid wasting cycles

Pickup Tasks Getting Disabled

Symptom: No new tickets being picked up. ticket-pickup.md or needs-input-pickup.md show enabled: false.

Causes:
- Scheduler disables tasks after repeated failures
- Manual disable during maintenance that was not re-enabled

Automated recovery:
- pickup_watchdog.py runs periodically, flips enabled: false back to enabled: true, and sends a Telegram alert

Manual recovery:
1. Edit the task file directly: vim ~/workspace/tasks/ticket-pickup.md
2. Change enabled: false to enabled: true
3. Verify scheduler sees it: check next scheduler tick output
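The same flip can be scripted. The sketch below assumes the task file opens with a frontmatter block delimited by --- lines and that enabled sits on its own line; the function name enable_task is hypothetical and this is not the code in pickup_watchdog.py.

```python
import re
from typing import Tuple


def enable_task(text: str) -> Tuple[str, bool]:
    """Flip `enabled: false` to `enabled: true` inside the leading frontmatter.

    Returns the (possibly updated) text and whether anything changed.
    Text after the closing --- is never touched.
    """
    match = re.match(r"\A---\n(.*?\n)---\n", text, flags=re.DOTALL)
    if match is None:
        return text, False
    frontmatter = match.group(1)
    updated, count = re.subn(
        r"(?m)^enabled:[ \t]*false[ \t]*$", "enabled: true", frontmatter
    )
    if count == 0:
        return text, False
    return text[: match.start(1)] + updated + text[match.end(1):], True
```

Restricting the substitution to the frontmatter avoids rewriting a task body that happens to mention enabled: false in its instructions.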

Claude Auth Failure

Symptom: All workers fail immediately. pickup_watchdog.py reports "CRITICAL: Claude auth broken."

Causes:
- API key expired or revoked
- Environment variable ANTHROPIC_API_KEY not loaded (systemd/cron env issue)
- Claude CLI binary corrupted or missing

Resolution:
1. Check the key: echo $ANTHROPIC_API_KEY | head -c 20
2. Verify CLI works: ~/.local/bin/claude --version
3. Test auth: ~/.local/bin/claude -p "Say ok" --dangerously-skip-permissions --max-turns 1
4. If env is missing, check that api_keys.env is sourced in systemd unit files and cron environment
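When the check runs somewhere its output may be logged (cron mail, journald), a variant that reports only whether the variable is set, its length, and a short prefix avoids leaking most of the key. This is a hypothetical helper, not an existing script on Luci.

```python
import os


def describe_key(env=None, name="ANTHROPIC_API_KEY", visible=4):
    """Report whether the variable is set without printing the secret itself."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value.strip():
        return f"{name} is not set"
    return f"{name} is set ({len(value)} chars, starts with {value[:visible]}...)"


if __name__ == "__main__":
    print(describe_key())
```

Running it from the same systemd unit or cron entry as the workers shows what environment they really see, which is often different from an interactive shell.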

OAuth Token Expiry

Symptom: Google Workspace or Microsoft 365 integrations fail. Dashboard shows "OAuth expired."

Causes:
- Refresh token expired (Google tokens expire after ~7 days without use; M365 after 90 days)
- Token file corrupted or deleted

Detection:
- oauth_health_check.py runs hourly, writes status to ~/workspace/data/oauth-health-status.json
- Heartbeat reads this status and reflects it in health state
- Telegram alert sent with re-auth link on failure

Resolution:
1. Open http://100.118.207.3:8788 (auth portal on Luci)
2. Re-authenticate the expired service (GWS or M365)
3. Verify: python3 scripts/oauth_health_check.py
4. For M365 specifically: python3 scripts/graph_api.py refresh-token or python3 scripts/graph_api.py login

Service Crashes (Mission Control, Spotify Radio, etc.)

Symptom: MC API not responding (heartbeat shows "critical"). Dashboard inaccessible.

Causes:
- Node.js process crashed (memory, unhandled exception)
- Systemd service stopped

Resolution:
1. Check service status: systemctl --user status mission-control
2. View recent logs: journalctl --user -u mission-control --since "1 hour ago"
3. Restart: systemctl --user restart mission-control
4. For port conflicts: lsof -i :3001 to find what's using the port
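A quick way to tell "service down" from "service up but unhealthy" is to test whether anything accepts TCP connections on the MC port. The sketch below uses only the standard library; the function name port_is_open is hypothetical, and port 3001 is taken from the commands on this page.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("MC port 3001 open:", port_is_open("127.0.0.1", 3001))
```

If the port is open while systemctl reports the service as stopped, another process holds the port, and lsof -i :3001 identifies it.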

Git Sync Issues

Symptom: PKA repo out of date. Changes from Lucienne not appearing on Luci, or Luci's changes not visible to Lucienne.

Causes:
- git-sync task disabled (currently enabled: false)
- Merge conflict during rebase
- SSH key permission issue
- vault.db or worktree files causing conflicts

Resolution:
1. Check task status: read ~/workspace/tasks/git-sync.md -- verify enabled: true
2. Manual sync:
   cd ~/workspace/PKA
   git stash
   GIT_SSH_COMMAND='ssh -i ~/.ssh/id_ed25519_pka -o IdentitiesOnly=yes' git pull --rebase
   GIT_SSH_COMMAND='ssh -i ~/.ssh/id_ed25519_pka -o IdentitiesOnly=yes' git push
   git stash pop
3. If the rebase hits a conflict: git rebase --abort, then git pull --no-rebase to merge instead
4. Verify excluded files are not staged: vault.db, mc.db, .bak, luci-status.json, .claude/worktrees/
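Step 4 can be checked mechanically by comparing the staged paths against the exclusion list. The sketch below is hypothetical tooling, not part of the git-sync task; it treats .bak as a filename suffix and .claude/worktrees/ as a path prefix, both of which are interpretations of the list above.

```python
import posixpath
import subprocess

EXCLUDED_NAMES = {"vault.db", "mc.db", "luci-status.json"}
EXCLUDED_SUFFIXES = (".bak",)              # assumption: ".bak" means "any *.bak file"
EXCLUDED_DIRS = (".claude/worktrees/",)    # assumption: matched as a path prefix


def find_excluded(staged_paths):
    """Return the staged paths that should never be committed."""
    offenders = []
    for path in staged_paths:
        name = posixpath.basename(path)
        if (
            name in EXCLUDED_NAMES
            or name.endswith(EXCLUDED_SUFFIXES)
            or any(path.startswith(prefix) for prefix in EXCLUDED_DIRS)
        ):
            offenders.append(path)
    return offenders


def staged_paths(repo_dir):
    """List staged files using `git diff --cached --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Calling find_excluded(staged_paths(repo_dir)) on the PKA checkout before pushing returns an empty list when nothing excluded is staged.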

Database Locks

Symptom: SQLite "database is locked" errors in mc.db or vault.db.

Causes:
- Multiple writers to the same database simultaneously
- Long-running transaction holding a lock
- Process crashed while holding a write lock

Resolution:
1. Check for processes holding the database: fuser ~/workspace/mission-control/mc.db
2. mc.db uses WAL mode, which allows concurrent reads -- locks are usually from competing writes
3. If truly locked: identify and kill the offending process, then run sqlite3 ~/workspace/mission-control/mc.db "PRAGMA wal_checkpoint(TRUNCATE);" to clean up the WAL
4. vault.db is read-only on Luci (owned by Lucienne) -- if lock errors appear, something is incorrectly trying to write to it
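The checkpoint in step 3 can also be run from Python, which makes the result easier to inspect. This is a minimal sketch using the standard sqlite3 module; the function name checkpoint_wal and the 5-second busy timeout are assumptions.

```python
import sqlite3


def checkpoint_wal(db_path: str, busy_timeout_ms: int = 5000):
    """Run a TRUNCATE checkpoint and return (journal_mode, busy, log_frames, checkpointed).

    busy == 1 means another connection blocked the checkpoint from completing.
    """
    # The connect timeout makes SQLite wait for locks instead of failing at once.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    try:
        mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
        busy, log_frames, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE);"
        ).fetchone()
        return mode, busy, log_frames, checkpointed
    finally:
        conn.close()
```

A busy value of 1 after the offending process has been killed suggests that some other connection still has the database open, so re-run fuser before retrying.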

Scheduler Not Running Tasks

Symptom: Tasks not executing on schedule. Dashboard shows stale "last run" times.

Causes:
- Scheduler cron entry missing from crontab
- Task file has enabled: false
- Task file has invalid YAML frontmatter
- Scheduler process crashing on tick

Resolution:
1. Check crontab: crontab -l -- should have scheduler.py tick entry
2. Check task file frontmatter: valid id, schedule, enabled: true
3. Manual tick: python3 ~/workspace/scheduler.py tick -- watch for errors
4. Check scheduler logs in mc.db task_runs table
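The frontmatter check in step 2 can be sketched as a small linter. It assumes flat key: value frontmatter between --- delimiters and is deliberately not a full YAML parser, so it may differ from what scheduler.py accepts; the function name check_task_file and the problem messages are hypothetical.

```python
REQUIRED_KEYS = ("id", "schedule", "enabled")  # fields named in step 2


def check_task_file(text: str):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing opening '---' frontmatter delimiter"]
    try:
        end = lines.index("---", 1)
    except ValueError:
        return ["missing closing '---' frontmatter delimiter"]

    problems = []
    fields = {}
    for line in lines[1:end]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        if line.startswith((" ", "\t", "-")):
            continue  # nested YAML is outside the scope of this sketch
        key, sep, value = line.partition(":")
        if not sep:
            problems.append(f"not a 'key: value' line: {line!r}")
            continue
        fields[key.strip()] = value.strip()

    for key in REQUIRED_KEYS:
        if not fields.get(key):
            problems.append(f"missing or empty field: {key}")
    if fields.get("enabled", "").lower() == "false":
        problems.append("task is disabled (enabled: false)")
    return problems
```

Running this over every file in ~/workspace/tasks/ gives a quick overview of which tasks the scheduler is likely to skip.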
