Common failure modes on Luci and how to resolve them.
Symptom: Tickets stay in_progress indefinitely; no progress comments appear on the ticket.
Causes:
- Claude subprocess hung waiting for input or stuck in a loop
- API rate limit caused the worker to stall without clean exit
- Worker crashed but lock file was not cleaned up
Automated recovery:
- stuck_ticket_detector.py runs hourly: kills worker process groups (SIGTERM, then SIGKILL after 5s) and resets tickets to todo
- mc_pickup.py internal watchdog: kills workers after 60 minutes (MAX_WORKER_RUNTIME = 3600)
- Heartbeat cleans stale /tmp/luci-task-*.lock files older than 30 minutes
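The stale-lock cleanup can be sketched as follows. This is a minimal illustration, not the Heartbeat implementation: it assumes "stale" means the file's modification time is older than the threshold, and the function name find_stale_locks is hypothetical. It only reports candidates and leaves deletion to the caller.

```python
import glob
import os
import time

STALE_AFTER_SECONDS = 30 * 60  # 30-minute threshold from the text above


def find_stale_locks(pattern, max_age=STALE_AFTER_SECONDS, now=None):
    """Return lock files matching `pattern` whose mtime is older than `max_age`.

    Assumption: staleness is judged by modification time.
    """
    now = time.time() if now is None else now
    stale = []
    for path in glob.glob(pattern):
        try:
            age = now - os.path.getmtime(path)
        except FileNotFoundError:
            continue  # lock was removed between listing and stat
        if age > max_age:
            stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    for lock in find_stale_locks("/tmp/luci-task-*.lock"):
        print("stale:", lock)  # report only; nothing is deleted here
```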
Manual recovery:
1. Check for orphaned processes: ps aux | grep claude
2. Check lock files: ls /tmp/mc-worker-*.lock /tmp/luci-task-*.lock
3. Kill the process group: kill -TERM -$(ps -o pgid= -p <PID> | tr -d ' ')
4. Remove the lock file manually
5. Reset the ticket via MC API: curl -X PATCH localhost:3001/api/v1/tickets/<id> -H "Authorization: Bearer $MC_TOKEN_LUCI" -H "Content-Type: application/json" -d '{"status":"todo"}'
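For scripted recovery, the curl call in step 5 can be expressed with the Python standard library. This is a sketch under assumptions: the API is taken to speak plain HTTP on localhost:3001, and the helper names build_reset_request and reset_ticket are hypothetical. The request is built separately from sending so it can be inspected first.

```python
import json
import os
import urllib.request

MC_BASE_URL = "http://localhost:3001/api/v1"  # host and port from the curl command; http is assumed


def build_reset_request(ticket_id, token):
    """Build (but do not send) the PATCH request that sets a ticket back to todo."""
    body = json.dumps({"status": "todo"}).encode("utf-8")
    return urllib.request.Request(
        f"{MC_BASE_URL}/tickets/{ticket_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def reset_ticket(ticket_id, token=None, timeout=10):
    """Send the request and return the HTTP status code."""
    token = token or os.environ["MC_TOKEN_LUCI"]
    request = build_reset_request(ticket_id, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```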
Symptom: Workers fail with "overloaded" or "529" in output. Multiple workers may fail simultaneously.
Causes:
- Anthropic API overloaded (usually during peak hours)
- Too many concurrent workers hitting the API simultaneously
- Council reviews (second-opinion skill) making parallel API calls
Automated tracking:
- mc_pickup.py detects "529" or "rate limit" in worker output and logs to api_error_tracker.py
- Error log at ~/workspace/logs/api-errors.jsonl
- Summary available via python3 scripts/api_error_tracker.py summary
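The detection step can be sketched roughly as below. This is an illustration, not the code in mc_pickup.py or api_error_tracker.py: the record fields (ts, ticket_id, marker) are hypothetical, and matching "overloaded" is an assumption taken from the symptom text rather than from the documented detector.

```python
import json
import re
import time

# "529" and "rate limit" come from the text; "overloaded" is an assumed extra marker.
API_ERROR_PATTERN = re.compile(r"\b529\b|rate limit|overloaded", re.IGNORECASE)


def detect_api_error(worker_output):
    """Return the first matching marker (lowercased) in the output, or None."""
    match = API_ERROR_PATTERN.search(worker_output)
    return match.group(0).lower() if match else None


def append_error_record(log_path, ticket_id, marker, now=None):
    """Append one JSON object per line (JSONL). Field names are hypothetical."""
    record = {
        "ts": time.time() if now is None else now,
        "ticket_id": ticket_id,
        "marker": marker,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

The word-boundary match on 529 avoids false positives from longer numbers that merely contain those digits.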
Resolution:
- Usually transient -- workers retry on next pickup cycle
- If persistent, reduce concurrent worker count (configured in mc_pickup.py)
- Check python3 scripts/api_error_tracker.py recent --limit 20 to see the pattern
- During severe outages, temporarily disable pickup tasks to avoid wasting cycles
Symptom: No new tickets being picked up. ticket-pickup.md or needs-input-pickup.md show enabled: false.
Causes:
- Scheduler disables tasks after repeated failures
- Manual disable during maintenance that was not re-enabled
Automated recovery:
- pickup_watchdog.py runs periodically, flips enabled: false back to enabled: true, and sends a Telegram alert
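The flip itself might look like the sketch below. It assumes the task files open with a frontmatter block delimited by --- lines, which the source does not state explicitly, and reenable_task is a hypothetical name. Only the frontmatter is touched, so an "enabled: false" string in the task body is left alone.

```python
import re

# Matches a whole `enabled: false` line; applied only to the frontmatter block.
ENABLED_FALSE = re.compile(r"^(enabled:[ \t]*)false[ \t]*$", re.MULTILINE)


def reenable_task(text):
    """Return (new_text, changed) with `enabled: false` flipped to true.

    Assumption: frontmatter starts on the first line and is delimited by `---`.
    """
    if not text.startswith("---\n"):
        return text, False  # no frontmatter: leave the file alone
    end = text.find("\n---", 4)
    if end == -1:
        return text, False  # unterminated frontmatter
    front, rest = text[:end], text[end:]
    new_front, count = ENABLED_FALSE.subn(r"\g<1>true", front, count=1)
    return new_front + rest, count > 0
```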
Manual recovery:
1. Edit the task file directly: vim ~/workspace/tasks/ticket-pickup.md
2. Change enabled: false to enabled: true
3. Verify scheduler sees it: check next scheduler tick output
Symptom: All workers fail immediately. pickup_watchdog.py reports "CRITICAL: Claude auth broken."
Causes:
- API key expired or revoked
- Environment variable ANTHROPIC_API_KEY not loaded (systemd/cron env issue)
- Claude CLI binary corrupted or missing
Resolution:
1. Check that the key is set (prints only the first 20 characters): echo "$ANTHROPIC_API_KEY" | head -c 20
2. Verify CLI works: ~/.local/bin/claude --version
3. Test auth: ~/.local/bin/claude -p "Say ok" --dangerously-skip-permissions --max-turns 1
4. If env is missing, check that api_keys.env is sourced in systemd unit files and cron environment
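Steps 1 and 2 can be combined into a small preflight check that never prints the key. This is a sketch only: check_claude_env is a hypothetical helper, and it verifies that the variable is non-empty and the binary is executable, not that the key is actually valid.

```python
import os


def check_claude_env(env=None, cli_path="~/.local/bin/claude"):
    """Return a list of problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        problems.append("ANTHROPIC_API_KEY is not set in this environment")
    resolved = os.path.expanduser(cli_path)
    if not (os.path.isfile(resolved) and os.access(resolved, os.X_OK)):
        problems.append(f"Claude CLI not found or not executable at {resolved}")
    return problems


if __name__ == "__main__":
    for problem in check_claude_env():
        print("PROBLEM:", problem)
```

Running it from the same systemd unit or cron entry as the workers shows what that environment sees, which may differ from an interactive shell.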
Symptom: Google Workspace or Microsoft 365 integrations fail. Dashboard shows "OAuth expired."
Causes:
- Refresh token expired (Google tokens expire after ~7 days without use; M365 after 90 days)
- Token file corrupted or deleted
Detection:
- oauth_health_check.py runs hourly, writes status to ~/workspace/data/oauth-health-status.json
- Heartbeat reads this status and reflects it in health state
- Telegram alert sent with re-auth link on failure
Resolution:
1. Open http://100.118.207.3:8788 (auth portal on Luci)
2. Re-authenticate the expired service (GWS or M365)
3. Verify: python3 scripts/oauth_health_check.py
4. For M365 specifically: python3 scripts/graph_api.py refresh-token or python3 scripts/graph_api.py login
Symptom: MC API not responding (heartbeat shows "critical"). Dashboard inaccessible.
Causes:
- Node.js process crashed (memory, unhandled exception)
- Systemd service stopped
Resolution:
1. Check service status: systemctl --user status mission-control
2. View recent logs: journalctl --user -u mission-control --since "1 hour ago"
3. Restart: systemctl --user restart mission-control
4. For port conflicts: lsof -i :3001 to find what's using the port
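Before and after a restart, a quick TCP probe shows whether anything is listening on the port. A minimal sketch, assuming MC listens on 127.0.0.1:3001 as the commands above suggest; port_is_open is a hypothetical helper. An open port only proves that some process accepts connections, not that it is Mission Control.

```python
import socket


def port_is_open(host="127.0.0.1", port=3001, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or unreachable


if __name__ == "__main__":
    print("port 3001 open:", port_is_open())
```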
Symptom: PKA repo out of date. Changes from Lucienne not appearing on Luci, or Luci's changes not visible to Lucienne.
Causes:
- git-sync task disabled (currently enabled: false)
- Merge conflict during rebase
- SSH key permission issue
- vault.db or worktree files causing conflicts
Resolution:
1. Check task status: read ~/workspace/tasks/git-sync.md -- verify enabled: true
2. Manual sync:
   cd ~/workspace/PKA
   git stash
   GIT_SSH_COMMAND='ssh -i ~/.ssh/id_ed25519_pka -o IdentitiesOnly=yes' git pull --rebase
   GIT_SSH_COMMAND='ssh -i ~/.ssh/id_ed25519_pka -o IdentitiesOnly=yes' git push
   git stash pop
3. If the rebase hits a conflict: git rebase --abort, then pull with a merge instead: git pull --no-rebase
4. Verify excluded files are not staged: vault.db, mc.db, .bak, luci-status.json, .claude/worktrees/
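The check in step 4 can be automated against the output of git diff --cached --name-only. The sketch below is an assumption-laden illustration: find_excluded is a hypothetical helper, ".bak" is read as "any file ending in .bak", and the other names are matched on the file's basename at any depth.

```python
import fnmatch

# Exclusions from step 4; "*.bak" is an interpretation of ".bak".
EXCLUDED_PATTERNS = ["vault.db", "mc.db", "*.bak", "luci-status.json"]
EXCLUDED_DIR = ".claude/worktrees/"


def find_excluded(staged_paths):
    """Return the staged paths that should not be committed."""
    flagged = []
    for path in staged_paths:
        name = path.rsplit("/", 1)[-1]
        in_worktree = path.startswith(EXCLUDED_DIR) or ("/" + EXCLUDED_DIR) in path
        if in_worktree or any(fnmatch.fnmatch(name, pat) for pat in EXCLUDED_PATTERNS):
            flagged.append(path)
    return flagged
```

Feeding it the lines printed by git diff --cached --name-only, split on newlines, yields the files to unstage before pushing.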
Symptom: SQLite "database is locked" errors in mc.db or vault.db.
Causes:
- Multiple writers to the same database simultaneously
- Long-running transaction holding a lock
- Process crashed while holding a write lock
Resolution:
1. Check for processes holding the database: fuser ~/workspace/mission-control/mc.db
2. mc.db uses WAL mode, which allows concurrent reads -- locks are usually from competing writes
3. If truly locked: identify and kill the offending process, then run sqlite3 ~/workspace/mission-control/mc.db "PRAGMA wal_checkpoint(TRUNCATE);" to checkpoint and truncate the WAL
4. vault.db is read-only on Luci (owned by Lucienne) -- if lock errors appear, something is incorrectly trying to write to it
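The WAL check and checkpoint can also be run from Python's built-in sqlite3 module. A minimal sketch with a hypothetical helper name, checkpoint_wal. The checkpoint pragma returns three integers, and a non-zero first value means the checkpoint could not complete because another connection was still holding the database.

```python
import sqlite3


def checkpoint_wal(db_path, busy_timeout_s=5.0):
    """Report the journal mode and run a TRUNCATE checkpoint.

    Returns (journal_mode, busy, log_frames, checkpointed_frames).
    """
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s)
    try:
        mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
        busy, log_frames, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE);"
        ).fetchone()
        return mode, busy, log_frames, checkpointed
    finally:
        conn.close()
```

If busy comes back non-zero, the process found with fuser is likely still attached and should be dealt with before retrying.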
Symptom: Tasks not executing on schedule. Dashboard shows stale "last run" times.
Causes:
- Scheduler cron entry missing from crontab
- Task file has enabled: false
- Task file has invalid YAML frontmatter
- Scheduler process crashing on tick
Resolution:
1. Check crontab: crontab -l -- it should contain a scheduler.py tick entry
2. Check task file frontmatter: valid id, schedule, and enabled: true
3. Manual tick: python3 ~/workspace/scheduler.py tick -- watch for errors
4. Check scheduler logs in the task_runs table of mc.db
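The frontmatter check in step 2 can be sketched without external dependencies. This is not the scheduler's own parser: it assumes flat key: value frontmatter delimited by --- lines, and parse_frontmatter and task_problems are hypothetical names. Nested YAML is out of scope for this sketch.

```python
REQUIRED_KEYS = ("id", "schedule", "enabled")


def parse_frontmatter(text):
    """Parse flat `key: value` frontmatter; return None if the block is malformed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            return None  # not a key: value line
        fields[key.strip()] = value.strip().strip("\"'")
    return None  # closing delimiter never found


def task_problems(text):
    """Return reasons the scheduler might skip this task file."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["frontmatter is missing or malformed"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not fields.get(key)]
    if fields.get("enabled", "").lower() == "false":
        problems.append("task is disabled (enabled: false)")
    return problems
```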