Research-only assessment of the luci-persistent tmux session (the orchestrator). No code/config changed. Date: 2026-05-19.
Scope: the persistent Claude Code session driving Mission Control coordination on Hetzner host openclaw. Files cited are ground truth; line refs are from the indexed snapshot.
CORRECTION (2026-05-19, post-eval): The Q4 conclusion "claude-mem is harmful, remove it" is WRONG — it read state from before Lucienne's fix. claude-mem was installed deliberately by Lucienne (Chief of Staff); her commit
2e082fe0neutralised claude-mem's attempt to disable the auto-memory layer — both run side by side now. claude-mem is tested and working (session capture, summaries, vector search) with a dashboard tile (Apps → Tools → Claude-Mem Viewer, :37701). claude-mem stays. The MC-3731 stuck-session incident was a real but since-resolved past event. Q1/Q2/Q3 findings (liveness, context, toolset) are unaffected and stand.
luci-persistent.service exists. systemctl list-units | grep luci shows only luci-dashboard, luci-heartbeat, luci-scheduler, luci-token-dashboard. The session lives inside the mc-tmux-server.service tmux server (~/.config/systemd/user/mc-tmux-server.service).persistent_luci.ensure_started() (~/workspace/mission-control/persistent_luci.py) is the start/heal primitive. It resolves a runtime profile, runs a preflight, builds the claude command via _command_for_profile() → _claude_cmd(), then calls mc_tmux.start_session(TMUX_TARGET, command, cwd, recreate=...). It records a row in runtime_sessions (SESSION_KEY = persistent:luci).ensure_started() sets recreate=True if the tmux session exists but _tmux_runtime_ready() is false (a session that exists but is not a live Claude runtime). status() marks the runtime row stale with close_reason=tmux_session_missing / tmux_session_not_runtime when tmux is gone or not a runtime.ensure_started() callers: persistent_luci_watchdog.py, persistent_luci.send() itself (every inbound message calls ensure_started(recreate=False) first), and ticket_runtime.py. So whenever a message is routed to the orchestrator, a dead session is lazily resurrected.~/workspace/tasks/persistent-luci-watchdog.md runs python3 scripts/persistent_luci_watchdog.py every 5 min (*/5 * * * *). That script calls only persistent_luci.watchdog_tick() — token management, not liveness. Its docstring: "Compact at 700k/900k. No /clear." It is NOT a keep-alive: watchdog_tick() returns noop / no_tmux_session if the session is dead and does nothing else.rotate-luci-session (daily 03:00 SAST, scripts/rotate-luci-session.sh) deliberately kills the claude process and clears the pinned session_id so it respawns fresh from CLAUDE.md + memory. This is intentional rotation, not crash recovery.luci-operator (every 30 min, luci_operator.py) audits tmux/service state and runtime sessions and "heals routine safe state drift directly" — this is the closest thing to a liveness watchdog, but it is a broad MC-health pass, not a dedicated persistent-session supervisor.mc-tmux-server.service restarts the tmux server on boot, but nothing on that path runs ensure_started() — the persistent window is only recreated lazily on the next send() or watchdog tick that calls ensure_started. Actually the watchdog never calls ensure_started (only watchdog_tick), so post-reboot the session stays dead until a message is sent or luci_operator heals it.send() and on luci_operator's 30-min sweep. After an OOM kill or crash, the orchestrator can be dead for up to 30 min with no alert (the systemd-watchdog only watches ccgram, luci-dashboard, luci-token-dashboard, price-watch, smart-money — not the persistent session).persistent-luci-watchdog only does token compaction. A reader expects it to also restart a dead session; it does not. No-token-footer / no-session both return noop.free -h: 7.6 GiB RAM, 5.1 GiB used, 457 MiB free; swap 9 GiB with 4.3 GiB already used. ps aux shows 8+ concurrent claude processes each 250–460 MB RSS. journalctl shows gunicorn workers SIGKILLed ("Perhaps out of memory?") on May 19 at 03:17 and 08:12. The recurring [operator:memory-pressure] tickets (MC-3330/3458/3541/3612/3641/3682/3708/3733 — roughly daily) confirm chronic pressure. The box is genuinely memory-starved; the orchestrator is one OOM victim among many.MemoryMax, no OOMScoreAdjust. Under pressure the kernel may kill the orchestrator as readily as a throwaway worker.persistent-luci-watchdog (or a new persistent-luci-keepalive task at */5): if persistent_luci.status() reports not-online, call ensure_started() and Telegram-alert. This closes the post-reboot / post-OOM dead window from ~30 min to ~5 min.luci-persistent health to systemd_watchdog.py's watched set (it currently only watches systemd units — add a runtime-session probe).claude[1m] workers. Cap concurrent ticket workers, or set OOMScoreAdjust=-500 on the tmux-server unit so the orchestrator is the last thing killed, not a coin-flip.Restart=always service — it must stay inside the shared mc-tmux-server so the CCGram Telegram lock and console attach keep working. Keep tmux; add supervision around it.persistent_luci.py, top of file): SOFT_TOKEN_THRESHOLD = 700_000, HARD_TOKEN_THRESHOLD = 900_000, WATCHDOG_COOLDOWN_SECONDS = 5*60.watchdog_tick() captures the tmux pane, runs parse_persistent_token_count() against TOKEN_FOOTER_PATTERN ((\d+\.?\d*)\s*k?\s*tokens) to read the live context size off the status footer. Logic:tokens >= hard (900k) → compact_persistent_luci() (after cooldown check).tokens >= soft (700k) → compact_persistent_luci() (after cooldown check)./compact — the watchdog script docstring explicitly says "No /clear." clear_persistent_luci() exists in the file but is not wired into watchdog_tick(). /clear is dead code on the auto path.compact_persistent_luci() sends /compact to the tmux pane. It defers (action: deferred, reason: pane_busy) if _pane_has_busy_indicator() is true — it will not interrupt a running turn.clear_persistent_luci() (manual / unused) records a _continuity_snapshot() (last 4000 chars of sanitized pane) into watchdog_history metadata before sending /clear, and will interrupt a busy pane.rotate-luci-session, daily 03:00) is the real context reset: kills the process, drops the pinned session_id, respawns fresh — Claude reboots from CLAUDE.md + memory/. --resume is deliberately avoided (pinning session_id makes context grow unbounded across days — the lesson from the 48-hour incident, recorded in the task doc).CLAUDE.md + the file-based memory/ dir + project_last_rotation.md (so fresh Luci knows the rotation was intentional). On /compact, continuity = Claude's own compaction summary. On /clear (manual only), the 4000-char snapshot is stored in MC metadata but nothing re-injects it — it is a forensic record, not a restore./compact at 700k is conservative but the cooldown can starve it. 5-min cooldown + watchdog every 5 min means at most one compact per 5 min. If the orchestrator generates tokens fast (large tool outputs, chat-thread context), it can climb from 700k past 900k and keep climbing while every tick is either cooldown or deferred (pane_busy). There is no hard stop / forced interrupt on the auto path — /clear is unwired./compact keeps Claude's summary but the orchestrator holds live coordination state (in-flight dispatches, ticket decisions, pending escalations) that a generic summarizer will not reliably preserve. The CLAUDE.md "Compact Instructions" block tells Claude what to preserve, but that is best-effort, not enforced./compact (lossy) until 03:00./clear snapshot is write-only. Continuity snapshot is stored but never restored on the fresh session — a missed opportunity.watchdog_tick(): soft 700k → /compact; if still ≥ a new "critical" threshold (e.g. 850k) after the next tick → forced interrupt + /compact; hard 900k → clear_persistent_luci() and re-inject the continuity snapshot into the fresh prompt. Right now hard just calls /compact again — same lossy action, no fallback./clear or rotation, prepend the stored snapshot (plus open ticket/dispatch state queried live from MC) into the first message of the fresh session. MC already has the ground-truth state — lean on the DB, not the transcript.~/.claude/settings-worker.json. persistent_luci.py defines WORKER_SETTINGS = ~/.claude/settings-worker.json; _claude_cmd() passes --settings <profile.settings_file or WORKER_SETTINGS>, and every runtime profile in models.py (_runtime_profile_catalog()) sets settings_file = ~/.claude/settings-worker.json. So the orchestrator runs on the worker settings, not the interactive settings.json.settings-worker.json has telegram@claude-plugins-official: false — mandatory so the orchestrator does not poll the Telegram bot and 409-conflict CCGram (the Telegram lock rule). Correct choice for that reason.bypassPermissions + dangerouslyDisableSandbox, Bash(*), full Edit/Write/Read including ~/.claude/**. --dangerously-skip-permissions is appended by _claude_cmd(). So the orchestrator has full unrestricted tool access — Bash, Edit, Write, Read, Grep, Glob, WebFetch, Task, the lot.--allowedTools / --disallowedTools restriction. _claude_cmd() passes --settings and --session-id but no tool allow/deny list. Contrast: a scheduler life-manager subprocess seen in ps aux runs with an explicit --disallowedTools Bash,Read,Write,Edit,... — the orchestrator gets no such scoping.~/.claude/skills/ are all available — full global skill set including heavy implementation skills (dev-loop, feature-dev, playwright, docx, pptx, xlsx, safair-dashboard, etc.).settings-worker.json enables code-review, github, code-simplifier, feature-dev, claude-md-management, skill-creator, context7, frontend-design, firecrawl, caveman, glm-plan-usage, context-mode, claude-mem.~/workspace/CLAUDE.md) is to coordinate and delegate — run the MC board, manage workers, dispatch Larry, report up. But it launches with the full unrestricted toolset of an implementing worker: Bash(*), Edit, Write, feature-dev, playwright, etc. Nothing structurally stops the orchestrator from doing heavy implementation itself instead of dispatching — which is exactly what the dev-loop / subagent discipline rules try to prevent behaviorally.settings-worker.json is shared by ticket workers, scheduler tasks, and the orchestrator. There is no orchestrator-specific profile — so you cannot tune the orchestrator's tool surface without affecting all workers.require-dev-loop hook. Hooks gate dev-loop invocation, not whether the orchestrator should be coding at all.settings-orchestrator.json and point the persistent runtime profile at it (the settings_file field already exists per-profile in models.py — this is a one-line catalog change). Keep telegram: false (lock requirement). Scope it toward coordination: keep Bash, Read, Grep, Glob, MC API access, memory, Task (for dispatch); consider gating or removing direct heavy-implementation skills (feature-dev, playwright, pptx/docx/xlsx builders) so the orchestrator's natural path is to create a ticket / dispatch a worker, not to open the editor.Edit/Write available (the orchestrator legitimately edits task files, CLAUDE.md, MC config) but consider that bulk code implementation should route to a worker. This is a judgement call — err toward keeping write access, since the orchestrator does light config work, but make heavy code skills opt-in.PreToolUse hint when the orchestrator session starts an Edit on a large source file → nudge "dispatch a worker ticket instead." Behavioral, not blocking.settings_file indirection already exists, so it is low-risk.Layer 1 — claude-mem (thedotmack, v13.2.0, installed 2026-05-19 05:47).
- Installed at ~/.claude/plugins/cache/thedotmack/claude-mem/13.2.0; enabled in both settings.json and settings-worker.json (claude-mem@thedotmack: true).
- It is a capture + recall system. Its hooks/hooks.json registers:
- Setup → version-check.
- SessionStart (startup|clear|compact) → starts a background worker service (worker-service.cjs via bun-runner.js).
- (further hooks capture transcript events.)
- Storage is ~/.claude-mem/ — a real, growing store: claude-mem.db is 9.9 MB (+ 4.3 MB WAL), plus a chroma/ vector store, corpora/, observer-sessions/, backups/, worker.pid, supervisor.json. The ~/.claude-mem/ dir also holds 80 MB+ of per-session .jsonl transcript copies. So claude-mem is already capturing aggressively — it ingests transcripts into a SQLite + Chroma vector DB and exposes a mem-search skill for recall.
- It installs its own skills (babysit, mem-search, learn-codebase, make-plan, knowledge-agent, timeline-report, ...) — a parallel skill surface.
- Critically: claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 in ~/.claude/settings.json. This disables Claude Code's built-in file-based auto-memory (Layer 4). So installing claude-mem already silently displaced the existing memory layer.
Layer 2 — vault.db. ~/workspace/Vault/vault.db → symlink to ~/workspace/PKA/Vault/vault.db. An FTS5 index built locally from SecondBrain markdown by the vault-index task. Queried by the search-brain skill: wiki-first direct read → vault.db FTS5 keyword search → QMD raw-markdown fallback. 100% local, sub-20 ms, no LLM, no network. Per the 2026-05-01 retrieval eval it covers ~95% of real corpus queries. This is the knowledge-base / SecondBrain retrieval layer.
Layer 3 — SecondBrain. Canonical markdown vault at ~/workspace/PKA/SecondBrain (also ~/workspace/cowork/SecondBrain). The orchestrator reads it via search-brain (wiki-first) and direct Read. vault.db is just its FTS index. Layers 2+3 are one system — a knowledge base + its index. Not a conflict.
Layer 4 — file-based auto-memory. ~/.claude/projects/-home-lucienne-workspace/memory/ — MEMORY.md (12 KB, updated daily) + 78 individual feedback_* / reference_* / project_* / user_* files. This is Claude Code's native auto-memory: cross-session learnings, corrections, operating preferences. As of 2026-05-19 it is DISABLED because claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. The files still exist and are still injected as context (the MEMORY.md block is in the system prompt), but Claude's automatic writing to them is off.
The MC-3731 / MC-3740 bug. MC-3740 ("luci-persistent stuck on claude-mem memory-system decision", status inbox, surfaced 2026-05-19 ~10:00): the persistent session is parked on an interactive prompt — claude-mem's install asked "keep built-in / keep claude-mem / run both" and the orchestrator (non-TTY, driven via tmux send) cannot answer it. The claude-mem worker also "not started (non-TTY)". Impact: while parked, the orchestrator inbox drain is blocked — 534 items pending in orchestrator_inbox and growing, digests not delivered. MC-3729 patched the symptom (pickup-cycle no longer blocks 60s on the drain) but the backlog will not clear until the session is back at a prompt. This is an active outage caused directly by claude-mem.
CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 without anyone deciding that; (c) runs a background worker-service.cjs per session — extra processes on an already OOM-starved box (Q1); (d) adds a parallel skill surface that overlaps existing skills.feedback_memory_system_churn.md and feedback_no_dialectic_memory_layer.md files explicitly record a prior decision: do not pile on new memory infra; the existing memory + vault + skills stack is sufficient. claude-mem violates that decision. feedback_no_local_fork_third_party.md (May 18) is in the same spirit.Remove claude-mem. Concretely:
1. Disable the plugin: set claude-mem@thedotmack: false in both settings.json and settings-worker.json.
2. Re-enable native auto-memory: remove CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 from settings.json (it is NOT in settings-worker.json — but the orchestrator inherits behavior; verify after re-enable).
3. Stop/clean the claude-mem worker service and ~/.claude-mem/ store (back it up first if any unique capture is worth keeping — likely not).
4. Unstick MC-3740: the orchestrator is parked on the claude-mem prompt — answer it (or kill+rotate the session) so the 534-item orchestrator_inbox backlog drains.
Target end-state — three layers, clearly separated:
- vault.db + SecondBrain → knowledge base (Elmar's projects/people/decisions). Query via search-brain. Keep as-is.
- File auto-memory (memory/MEMORY.md + files) → the agent's operating learnings & corrections. Re-enable native writing. Keep — it is git-tracked, human-curated, prompt-injected, and already proven.
- MC database (mc.db runtime_sessions metadata, activity_log, ticket history) → operational/continuity state (watchdog history, continuity snapshots, dispatch state). This is the right place for orchestrator working-state continuity (see Q2).
claude-mem is redundant with this stack, is not safe to keep given MC-3740, and contradicts a prior explicit decision. It should go.
| # | Priority | Fix | Why |
|---|---|---|---|
| 1 | P0 | Remove claude-mem: disable plugin in both settings files, drop CLAUDE_CODE_DISABLE_AUTO_MEMORY=1, stop worker, clean ~/.claude-mem/. Unstick MC-3740 so the 534-item orchestrator_inbox backlog drains. |
Active outage; silently displaced working memory; OOM contributor; violates prior decision. |
| 2 | P0 | Add real liveness supervision for luci-persistent: keep-alive check (every ~5 min) that calls ensure_started() + Telegram-alerts if the session is dead; add it to systemd_watchdog.py's watched set. |
Today a dead orchestrator (OOM/crash/reboot) can sit dead up to 30 min with no alert. |
| 3 | P1 | Fix OOM root cause: cap concurrent heavy claude[1m] workers and/or add OOMScoreAdjust so the orchestrator is killed last; revisit box RAM (7.6 GiB, 457 MiB free, swap 47% used). |
Daily [operator:memory-pressure] tickets + gunicorn SIGKILLs; chronic, not transient. |
| 4 | P1 | Wire a real context-escalation ladder into watchdog_tick(): soft→/compact, critical→forced-interrupt+/compact, hard→clear_persistent_luci() with continuity-snapshot re-injection. Adaptive (shorter) cooldown above soft threshold. |
Today soft and hard both just /compact; /clear + the continuity snapshot are dead code on the auto path. |
| 5 | P2 | Give the orchestrator its own settings-orchestrator.json (the per-profile settings_file hook already exists in models.py). Scope toward coordination; gate heavy-implementation skills so the natural path is dispatch, not self-implement. |
Orchestrator currently launches with a full unrestricted worker toolset — role/capability mismatch, and self-implementation worsens the single-window token problem. |
| 6 | P2 | Make orchestrator continuity load-bearing: on rotation//clear, re-inject continuity snapshot + live MC ticket/dispatch state into the fresh session's first prompt. |
Rotation/compaction currently lose working state; the continuity snapshot is captured but never restored. |
| 7 | P2 | Reconcile MC-3718 chat-thread direction with the single-window context budget — route chat turns through a separate short-lived session, or budget for more frequent stateful rotation. | A high-traffic chat surface merged into the one persistent window accelerates token growth and state loss. |