Luci Orchestrator — Robustness & Design Evaluation

Research-only assessment of the luci-persistent tmux session (the orchestrator). No code/config changed. Date: 2026-05-19.

Scope: the persistent Claude Code session driving Mission Control coordination on Hetzner host openclaw. Files cited are ground truth; line refs are from the indexed snapshot.

CORRECTION (2026-05-19, post-eval): The Q4 conclusion "claude-mem is harmful, remove it" is WRONG — it read state from before Lucienne's fix. claude-mem was installed deliberately by Lucienne (Chief of Staff); her commit 2e082fe0 neutralised claude-mem's attempt to disable the auto-memory layer — both run side by side now. claude-mem is tested and working (session capture, summaries, vector search) with a dashboard tile (Apps → Tools → Claude-Mem Viewer, :37701). claude-mem stays. The MC-3731 stuck-session incident was a real but since-resolved past event. Q1/Q2/Q3 findings (liveness, context, toolset) are unaffected and stand.

Q1 — Staying Alive

HOW IT WORKS NOW

It is a tmux session, not a systemd service. No luci-persistent.service exists. systemctl list-units | grep luci shows only luci-dashboard, luci-heartbeat, luci-scheduler, luci-token-dashboard. The session lives inside the mc-tmux-server.service tmux server (~/.config/systemd/user/mc-tmux-server.service).
persistent_luci.ensure_started() (~/workspace/mission-control/persistent_luci.py) is the start/heal primitive. It resolves a runtime profile, runs a preflight, builds the claude command via _command_for_profile() → _claude_cmd(), then calls mc_tmux.start_session(TMUX_TARGET, command, cwd, recreate=...). It records a row in runtime_sessions (SESSION_KEY = persistent:luci).
Self-heal on degraded session: ensure_started() sets recreate=True if the tmux session exists but _tmux_runtime_ready() is false (a session that exists but is not a live Claude runtime). status() marks the runtime row stale with close_reason=tmux_session_missing / tmux_session_not_runtime when tmux is gone or not a runtime.
ensure_started() callers: persistent_luci_watchdog.py, persistent_luci.send() itself (every inbound message calls ensure_started(recreate=False) first), and ticket_runtime.py. So whenever a message is routed to the orchestrator, a dead session is lazily resurrected.
Watchdog task: ~/workspace/tasks/persistent-luci-watchdog.md runs python3 scripts/persistent_luci_watchdog.py every 5 min (*/5 * * * *). That script calls only persistent_luci.watchdog_tick() — token management, not liveness. Its docstring: "Compact at 700k/900k. No /clear." It is NOT a keep-alive: watchdog_tick() returns noop / no_tmux_session if the session is dead and does nothing else.
rotate-luci-session (daily 03:00 SAST, scripts/rotate-luci-session.sh) deliberately kills the claude process and clears the pinned session_id so it respawns fresh from CLAUDE.md + memory. This is intentional rotation, not crash recovery.
luci-operator (every 30 min, luci_operator.py) audits tmux/service state and runtime sessions and "heals routine safe state drift directly" — this is the closest thing to a liveness watchdog, but it is a broad MC-health pass, not a dedicated persistent-session supervisor.
Reboot: mc-tmux-server.service restarts the tmux server on boot, but nothing on that path runs ensure_started() — the persistent window is only recreated lazily on the next send() or watchdog tick that calls ensure_started. Actually the watchdog never calls ensure_started (only watchdog_tick), so post-reboot the session stays dead until a message is sent or luci_operator heals it.

GAPS / RISKS

No dedicated liveness supervisor. Resurrection is lazy and incidental — it piggybacks on send() and on luci_operator's 30-min sweep. After an OOM kill or crash, the orchestrator can be dead for up to 30 min with no alert (the systemd-watchdog only watches ccgram, luci-dashboard, luci-token-dashboard, price-watch, smart-money — not the persistent session).
The "watchdog" is misnamed. persistent-luci-watchdog only does token compaction. A reader expects it to also restart a dead session; it does not. No-token-footer / no-session both return noop.
OOM risk is real and active. free -h: 7.6 GiB RAM, 5.1 GiB used, 457 MiB free; swap 9 GiB with 4.3 GiB already used. ps aux shows 8+ concurrent claude processes each 250–460 MB RSS. journalctl shows gunicorn workers SIGKILLed ("Perhaps out of memory?") on May 19 at 03:17 and 08:12. The recurring [operator:memory-pressure] tickets (MC-3330/3458/3541/3612/3641/3682/3708/3733 — roughly daily) confirm chronic pressure. The box is genuinely memory-starved; the orchestrator is one OOM victim among many.
No memory cap / oom_score tuning. The persistent claude has no MemoryMax, no OOMScoreAdjust. Under pressure the kernel may kill the orchestrator as readily as a throwaway worker.

RECOMMENDATION

Add a real liveness check to persistent-luci-watchdog (or a new persistent-luci-keepalive task at */5): if persistent_luci.status() reports not-online, call ensure_started() and Telegram-alert. This closes the post-reboot / post-OOM dead window from ~30 min to ~5 min.
Add luci-persistent health to systemd_watchdog.py's watched set (it currently only watches systemd units — add a runtime-session probe).
Address OOM at the root: the box needs more RAM or fewer concurrent heavy claude[1m] workers. Cap concurrent ticket workers, or set OOMScoreAdjust=-500 on the tmux-server unit so the orchestrator is the last thing killed, not a coin-flip.
Do NOT convert the persistent session to a bare systemd Restart=always service — it must stay inside the shared mc-tmux-server so the CCGram Telegram lock and console attach keep working. Keep tmux; add supervision around it.

Q2 — Context Management

HOW IT WORKS NOW

Thresholds confirmed (persistent_luci.py, top of file): SOFT_TOKEN_THRESHOLD = 700_000, HARD_TOKEN_THRESHOLD = 900_000, WATCHDOG_COOLDOWN_SECONDS = 5*60.
watchdog_tick() captures the tmux pane, runs parse_persistent_token_count() against TOKEN_FOOTER_PATTERN ((\d+\.?\d*)\s*k?\s*tokens) to read the live context size off the status footer. Logic:
tokens >= hard (900k) → compact_persistent_luci() (after cooldown check).
tokens >= soft (700k) → compact_persistent_luci() (after cooldown check).
Note: both soft and hard call /compact — the watchdog script docstring explicitly says "No /clear." clear_persistent_luci() exists in the file but is not wired into watchdog_tick(). /clear is dead code on the auto path.
compact_persistent_luci() sends /compact to the tmux pane. It defers (action: deferred, reason: pane_busy) if _pane_has_busy_indicator() is true — it will not interrupt a running turn.
clear_persistent_luci() (manual / unused) records a _continuity_snapshot() (last 4000 chars of sanitized pane) into watchdog_history metadata before sending /clear, and will interrupt a busy pane.
Session rotation (rotate-luci-session, daily 03:00) is the real context reset: kills the process, drops the pinned session_id, respawns fresh — Claude reboots from CLAUDE.md + memory/. --resume is deliberately avoided (pinning session_id makes context grow unbounded across days — the lesson from the 48-hour incident, recorded in the task doc).
Continuity: on rotation, continuity = whatever is in CLAUDE.md + the file-based memory/ dir + project_last_rotation.md (so fresh Luci knows the rotation was intentional). On /compact, continuity = Claude's own compaction summary. On /clear (manual only), the 4000-char snapshot is stored in MC metadata but nothing re-injects it — it is a forensic record, not a restore.

GAPS / RISKS

/compact at 700k is conservative but the cooldown can starve it. 5-min cooldown + watchdog every 5 min means at most one compact per 5 min. If the orchestrator generates tokens fast (large tool outputs, chat-thread context), it can climb from 700k past 900k and keep climbing while every tick is either cooldown or deferred (pane_busy). There is no hard stop / forced interrupt on the auto path — /clear is unwired.
Compaction loses orchestrator working state. /compact keeps Claude's summary but the orchestrator holds live coordination state (in-flight dispatches, ticket decisions, pending escalations) that a generic summarizer will not reliably preserve. The CLAUDE.md "Compact Instructions" block tells Claude what to preserve, but that is best-effort, not enforced.
Chat-thread context (MC-3718/3754) makes it worse. MC-3718's design adds a persistent chat thread + orchestrator-initiated intent decomposition — every Elmar composer message and every digest now flows into the same persistent context window. More inbound traffic into a single long-lived window = faster token growth = more frequent compaction = more state loss. The chat-first direction is on a collision course with the single-window design.
No mid-day rotation safety valve. Rotation is once daily. If context blows up at 14:00, the only tools are /compact (lossy) until 03:00.
/clear snapshot is write-only. Continuity snapshot is stored but never restored on the fresh session — a missed opportunity.

RECOMMENDATION

Wire a real escalation ladder into watchdog_tick(): soft 700k → /compact; if still ≥ a new "critical" threshold (e.g. 850k) after the next tick → forced interrupt + /compact; hard 900k → clear_persistent_luci() and re-inject the continuity snapshot into the fresh prompt. Right now hard just calls /compact again — same lossy action, no fallback.
Make the continuity snapshot load-bearing: on /clear or rotation, prepend the stored snapshot (plus open ticket/dispatch state queried live from MC) into the first message of the fresh session. MC already has the ground-truth state — lean on the DB, not the transcript.
Treat MC-3718 chat-thread as a context-budget decision, not just a UX one: either (a) route chat turns through a separate short-lived session and only escalate to the persistent orchestrator on demand, or (b) accept more frequent rotation and make rotation cheap + stateful. Do not merge a high-traffic chat surface into the single persistent window without a budget plan.
Shorten the watchdog cooldown to ~2 min when tokens are above soft threshold (adaptive cooldown), so a fast climb gets more compaction attempts.

Q3 — Tools & Skills

HOW IT WORKS NOW

Settings file: the persistent session launches with ~/.claude/settings-worker.json. persistent_luci.py defines WORKER_SETTINGS = ~/.claude/settings-worker.json; _claude_cmd() passes --settings <profile.settings_file or WORKER_SETTINGS>, and every runtime profile in models.py (_runtime_profile_catalog()) sets settings_file = ~/.claude/settings-worker.json. So the orchestrator runs on the worker settings, not the interactive settings.json.
Why worker settings: settings-worker.json has telegram@claude-plugins-official: false — mandatory so the orchestrator does not poll the Telegram bot and 409-conflict CCGram (the Telegram lock rule). Correct choice for that reason.
Permissions: bypassPermissions + dangerouslyDisableSandbox, Bash(*), full Edit/Write/Read including ~/.claude/**. --dangerously-skip-permissions is appended by _claude_cmd(). So the orchestrator has full unrestricted tool access — Bash, Edit, Write, Read, Grep, Glob, WebFetch, Task, the lot.
No --allowedTools / --disallowedTools restriction. _claude_cmd() passes --settings and --session-id but no tool allow/deny list. Contrast: a scheduler life-manager subprocess seen in ps aux runs with an explicit --disallowedTools Bash,Read,Write,Edit,... — the orchestrator gets no such scoping.
Skills: 150 entries in ~/.claude/skills/ are all available — full global skill set including heavy implementation skills (dev-loop, feature-dev, playwright, docx, pptx, xlsx, safair-dashboard, etc.).
Plugins: settings-worker.json enables code-review, github, code-simplifier, feature-dev, claude-md-management, skill-creator, context7, frontend-design, firecrawl, caveman, glm-plan-usage, context-mode, claude-mem.

GAPS / RISKS

Role/capability mismatch. The orchestrator's job (per ~/workspace/CLAUDE.md) is to coordinate and delegate — run the MC board, manage workers, dispatch Larry, report up. But it launches with the full unrestricted toolset of an implementing worker: Bash(*), Edit, Write, feature-dev, playwright, etc. Nothing structurally stops the orchestrator from doing heavy implementation itself instead of dispatching — which is exactly what the dev-loop / subagent discipline rules try to prevent behaviorally.
Implementation in the persistent window is a context-budget hazard. Heavy implementation (big file reads, test runs, browser automation) dumps large tool outputs into the single long-lived context window — directly worsening the Q2 token-growth problem. A worker that does this gets thrown away; the orchestrator does not.
Worker vs orchestrator settings are the same file. settings-worker.json is shared by ticket workers, scheduler tasks, and the orchestrator. There is no orchestrator-specific profile — so you cannot tune the orchestrator's tool surface without affecting all workers.
No tool gate enforces "delegate, don't implement." It is purely a behavioral instruction in CLAUDE.md + the require-dev-loop hook. Hooks gate dev-loop invocation, not whether the orchestrator should be coding at all.

RECOMMENDATION

Create a dedicated settings-orchestrator.json and point the persistent runtime profile at it (the settings_file field already exists per-profile in models.py — this is a one-line catalog change). Keep telegram: false (lock requirement). Scope it toward coordination: keep Bash, Read, Grep, Glob, MC API access, memory, Task (for dispatch); consider gating or removing direct heavy-implementation skills (feature-dev, playwright, pptx/docx/xlsx builders) so the orchestrator's natural path is to create a ticket / dispatch a worker, not to open the editor.
Keep Edit/Write available (the orchestrator legitimately edits task files, CLAUDE.md, MC config) but consider that bulk code implementation should route to a worker. This is a judgement call — err toward keeping write access, since the orchestrator does light config work, but make heavy code skills opt-in.
Add a soft guardrail: a PreToolUse hint when the orchestrator session starts an Edit on a large source file → nudge "dispatch a worker ticket instead." Behavioral, not blocking.
This is a Tier-2 change at most; the per-profile settings_file indirection already exists, so it is low-risk.

Q4 — Memory System Evaluation (Four Layers)

HOW IT WORKS NOW

Layer 1 — claude-mem (thedotmack, v13.2.0, installed 2026-05-19 05:47). - Installed at ~/.claude/plugins/cache/thedotmack/claude-mem/13.2.0; enabled in both settings.json and settings-worker.json (claude-mem@thedotmack: true). - It is a capture + recall system. Its hooks/hooks.json registers: - Setup → version-check. - SessionStart (startup|clear|compact) → starts a background worker service (worker-service.cjs via bun-runner.js). - (further hooks capture transcript events.) - Storage is ~/.claude-mem/ — a real, growing store: claude-mem.db is 9.9 MB (+ 4.3 MB WAL), plus a chroma/ vector store, corpora/, observer-sessions/, backups/, worker.pid, supervisor.json. The ~/.claude-mem/ dir also holds 80 MB+ of per-session .jsonl transcript copies. So claude-mem is already capturing aggressively — it ingests transcripts into a SQLite + Chroma vector DB and exposes a mem-search skill for recall. - It installs its own skills (babysit, mem-search, learn-codebase, make-plan, knowledge-agent, timeline-report, ...) — a parallel skill surface. - Critically: claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 in ~/.claude/settings.json. This disables Claude Code's built-in file-based auto-memory (Layer 4). So installing claude-mem already silently displaced the existing memory layer.

Layer 2 — vault.db. ~/workspace/Vault/vault.db → symlink to ~/workspace/PKA/Vault/vault.db. An FTS5 index built locally from SecondBrain markdown by the vault-index task. Queried by the search-brain skill: wiki-first direct read → vault.db FTS5 keyword search → QMD raw-markdown fallback. 100% local, sub-20 ms, no LLM, no network. Per the 2026-05-01 retrieval eval it covers ~95% of real corpus queries. This is the knowledge-base / SecondBrain retrieval layer.

Layer 3 — SecondBrain. Canonical markdown vault at ~/workspace/PKA/SecondBrain (also ~/workspace/cowork/SecondBrain). The orchestrator reads it via search-brain (wiki-first) and direct Read. vault.db is just its FTS index. Layers 2+3 are one system — a knowledge base + its index. Not a conflict.

Layer 4 — file-based auto-memory. ~/.claude/projects/-home-lucienne-workspace/memory/ — MEMORY.md (12 KB, updated daily) + 78 individual feedback_* / reference_* / project_* / user_* files. This is Claude Code's native auto-memory: cross-session learnings, corrections, operating preferences. As of 2026-05-19 it is DISABLED because claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. The files still exist and are still injected as context (the MEMORY.md block is in the system prompt), but Claude's automatic writing to them is off.

The MC-3731 / MC-3740 bug. MC-3740 ("luci-persistent stuck on claude-mem memory-system decision", status inbox, surfaced 2026-05-19 ~10:00): the persistent session is parked on an interactive prompt — claude-mem's install asked "keep built-in / keep claude-mem / run both" and the orchestrator (non-TTY, driven via tmux send) cannot answer it. The claude-mem worker also "not started (non-TTY)". Impact: while parked, the orchestrator inbox drain is blocked — 534 items pending in orchestrator_inbox and growing, digests not delivered. MC-3729 patched the symptom (pickup-cycle no longer blocks 60s on the drain) but the backlog will not clear until the session is back at a prompt. This is an active outage caused directly by claude-mem.

ASSESSMENT — Do the four layers conflict?

Layers 2+3 (vault.db + SecondBrain): one system, no conflict. Knowledge base + FTS index. Keep.
Layer 4 (file auto-memory): distinct purpose, no conflict with 2+3. SecondBrain = Elmar's knowledge/projects. File-memory = the agent's operating learnings ("don't ask permissions", "deliver files off Luci"). Different data, different purpose. They are complementary, not redundant.
Layer 1 (claude-mem): redundant AND harmful here.
Redundant: claude-mem's value proposition is "capture transcript history + semantic recall." But file-memory already captures agent learnings (curated, human-readable, git-tracked, injected into the prompt), and vault.db already does fast local retrieval over the knowledge base. claude-mem adds a third store (SQLite + Chroma vectors) covering ground the other two already cover, with worse properties: opaque binary DB, not git-tracked, not human-curated, grows unbounded (already 9.9 MB DB + 80 MB transcript copies in 12 hours).
Harmful: it (a) caused an active outage (MC-3740, 534-item backlog) by parking the non-TTY orchestrator on an interactive prompt; (b) silently disabled the working file-memory layer via CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 without anyone deciding that; (c) runs a background worker-service.cjs per session — extra processes on an already OOM-starved box (Q1); (d) adds a parallel skill surface that overlaps existing skills.
The memory feedback_memory_system_churn.md and feedback_no_dialectic_memory_layer.md files explicitly record a prior decision: do not pile on new memory infra; the existing memory + vault + skills stack is sufficient. claude-mem violates that decision. feedback_no_local_fork_third_party.md (May 18) is in the same spirit.

RECOMMENDATION

Remove claude-mem. Concretely: 1. Disable the plugin: set claude-mem@thedotmack: false in both settings.json and settings-worker.json. 2. Re-enable native auto-memory: remove CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 from settings.json (it is NOT in settings-worker.json — but the orchestrator inherits behavior; verify after re-enable). 3. Stop/clean the claude-mem worker service and ~/.claude-mem/ store (back it up first if any unique capture is worth keeping — likely not). 4. Unstick MC-3740: the orchestrator is parked on the claude-mem prompt — answer it (or kill+rotate the session) so the 534-item orchestrator_inbox backlog drains.

Target end-state — three layers, clearly separated: - vault.db + SecondBrain → knowledge base (Elmar's projects/people/decisions). Query via search-brain. Keep as-is. - File auto-memory (memory/MEMORY.md + files) → the agent's operating learnings & corrections. Re-enable native writing. Keep — it is git-tracked, human-curated, prompt-injected, and already proven. - MC database (mc.db runtime_sessions metadata, activity_log, ticket history) → operational/continuity state (watchdog history, continuity snapshots, dispatch state). This is the right place for orchestrator working-state continuity (see Q2).

claude-mem is redundant with this stack, is not safe to keep given MC-3740, and contradicts a prior explicit decision. It should go.

Prioritised Recommended Fixes (each → a ticket)

#	Priority	Fix	Why
1	P0	Remove claude-mem: disable plugin in both settings files, drop `CLAUDE_CODE_DISABLE_AUTO_MEMORY=1`, stop worker, clean `~/.claude-mem/`. Unstick MC-3740 so the 534-item orchestrator_inbox backlog drains.	Active outage; silently displaced working memory; OOM contributor; violates prior decision.
2	P0	Add real liveness supervision for `luci-persistent`: keep-alive check (every ~5 min) that calls `ensure_started()` + Telegram-alerts if the session is dead; add it to `systemd_watchdog.py`'s watched set.	Today a dead orchestrator (OOM/crash/reboot) can sit dead up to 30 min with no alert.
3	P1	Fix OOM root cause: cap concurrent heavy `claude[1m]` workers and/or add `OOMScoreAdjust` so the orchestrator is killed last; revisit box RAM (7.6 GiB, 457 MiB free, swap 47% used).	Daily `[operator:memory-pressure]` tickets + gunicorn SIGKILLs; chronic, not transient.
4	P1	Wire a real context-escalation ladder into `watchdog_tick()`: soft→`/compact`, critical→forced-interrupt+`/compact`, hard→`clear_persistent_luci()` with continuity-snapshot re-injection. Adaptive (shorter) cooldown above soft threshold.	Today soft and hard both just `/compact`; `/clear` + the continuity snapshot are dead code on the auto path.
5	P2	Give the orchestrator its own `settings-orchestrator.json` (the per-profile `settings_file` hook already exists in `models.py`). Scope toward coordination; gate heavy-implementation skills so the natural path is dispatch, not self-implement.	Orchestrator currently launches with a full unrestricted worker toolset — role/capability mismatch, and self-implementation worsens the single-window token problem.
6	P2	Make orchestrator continuity load-bearing: on rotation/`/clear`, re-inject continuity snapshot + live MC ticket/dispatch state into the fresh session's first prompt.	Rotation/compaction currently lose working state; the continuity snapshot is captured but never restored.
7	P2	Reconcile MC-3718 chat-thread direction with the single-window context budget — route chat turns through a separate short-lived session, or budget for more frequent stateful rotation.	A high-traffic chat surface merged into the one persistent window accelerates token growth and state loss.