⌂ Home ☷ Board

Luci Orchestrator — Robustness & Design Evaluation

Research-only assessment of the luci-persistent tmux session (the orchestrator). No code/config changed. Date: 2026-05-19.

Scope: the persistent Claude Code session driving Mission Control coordination on Hetzner host openclaw. Files cited are ground truth; line refs are from the indexed snapshot.

CORRECTION (2026-05-19, post-eval): The Q4 conclusion "claude-mem is harmful, remove it" is WRONG — it read state from before Lucienne's fix. claude-mem was installed deliberately by Lucienne (Chief of Staff); her commit 2e082fe0 neutralised claude-mem's attempt to disable the auto-memory layer — both run side by side now. claude-mem is tested and working (session capture, summaries, vector search) with a dashboard tile (Apps → Tools → Claude-Mem Viewer, :37701). claude-mem stays. The MC-3731 stuck-session incident was a real but since-resolved past event. Q1/Q2/Q3 findings (liveness, context, toolset) are unaffected and stand.


Q1 — Staying Alive

HOW IT WORKS NOW

GAPS / RISKS

  1. No dedicated liveness supervisor. Resurrection is lazy and incidental — it piggybacks on send() and on luci_operator's 30-min sweep. After an OOM kill or crash, the orchestrator can be dead for up to 30 min with no alert (the systemd-watchdog only watches ccgram, luci-dashboard, luci-token-dashboard, price-watch, smart-moneynot the persistent session).
  2. The "watchdog" is misnamed. persistent-luci-watchdog only does token compaction. A reader expects it to also restart a dead session; it does not. No-token-footer / no-session both return noop.
  3. OOM risk is real and active. free -h: 7.6 GiB RAM, 5.1 GiB used, 457 MiB free; swap 9 GiB with 4.3 GiB already used. ps aux shows 8+ concurrent claude processes each 250–460 MB RSS. journalctl shows gunicorn workers SIGKILLed ("Perhaps out of memory?") on May 19 at 03:17 and 08:12. The recurring [operator:memory-pressure] tickets (MC-3330/3458/3541/3612/3641/3682/3708/3733 — roughly daily) confirm chronic pressure. The box is genuinely memory-starved; the orchestrator is one OOM victim among many.
  4. No memory cap / oom_score tuning. The persistent claude has no MemoryMax, no OOMScoreAdjust. Under pressure the kernel may kill the orchestrator as readily as a throwaway worker.

RECOMMENDATION


Q2 — Context Management

HOW IT WORKS NOW

GAPS / RISKS

  1. /compact at 700k is conservative but the cooldown can starve it. 5-min cooldown + watchdog every 5 min means at most one compact per 5 min. If the orchestrator generates tokens fast (large tool outputs, chat-thread context), it can climb from 700k past 900k and keep climbing while every tick is either cooldown or deferred (pane_busy). There is no hard stop / forced interrupt on the auto path — /clear is unwired.
  2. Compaction loses orchestrator working state. /compact keeps Claude's summary but the orchestrator holds live coordination state (in-flight dispatches, ticket decisions, pending escalations) that a generic summarizer will not reliably preserve. The CLAUDE.md "Compact Instructions" block tells Claude what to preserve, but that is best-effort, not enforced.
  3. Chat-thread context (MC-3718/3754) makes it worse. MC-3718's design adds a persistent chat thread + orchestrator-initiated intent decomposition — every Elmar composer message and every digest now flows into the same persistent context window. More inbound traffic into a single long-lived window = faster token growth = more frequent compaction = more state loss. The chat-first direction is on a collision course with the single-window design.
  4. No mid-day rotation safety valve. Rotation is once daily. If context blows up at 14:00, the only tools are /compact (lossy) until 03:00.
  5. /clear snapshot is write-only. Continuity snapshot is stored but never restored on the fresh session — a missed opportunity.

RECOMMENDATION


Q3 — Tools & Skills

HOW IT WORKS NOW

GAPS / RISKS

  1. Role/capability mismatch. The orchestrator's job (per ~/workspace/CLAUDE.md) is to coordinate and delegate — run the MC board, manage workers, dispatch Larry, report up. But it launches with the full unrestricted toolset of an implementing worker: Bash(*), Edit, Write, feature-dev, playwright, etc. Nothing structurally stops the orchestrator from doing heavy implementation itself instead of dispatching — which is exactly what the dev-loop / subagent discipline rules try to prevent behaviorally.
  2. Implementation in the persistent window is a context-budget hazard. Heavy implementation (big file reads, test runs, browser automation) dumps large tool outputs into the single long-lived context window — directly worsening the Q2 token-growth problem. A worker that does this gets thrown away; the orchestrator does not.
  3. Worker vs orchestrator settings are the same file. settings-worker.json is shared by ticket workers, scheduler tasks, and the orchestrator. There is no orchestrator-specific profile — so you cannot tune the orchestrator's tool surface without affecting all workers.
  4. No tool gate enforces "delegate, don't implement." It is purely a behavioral instruction in CLAUDE.md + the require-dev-loop hook. Hooks gate dev-loop invocation, not whether the orchestrator should be coding at all.

RECOMMENDATION


Q4 — Memory System Evaluation (Four Layers)

HOW IT WORKS NOW

Layer 1 — claude-mem (thedotmack, v13.2.0, installed 2026-05-19 05:47). - Installed at ~/.claude/plugins/cache/thedotmack/claude-mem/13.2.0; enabled in both settings.json and settings-worker.json (claude-mem@thedotmack: true). - It is a capture + recall system. Its hooks/hooks.json registers: - Setup → version-check. - SessionStart (startup|clear|compact) → starts a background worker service (worker-service.cjs via bun-runner.js). - (further hooks capture transcript events.) - Storage is ~/.claude-mem/ — a real, growing store: claude-mem.db is 9.9 MB (+ 4.3 MB WAL), plus a chroma/ vector store, corpora/, observer-sessions/, backups/, worker.pid, supervisor.json. The ~/.claude-mem/ dir also holds 80 MB+ of per-session .jsonl transcript copies. So claude-mem is already capturing aggressively — it ingests transcripts into a SQLite + Chroma vector DB and exposes a mem-search skill for recall. - It installs its own skills (babysit, mem-search, learn-codebase, make-plan, knowledge-agent, timeline-report, ...) — a parallel skill surface. - Critically: claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 in ~/.claude/settings.json. This disables Claude Code's built-in file-based auto-memory (Layer 4). So installing claude-mem already silently displaced the existing memory layer.

Layer 2 — vault.db. ~/workspace/Vault/vault.db → symlink to ~/workspace/PKA/Vault/vault.db. An FTS5 index built locally from SecondBrain markdown by the vault-index task. Queried by the search-brain skill: wiki-first direct read → vault.db FTS5 keyword search → QMD raw-markdown fallback. 100% local, sub-20 ms, no LLM, no network. Per the 2026-05-01 retrieval eval it covers ~95% of real corpus queries. This is the knowledge-base / SecondBrain retrieval layer.

Layer 3 — SecondBrain. Canonical markdown vault at ~/workspace/PKA/SecondBrain (also ~/workspace/cowork/SecondBrain). The orchestrator reads it via search-brain (wiki-first) and direct Read. vault.db is just its FTS index. Layers 2+3 are one system — a knowledge base + its index. Not a conflict.

Layer 4 — file-based auto-memory. ~/.claude/projects/-home-lucienne-workspace/memory/MEMORY.md (12 KB, updated daily) + 78 individual feedback_* / reference_* / project_* / user_* files. This is Claude Code's native auto-memory: cross-session learnings, corrections, operating preferences. As of 2026-05-19 it is DISABLED because claude-mem set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. The files still exist and are still injected as context (the MEMORY.md block is in the system prompt), but Claude's automatic writing to them is off.

The MC-3731 / MC-3740 bug. MC-3740 ("luci-persistent stuck on claude-mem memory-system decision", status inbox, surfaced 2026-05-19 ~10:00): the persistent session is parked on an interactive prompt — claude-mem's install asked "keep built-in / keep claude-mem / run both" and the orchestrator (non-TTY, driven via tmux send) cannot answer it. The claude-mem worker also "not started (non-TTY)". Impact: while parked, the orchestrator inbox drain is blocked — 534 items pending in orchestrator_inbox and growing, digests not delivered. MC-3729 patched the symptom (pickup-cycle no longer blocks 60s on the drain) but the backlog will not clear until the session is back at a prompt. This is an active outage caused directly by claude-mem.

ASSESSMENT — Do the four layers conflict?

RECOMMENDATION

Remove claude-mem. Concretely: 1. Disable the plugin: set claude-mem@thedotmack: false in both settings.json and settings-worker.json. 2. Re-enable native auto-memory: remove CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 from settings.json (it is NOT in settings-worker.json — but the orchestrator inherits behavior; verify after re-enable). 3. Stop/clean the claude-mem worker service and ~/.claude-mem/ store (back it up first if any unique capture is worth keeping — likely not). 4. Unstick MC-3740: the orchestrator is parked on the claude-mem prompt — answer it (or kill+rotate the session) so the 534-item orchestrator_inbox backlog drains.

Target end-state — three layers, clearly separated: - vault.db + SecondBrain → knowledge base (Elmar's projects/people/decisions). Query via search-brain. Keep as-is. - File auto-memory (memory/MEMORY.md + files) → the agent's operating learnings & corrections. Re-enable native writing. Keep — it is git-tracked, human-curated, prompt-injected, and already proven. - MC database (mc.db runtime_sessions metadata, activity_log, ticket history) → operational/continuity state (watchdog history, continuity snapshots, dispatch state). This is the right place for orchestrator working-state continuity (see Q2).

claude-mem is redundant with this stack, is not safe to keep given MC-3740, and contradicts a prior explicit decision. It should go.


Prioritised Recommended Fixes (each → a ticket)

# Priority Fix Why
1 P0 Remove claude-mem: disable plugin in both settings files, drop CLAUDE_CODE_DISABLE_AUTO_MEMORY=1, stop worker, clean ~/.claude-mem/. Unstick MC-3740 so the 534-item orchestrator_inbox backlog drains. Active outage; silently displaced working memory; OOM contributor; violates prior decision.
2 P0 Add real liveness supervision for luci-persistent: keep-alive check (every ~5 min) that calls ensure_started() + Telegram-alerts if the session is dead; add it to systemd_watchdog.py's watched set. Today a dead orchestrator (OOM/crash/reboot) can sit dead up to 30 min with no alert.
3 P1 Fix OOM root cause: cap concurrent heavy claude[1m] workers and/or add OOMScoreAdjust so the orchestrator is killed last; revisit box RAM (7.6 GiB, 457 MiB free, swap 47% used). Daily [operator:memory-pressure] tickets + gunicorn SIGKILLs; chronic, not transient.
4 P1 Wire a real context-escalation ladder into watchdog_tick(): soft→/compact, critical→forced-interrupt+/compact, hard→clear_persistent_luci() with continuity-snapshot re-injection. Adaptive (shorter) cooldown above soft threshold. Today soft and hard both just /compact; /clear + the continuity snapshot are dead code on the auto path.
5 P2 Give the orchestrator its own settings-orchestrator.json (the per-profile settings_file hook already exists in models.py). Scope toward coordination; gate heavy-implementation skills so the natural path is dispatch, not self-implement. Orchestrator currently launches with a full unrestricted worker toolset — role/capability mismatch, and self-implementation worsens the single-window token problem.
6 P2 Make orchestrator continuity load-bearing: on rotation//clear, re-inject continuity snapshot + live MC ticket/dispatch state into the fresh session's first prompt. Rotation/compaction currently lose working state; the continuity snapshot is captured but never restored.
7 P2 Reconcile MC-3718 chat-thread direction with the single-window context budget — route chat turns through a separate short-lived session, or budget for more frequent stateful rotation. A high-traffic chat surface merged into the one persistent window accelerates token growth and state loss.