Second-opinion review — Luci Control Room and Runtime Independence Plan

1. Executive verdict

Plan is sound, scoped well, and Phase 1 already landed cleanly. The governance/navigation framing fixes the main risk (control room becoming a second source of truth). Direction is correct: proceed with amendments. Most gaps are about what isn't yet named — Claude Code itself as a runtime peer to Hermes, the rare-runtime-failure recovery loop, and the runtime_sessions ledger contract — rather than what's wrong in what's written.

2. Must-fix before Phase 1 sign-off

Phase 1 is marked done. Two items should land before declaring it shipped:

Reverse link gap is acknowledged but unresolved. runtime-architecture-refresh.md does not point back to agent-control-room/docs/runtime-independence.md. Until Atlas signoff lands the reverse link, the canonical contract is unaware of the governance layer — anyone reading MC docs first will not discover the control room. Open the Atlas review ticket explicitly now so the Phase 1 box isn't quietly left half-checked.
CCGram-sole-poller invariant is not codified in runtime-independence.md. Phase 5A defers Telegram routing, but the invariant "CCGram is the only inbound poller; any other process invoking claude must use --settings ~/.claude/settings-worker.json" is a runtime-independence constraint now, not a Phase 5A decision. If a future runtime adapter forgets this, the 2026-04-16 409-Conflict outage repeats. Add it to the principle section as a non-negotiable adapter constraint.
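
The worker-settings half of this invariant is mechanically checkable. A minimal sketch, assuming adapters launch claude from an argv list; the helper name uses_worker_settings is hypothetical, and only the space-separated flag form quoted above is recognized:

```python
import os

# Path taken from the invariant above; everything else here is illustrative.
WORKER_SETTINGS = "~/.claude/settings-worker.json"


def _norm(path: str) -> str:
    """Expand ~ and normalize so equivalent spellings compare equal."""
    return os.path.normpath(os.path.expanduser(path))


def uses_worker_settings(argv: list[str]) -> bool:
    """Return True if argv passes `--settings <worker settings path>`."""
    # Stop one short of the end: a trailing --settings has no value to inspect.
    for i, arg in enumerate(argv[:-1]):
        if arg == "--settings":
            return _norm(argv[i + 1]) == _norm(WORKER_SETTINGS)
    return False
```

A check like this could run inside any adapter's spawn path before the subprocess starts, so a missing flag fails loudly instead of surfacing later as a polling conflict.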

3. Should-fix soon

runtime_sessions ledger is named as the invariant but not contracted. Plan says runtime adapters write to runtime_sessions, but nowhere defines the minimum row contract (session id, ticket id, profile, started_at, harvest path, terminal status). Without that, every new adapter re-invents the schema and runtime independence fails silently on read-back.
Claude Code is not enumerated as a first-class runtime. The inventory lists claude_anthropic, claude_glm, etc. — all of which are the Claude CLI with different providers. The interactive Luci-persistent tmux session is also Claude Code, and the dispatcher spawns Claude Code subprocesses. The plan talks about Hermes vs Codex vs Gemini as adapters but never names Claude-Code-the-runtime distinctly from the CLI-that-routes-providers. This conflation is exactly the kind of Claude-specific assumption Phase 3 says to surface — name it explicitly.
Worktree pool (MC-3840) is missing from runtime considerations. Workers claim slots from ~/workspace/.claude/worktrees/pool-{0,1,2}; persistent session never claims. A runtime adapter that doesn't honor "commit + push before DONE" destroys uncommitted work at next claim. This belongs in the adapter contract.
Smoke-test commands deferred to Phase 3 but the inventory already implies them. audit_task_runtime_profiles.py --lint is referenced; surface it now as the runtime-honesty smoke test rather than waiting for Phase 3.
No "runtime adapter must emit terminal-state contract" rule. The ~/.claude/rules/agent-recovery-and-loop-discipline.md terminal-state shape (status / summary / next_actions / artifacts) is the de facto contract for harvested subagent output. Plan doesn't reference it. Without that link, a Codex or Gemini adapter can satisfy the plan's letter but break Luci's recovery loop.
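
The two contracts named in this section (the runtime_sessions minimum row and the terminal-state shape) can be sketched as follows. The field and key names come from the findings above; the Python types, the class name, and the helper name are assumptions for illustration, not the existing schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Keys of the terminal-state shape named in the recovery-discipline rule.
TERMINAL_STATE_KEYS = ("status", "summary", "next_actions", "artifacts")


@dataclass(frozen=True)
class RuntimeSessionRow:
    """Minimum row every adapter would write (types are assumed)."""
    session_id: str
    ticket_id: str
    profile: str
    started_at: datetime
    harvest_path: str
    terminal_status: Optional[str] = None  # None while the session is running


def missing_terminal_keys(payload: dict) -> list[str]:
    """List the terminal-state keys absent from a harvested payload."""
    return [key for key in TERMINAL_STATE_KEYS if key not in payload]


row = RuntimeSessionRow(
    session_id="s-1",            # hypothetical values throughout
    ticket_id="MC-0000",
    profile="claude_glm",
    started_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
    harvest_path="/tmp/harvest/s-1",
)
```

With a shape like this pinned down, read-back code can reject a non-conforming adapter at harvest time rather than failing silently later.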

4. Optional / later improvements

Add a one-paragraph "what counts as a runtime swap test" rubric — e.g., "swap claude_glm in for claude_anthropic on a non-prod scheduled task; verify ticket history reads identical; verify cost-record format unchanged." Concretizes the Phase 7 acceptance test #2.
Consider a docs/glossary.md distinguishing: runtime (Claude Code, Hermes CLI, Codex CLI, Gemini CLI), provider (Anthropic, xAI, Z.AI, Moonshot, MiniMax, Google), model (sonnet, gpt-5.5, grok-4.3, glm-4.6), profile (the named adapter binding the three). Plan uses these terms interchangeably in places.
Phase 5 config-change checklist would benefit from a habit of snapshotting hermes config show output before and after each change, so WebUI drift becomes git-diffable.
Optional cross-link to MC-3659 (CAPABILITIES.md injection rule) — the control room README implicitly relies on that injection working.
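
The runtime / provider / model / profile distinction proposed for the glossary can be sketched as a small data structure. The bindings shown are illustrative guesses, not the real registry contents:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A named binding of runtime + provider + model (glossary sense)."""
    name: str
    runtime: str   # the binary or harness that executes the work
    provider: str  # who serves the model
    model: str     # which model is requested


# Illustrative bindings only; the actual values live in the profile registry.
claude_anthropic = Profile("claude_anthropic", "Claude Code", "Anthropic", "sonnet")
claude_glm = Profile("claude_glm", "Claude Code", "Z.AI", "glm-4.6")

# Same runtime, different provider: a provider swap, not a runtime swap.
same_runtime = claude_anthropic.runtime == claude_glm.runtime
```

Keeping the four terms as separate fields makes it harder to describe a provider change as a runtime change, which is the conflation the glossary is meant to prevent.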

5. Incorrect assumptions or stale architecture

"Hermes can load Claude skills via skills.external_dirs" (line 121). Treated as a current strength. Verify before relying on it — cross-host-skill-port skill exists precisely because cross-provider skill loading is unreliable in practice. If this isn't smoke-tested it is aspirational; the plan's own rule forbids aspirational claims.
"Gemini CLI retirement" is referenced in Phase 3 (line 277) as a hypothetical, but per recent context (obs 1685–1687, 1721) Gemini CLI deprecation is already on a 2026-06-18 timeline and was retired from persistent_luci. Update wording from "fallback plan for Gemini CLI retirement" to "Gemini CLI is already deprecating 2026-06-18; document the active migration path" — this is no longer hypothetical and Phase 3 should reflect that priority.
gpt-5.5 and grok-4.3 (both line 347): verify these model identifiers are current Hermes-resolvable names; if the Hermes config drifts, the inventory rots.
Plan says "runtime profiles are not yet treated as a clean portability layer" (line 128) — but dispatch_policy.py::forbidden_runtime_profiles and the audit_task_runtime_profiles.py --lint rule already give meaningful portability enforcement. The weakness is documentation, not absence. Sharpen the wording so Phase 3 doesn't redesign what exists.

6. Runtime-independence risks

Adapter contract is implicit. "Profile + adapter" is named but never specified. Without (1) the runtime_sessions row contract, (2) the terminal-state shape, (3) the harvest path discipline, (4) the worktree-pool commit-before-DONE rule, "swap a runtime by updating profile/adapter" is not actually testable.
Direct-API profiles drift risk. direct_gemini, direct_anthropic_sdk, direct_mixed bypass the CLI and therefore bypass any provider-routing env the scheduler injects. The runtime-profile-honesty rule (CLAUDE.md key rule #8) is the only thing keeping these honest. The lint must run on every commit that touches ~/workspace/tasks/; plan should escalate that from a Phase 4 audit to a continuous gate.
Larry profile is read from an HTTP endpoint at pickup time (/api/provider/larry). If MC is down or that endpoint hangs, pickup uses stale defaults. Plan should note that Larry adapter readiness depends on MC liveness — a circular dependency worth flagging.
Persistent session vs ticket session conflation. Persistent Luci anchors at ~/workspace; pool workers anchor at pool-N/. A runtime adapter that doesn't distinguish these will either contaminate the persistent branch or fail to commit at all. Make this an adapter-contract item.
No fallback when all CLI runtimes fail. Plan lists claude_glm as the CLI fallback but doesn't define what happens if Anthropic and Z.AI are both unreachable. The cost-band design needs a "local/fallback" tier (mentioned in Phase 4 line 287); make it concrete — what runs offline / on degraded providers / in 429 storms?
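
One way to make the fallback tier concrete is an ordered list of profiles with a health probe. This is a sketch under assumptions: pick_profile, the probe callable, and the local_fallback profile name are all hypothetical, and the plan does not yet define how health is measured:

```python
from typing import Callable, Optional, Sequence


def pick_profile(
    tiers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy profile in priority order, else None."""
    for profile in tiers:
        if is_healthy(profile):
            return profile
    return None  # caller must decide: queue the work, or fail the ticket


# Hypothetical ordering; "local_fallback" is not an existing profile name.
TIERS = ("claude_anthropic", "claude_glm", "local_fallback")

# Simulate Anthropic and Z.AI both being unreachable.
down = {"claude_anthropic", "claude_glm"}
chosen = pick_profile(TIERS, lambda p: p not in down)
```

Returning None explicitly forces the plan to answer what happens when every tier is down, which is the open question raised above.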

7. Telegram / CCGram / routing risks

Plan correctly defers routing semantics to Phase 5A but understates urgency. Telegram routing is already the most fragile cross-runtime contract: any new adapter that touches outbound Telegram via anything other than notify.py POST, or that adds any getUpdates polling, breaks CCGram immediately. This is a Day-0 invariant, not a Phase 5A deliverable.
mc_telegram_bridge.py::runtime_profiles() is listed under MC registry surfaces. The bridge service is stopped/disabled (per CLAUDE.md, since MC-2617 2026-04-29). If code remains but the service is dead, the inventory is misleading — confirm whether this is live, vestigial, or being re-purposed.
Home-channel vs ticket-topic ambiguity is named but no collision matrix exists yet. Until one does, runtime swap testing cannot validate routing parity. Worth promoting to Must-fix if a runtime swap is contemplated before Phase 5A.
Worker-settings rule visibility. The "every long-running claude process must use --settings ~/.claude/settings-worker.json" rule lives in CLAUDE.md but not in runtime-independence.md. Phase 1 should link it.
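
The no-extra-poller invariant could be backed by a crude source scan over adapter code. A minimal sketch, assuming a plain text search is acceptable as a first line of defense; find_polling_calls is a hypothetical helper, and it will not catch polling hidden behind a library wrapper:

```python
import re

# getUpdates is the Telegram Bot API long-polling method.
POLLING_RE = re.compile(r"\bgetUpdates\b")


def find_polling_calls(source: str) -> list[int]:
    """Return 1-based line numbers in source that mention getUpdates."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if POLLING_RE.search(line)
    ]


sample = "import requests\nr = requests.get(base + '/getUpdates')\nprint(r)\n"
hits = find_polling_calls(sample)
```

Any hit outside CCGram itself would be grounds to block the change until a human confirms it is not a second poller.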

8. Storage-boundary risks

The four-layer boundary (reports/ vs mission-control/docs/ vs manifest/CAPABILITIES vs agent-control-room/) is well-drawn but has unclosed seams:

agent-control-room → manifest mirror rule is one-way. Plan says control room links to manifest. But the inverse — when a control-room doc changes, does anything update CAPABILITIES? — is undefined. CAPABILITIES.md's own rule (key rule #9) says any new service/skill/infra must update it. Control-room docs themselves are now infra; explicitly add an entry in CAPABILITIES.md so future Luci finds the control room via the always-injected manifest.
reports/README.md is checked off but unverified in this review. The plan marks it [x]; spot-check that it actually exists and is current, otherwise Phase 6 is a paper completion.
MC docs vs control-room governance docs can drift. runtime-architecture-refresh.md is canonical; runtime-independence.md is governance. If MC architecture changes and governance doc isn't re-read, governance silently goes stale. Add a "validity window" or "last reviewed against runtime-architecture-refresh.md" line to governance docs.
luci-manifest.md ↔ CAPABILITIES.md split is ambiguous. Manifest = "deployed inventory"; CAPABILITIES = "deployed inventory + capabilities". The plan treats them as siblings but they overlap. Worth a one-line authority statement: which one wins when they disagree?
No archive policy for reports. If reports/ accumulates plans that become contracts, the boundary "if it becomes a contract, mirror into mission-control/docs/" needs an originating-report fate rule: archived, marked superseded, or kept as historical record?
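
The "last reviewed against runtime-architecture-refresh.md" line suggested above becomes useful once something checks it. A minimal sketch, assuming the line carries an ISO date and that 90 days is an acceptable window; the exact header wording, the helper name, and the threshold are all assumptions:

```python
import re
from datetime import date, timedelta

# Assumed header format; the governance docs do not define one yet.
REVIEWED_RE = re.compile(
    r"^Last reviewed against runtime-architecture-refresh\.md:"
    r"\s*(\d{4}-\d{2}-\d{2})\s*$",
    re.MULTILINE,
)


def is_stale(doc_text: str, today: date, max_age_days: int = 90) -> bool:
    """True if the review line is missing or older than max_age_days."""
    match = REVIEWED_RE.search(doc_text)
    if match is None:
        return True  # no review line counts as stale
    reviewed = date.fromisoformat(match.group(1))
    return today - reviewed > timedelta(days=max_age_days)


doc = "Runtime independence\nLast reviewed against runtime-architecture-refresh.md: 2026-01-01\n"
```

Treating a missing line as stale means a governance doc cannot opt out of the check simply by omitting the header.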

9. Missing workers / runtimes / tooling

Claude Code itself is missing as a named runtime. The plan lists Hermes, Codex, Gemini, Kimi, GLM, browser agents, direct APIs — but the actual running interactive session and every dispatched worker is Claude Code. Phase 3 must name it.
Tessa as a runtime. Tessa is described as a QA validator (key gate) but is itself a Claude subagent runtime with a defined harness. If Tessa's harness changes (or moves cross-host per the cross-host-skill-port skill), gate-3 (UI) silently breaks. Add Tessa to the runtime inventory.
Atlas signoff path is a missing runtime. Architecture/docs gate routes to Atlas — but Atlas lives on Lucienne's Mac. If Mac is asleep, Phase 1's deferred reverse link can't complete. Plan doesn't acknowledge this dependency.
Browser-harness as a runtime. Listed in CAPABILITIES but not in the runtime inventory. UI-QA and scraping tasks need it; if it's not on the inventory, future runtime decisions ignore it.
Worktree pool manager (MC-3840) as runtime infrastructure. Not a runtime per se but the harness every CLI worker runs inside. Belongs in the inventory.
Council / second-opinion as a runtime cluster. Codex + Opus + Gemini + Kimi + GLM in parallel is a recurring multi-runtime invocation pattern. Treat the council as a named composite runtime in the registry so cost-band routing can account for it.
Hermes WebUI itself. Plan governs WebUI but doesn't classify the WebUI as a runtime surface that can take action. If Elmar can change provider via Hermes WebUI and that mutates ~/.hermes/config.yaml, the WebUI is effectively a write path into runtime config — declare it.

10. Concrete recommended patches to the living plan

Add a Phase 1.5: adapter contract. Before Phase 2/3 work begins, add a small subsection in docs/runtime-independence.md titled "Runtime adapter contract" enumerating: (a) runtime_sessions row shape, (b) terminal-state output shape, (c) harvest commit-before-DONE rule, (d) CCGram-sole-poller rule, (e) --settings settings-worker.json rule for any long-running claude invocation, (f) worktree-pool slot anchoring rule, (g) Telegram outbound = notify.py only.
Rewrite Phase 3 Gemini-CLI bullet to reflect its already-deprecating status (2026-06-18 timeline; per obs 1721 it is already retired from persistent_luci). Move it from "future fallback plan" to "active migration."
Promote runtime-profile lint from Phase 4 audit task to a continuous gate: add an entry to Phase 1 marking audit_task_runtime_profiles.py --lint as the smoke test the control room exposes.
Add Claude Code as named runtime. In the Phase 1 inventory, separate "Claude Code (the CLI binary)" from "claude_anthropic (a profile of that binary)" — they are different abstractions.
Add to Phase 1 the explicit Atlas-signoff ticket for the reverse link from runtime-architecture-refresh.md. Don't carry it as a deferred bullet — make it a tracked sub-task with the same MC-3898 parent.
Add a CAPABILITIES.md entry for the control room itself so it's discoverable via the always-injected manifest. (Not just a link from the plan — an entry in the table.)
Strengthen open-decision wording. "Should the control room live in ~/workspace/agent-control-room/?" — current state in ~/workspace/ is consistent with the rest of Luci's home; reword from open question to ratified default unless Elmar contradicts.
Add a "validity-window" header to both governance docs: "Valid as of YYYY-MM-DD; re-check against runtime-architecture-refresh.md and scheduler.py PROFILE_PROVIDER on or before YYYY-MM-DD." Forces drift detection.
Cross-link ~/.claude/rules/agent-recovery-and-loop-discipline.md from runtime-independence.md so terminal-state contract is reachable from the governance layer.
Fix one wording risk in line 165: "no code edits for direct API profiles". In principle direct_anthropic_sdk can edit files if its script does so, so the statement is not literally true. The real constraint is "direct API profiles MUST declare direct_* runtime_profile and MUST NOT be assigned tasks routed through the claude CLI dispatcher's tool/file-edit assumptions." Tighten the wording or it misleads adapter authors.
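
The Phase 1.5 adapter contract, items (a) through (g), could be tracked per adapter as a checklist so that a runtime swap becomes testable. A minimal sketch; the class, field names, and helper are hypothetical and simply mirror the seven items listed above:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AdapterContractCheck:
    """One boolean per contract item (a)-(g); True means satisfied."""
    writes_runtime_sessions_row: bool      # (a)
    emits_terminal_state: bool             # (b)
    commits_and_pushes_before_done: bool   # (c)
    never_polls_telegram: bool             # (d)
    uses_worker_settings: bool             # (e)
    anchors_in_worktree_slot: bool         # (f)
    sends_telegram_via_notify_only: bool   # (g)


def violations(check: AdapterContractCheck) -> list[str]:
    """Names of contract items the adapter does not satisfy, in order."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


# Hypothetical adapter that satisfies everything except (c) and (e).
candidate = AdapterContractCheck(True, True, False, True, False, True, True)
failed = violations(candidate)
```

An adapter would count as swappable only when violations() returns an empty list; how each boolean is actually verified is left to the per-item smoke tests.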

11. Final recommendation

Proceed with amendments.

The plan's direction is correct and Phase 1 work is real. The amendments are mostly about making implicit contracts explicit — adapter contract, Claude-Code-as-runtime naming, Gemini-CLI status update, CCGram invariant promotion, Atlas reverse-link as a tracked sub-task, and CAPABILITIES.md entry for the control room itself. None require redesign; all should land before Phase 2 begins so the inventory work doesn't bake in ambiguity. Do not start Phase 4 (model/cost routing) until Phase 3 inventory + adapter contract are explicit, otherwise the routing decisions land on a registry that doesn't fully define what it's routing.