
Control Room vs MC Runtime Architecture — Recommendation

Date: 2026-05-31
Reviewer: Lucienne (architectural review)
Status: DECISION REQUIRED — Recommendation provided


Executive Summary

Recommend Option 1: Commit fully to the Control Room model.

The hybrid state is already causing confusion, burning tokens, and creating operational drag. The data shows the old runtime model is effectively dormant for Luci (0 active runtime sessions, 0 open Luci tickets in runtime, mc_pickup.py not dispatching Luci tickets), yet the infrastructure still exists, creates maintenance burden, and complicates the mental model. Larry's work can be folded into the Control Room dispatch path without losing capability.

The decisive factor: Elmar designed the Control Room model as the intended architecture. The only reason to keep the hybrid is fear of migration cost — but the migration is smaller than it appears because the runtime is already largely inactive for the main orchestrator path.


Evidence from Live System

| Metric | Value | Interpretation |
|---|---|---|
| Active runtime_sessions | 0 | No live ticket workers running |
| runtime_sessions (completed) | 177,290 | Historical record, not active infrastructure |
| runtime_sessions (failed/stale) | 770 + 180 | Cleanup debt, not active work |
| Luci open tickets in MC | 3 (MC-3930 waiting, MC-4464 todo, MC-4193 waiting) | All manageable by Control Room watcher |
| Larry open tickets in MC | 0 | No active Larry runtime work currently |
| mc_pickup.py | Exists, 7,971 lines, 120 functions | Heavy legacy codebase |
| Luci Control Room watcher | d237c9eb2a7c, enabled, */5 min, no_agent script | Active control room path |
| Old mc-board-shepherd-5min | 7fa17b6a8bad, disabled since 2026-05-29 | Old runtime path already shut down |
| MC control-plane watchdog | b35ca4611b00, enabled, every 15m, no_agent | Healthy infrastructure monitoring |
| Iris jobs | 8 enabled cron jobs, all no_agent or LLM-driven | Iris already operates on notification/event model |
| tmux sessions | mc-root: 2 windows | Minimal residual runtime footprint |

Key insight: The system is already mostly running in Control Room mode for Luci. The old mc-board-shepherd-5min (the full-LLM runtime orchestrator) was disabled on 2026-05-29. The only remaining runtime activity is residual tmux sessions and the mc_pickup.py codebase itself.


Option Analysis

Option 1: Commit to Control Room (RECOMMENDED)

Why this is correct:
- Aligns with Elmar's explicit design intent
- The system is already ~90% migrated; finish the job
- Eliminates the confusing dual-path mental model
- Removes 7,971 lines of legacy dispatcher code (mc_pickup.py)
- Lets the orchestrator (Luci) own all routing decisions
- Enables consistent behavior across Luci, Iris, and Lucienne

Why the cons are manageable:
- "Lose existing runtime infrastructure": the infrastructure is already dormant. The 177K completed sessions are historical data, not active capability.
- "Need to rebuild worker dispatch": Luci's Control Room watcher already spawns external workers (Codex CLI, Claude Code, subagents). The dispatch capability exists; it's just not going through mc_pickup.py.
- Larry's coding tasks can be dispatched the same way: Luci creates a ticket, spawns a worker in a tmux session or SSH session, and tracks it via comments/status.

Option 2: Hybrid (REJECTED)

Why this is wrong:
- The current "hybrid" is actually just a messy transition state, not a designed architecture
- Every ticket would need a routing decision: "does this go through Control Room or runtime?" This creates confusion and drift
- Two status-change paths mean tickets can get stuck in gaps between systems
- The cost issue (5-min polling) isn't solved; it's just split across two systems
- No clear boundary exists today; creating one is harder than finishing the migration

Option 3: Revert to Runtime-Only (REJECTED)

Why this is wrong:
- Directly contradicts Elmar's design intent
- Loses orchestrator intelligence that has been built and documented (AGENTS.md, delegation control plane, multi-model review)
- Re-introduces the exact problem Elmar wanted to solve: workers talking directly to Elmar, no single coordination layer
- The Control Room model is working; reverting would be a step backward


Concrete Migration Plan

Phase 1: Immediate (Today — Zero Downtime)

  1. Confirm Larry path migration

    • Verify Larry's current workflow: does he still get tickets via mc_pickup.py dispatch, or has he already moved to Control Room handoff?
    • If Larry still uses runtime: create a runbook for Luci to spawn Larry workers via SSH/tmux directly (this already exists in AGENTS.md: "SSH into Larry's host from inside your dev-loop session")

  2. Disable residual runtime triggers

    • Ensure ticket-pickup.md and needs-input-pickup.md task files remain disabled (already done per pulse JSON)
    • Set a config flag MC_CONTROL_ROOM_MODE=true in MC env to prevent accidental re-enablement

  3. Clean up stale runtime_sessions

    • Archive or truncate runtime_sessions rows older than 30 days (keep completed rows for analytics, but move them to an archive table)
    • The table holds about 178K rows (177,290 completed plus 950 failed/stale), which is significant DB bloat
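The archive step in Phase 1 could look roughly like the sketch below. It assumes MC keeps runtime_sessions in SQLite and that each row has a `finished_at` column holding an ISO-8601 UTC timestamp; the column names and the `runtime_sessions_archive` table name are illustrative assumptions, not taken from the MC schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def archive_old_sessions(conn: sqlite3.Connection, now: datetime, days: int = 30) -> int:
    """Move runtime_sessions rows older than `days` into an archive table.

    Assumed (hypothetical) schema: runtime_sessions(id, status, finished_at),
    with finished_at stored as an ISO-8601 UTC string. Rows whose finished_at
    is NULL (still running) never match the cutoff and stay where they are.
    """
    cutoff = (now - timedelta(days=days)).isoformat()
    # Create an empty archive table with the same columns on first use.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runtime_sessions_archive "
        "AS SELECT * FROM runtime_sessions WHERE 0"
    )
    with conn:  # copy and delete commit together or roll back together
        conn.execute(
            "INSERT INTO runtime_sessions_archive "
            "SELECT * FROM runtime_sessions WHERE finished_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM runtime_sessions WHERE finished_at < ?", (cutoff,)
        )
    return cur.rowcount  # number of rows moved
```

Running the copy and the delete in one transaction means a failure midway leaves the live table untouched, which matters if the archive is later used as the rollback source.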

Phase 2: Short-Term (This Week)

  1. Retire mc_pickup.py

    • Move mc_pickup.py to _deprecated/ (don't delete yet; keep for 30 days)
    • Update any systemd services or scripts that reference it
    • The file is 312KB and 7,971 lines; removing it significantly reduces codebase complexity

  2. Unify agent watcher model

    • Luci: d237c9eb2a7c (Control Room watcher, no_agent script); keep as-is
    • Iris: 8 cron jobs, mix of no_agent scripts and LLM-driven; evaluate if any should move to event-driven
    • Lucienne: No active cron watcher found; confirm if she needs one or if she operates on-demand
    • Standardize on: no_agent script for polling/health checks, LLM invocation only when actionable work is found

  3. Solve the noise/cost issue

    • The current Luci watcher (d237c9eb2a7c) is a no_agent script ("no_agent": true); this is already the right approach
    • The disabled mc-board-shepherd-5min was the expensive full-LLM-every-5-min job, correctly disabled
    • Keep the no_agent script for polling; only invoke LLM when tickets need action
    • Iris jobs: most are already event-driven or daily (not 5-min polling). The twice-daily noisy-email sweep and morning/evening digest are appropriate frequency.
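The "poll cheaply, invoke the LLM only on actionable work" pattern can be sketched as follows. This is an illustration under assumptions, not the actual watcher script: the ticket dict shape, the `ACTIONABLE_STATUSES` set, and the `invoke_llm` callback are all hypothetical.

```python
from typing import Callable, Iterable

# Hypothetical: which statuses count as "needs action" is an assumption,
# not something taken from the MC schema.
ACTIONABLE_STATUSES = {"todo"}


def poll_once(
    tickets: Iterable[dict],
    agent: str,
    invoke_llm: Callable[[list], None],
) -> int:
    """One no_agent polling pass: filter in plain code, call the LLM only if needed."""
    actionable = [
        t for t in tickets
        if t.get("assignee") == agent and t.get("status") in ACTIONABLE_STATUSES
    ]
    if actionable:
        invoke_llm(actionable)  # the only step that spends tokens
    return len(actionable)
```

With the three open Luci tickets listed in the evidence table (two waiting, one todo), a pass like this would hand only MC-4464 to the LLM, and an empty board would cost no tokens at all.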

Phase 3: Medium-Term (Next 2 Weeks)

  1. Build webhook/notification layer

    • Add PostgreSQL LISTEN/NOTIFY or a lightweight webhook on ticket insert/update (SQLite has no NOTIFY equivalent; if MC runs on SQLite, a trigger-populated change table or an application-level hook would be needed instead)
    • Replace cron polling with event-driven triggers where possible
    • This addresses the core cost concern: no more 5-min polling burns
    • Fallback: keep no_agent cron at reduced frequency (e.g., every 15m) for resilience

  2. Document the unified model

    • Update AGENTS.md and runbooks to reflect: all work routes through orchestrator, no direct runtime sessions
    • Document worker spawn patterns: tmux, SSH, Codex CLI, Claude Code, subagents
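One way to combine event-driven triggers with a reduced-frequency polling fallback is a blocking wait with a timeout. The sketch below assumes some notification source (for example a PostgreSQL LISTEN/NOTIFY listener or a webhook handler, neither shown here) pushes ticket ids into an in-process queue; the function name and return strings are illustrative.

```python
import queue


def wait_for_work(events: "queue.Queue[str]", fallback_seconds: float) -> str:
    """Block until a ticket event arrives, or fall back to a poll on timeout.

    Returns "event:<ticket id>" when woken by a notification, and
    "fallback-poll" when nothing arrived within fallback_seconds.
    """
    try:
        ticket_id = events.get(timeout=fallback_seconds)
        return f"event:{ticket_id}"
    except queue.Empty:
        return "fallback-poll"
```

A watcher loop would call `wait_for_work(events, 900)` to get the 15-minute safety net: if the notification layer silently fails, the loop still wakes up and polls.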

Phase 4: Cleanup (Next 30 Days)

  1. Delete deprecated code

    • Remove _deprecated/mc_pickup.py after 30-day quarantine
    • Drop runtime_sessions table or archive to cold storage
    • Remove tmux session management code from MC if no longer needed

  2. Validate cost reduction

    • Measure token burn before/after: the old model ran 3 agents × 288 LLM runs/day (one every 5 minutes) = 864 full LLM invocations/day
    • New model: no_agent scripts do the same 864 cheap polls, and the LLM is invoked only on actionable events (estimate: 10-50/day)
    • Target: 90%+ reduction in token burn for polling
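The arithmetic behind the target can be checked directly. Note that this counts LLM invocations, not tokens; it matches token burn only if the average invocation costs about the same before and after, which is an assumption.

```python
POLL_INTERVAL_MIN = 5
AGENTS = 3

runs_per_agent_per_day = 24 * 60 // POLL_INTERVAL_MIN  # 288
old_llm_runs = AGENTS * runs_per_agent_per_day          # 864


def reduction(new_llm_runs: int) -> float:
    """Fractional drop in daily LLM invocations relative to the old model."""
    return 1 - new_llm_runs / old_llm_runs


# 10-50 actionable events/day is the estimate given above, not a measurement.
low_estimate, high_estimate = 10, 50
print(f"{reduction(high_estimate):.1%} to {reduction(low_estimate):.1%}")
# prints "94.2% to 98.8%"
```

Even at the high end of the estimate the reduction clears the 90% target, so the target holds unless actionable events exceed roughly 86 per day (10% of 864).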

Handling Existing Runtime Sessions and mc_pickup.py

| Artifact | Current State | Action |
|---|---|---|
| mc_pickup.py | 7,971 lines, not dispatching Luci tickets | Move to _deprecated/, retire in 30 days |
| runtime_sessions table | 178K rows, 0 active | Archive old rows, drop or keep for history |
| tmux mc-root | 2 windows | Inspect contents, kill if idle, document if active |
| Old ticket-pickup task files | Disabled (per pulse JSON) | Keep disabled, add config guard |
| mc-board-shepherd-5min cron | Disabled since 2026-05-29 | Delete after migration confirmed stable |
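The config guard for the old task files could be as small as the sketch below. MC_CONTROL_ROOM_MODE is the flag proposed in Phase 1; the function names and where the guard would be called from are assumptions for illustration.

```python
import os
from typing import Mapping

FLAG = "MC_CONTROL_ROOM_MODE"


def control_room_mode(env: Mapping[str, str] = os.environ) -> bool:
    """True when the flag is set to 'true' (case-insensitive)."""
    return env.get(FLAG, "").strip().lower() == "true"


def guard_runtime_trigger(trigger: str, env: Mapping[str, str] = os.environ) -> None:
    """Refuse to start a legacy runtime trigger while Control Room mode is on."""
    if control_room_mode(env):
        raise RuntimeError(f"{trigger} is disabled: {FLAG}=true")
```

Calling `guard_runtime_trigger("ticket-pickup")` at the top of any legacy entry point would turn an accidental re-enablement into a loud failure instead of a silent second dispatch path.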

Larry's coding tasks: Larry already works via SSH from Luci's dev-loop session (per AGENTS.md). The Control Room model doesn't change this — it just means Luci explicitly spawns the session and tracks it via MC comments, rather than mc_pickup.py auto-dispatching. This is actually more controlled, not less.
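As a rough illustration of an explicit spawn, the helper below only builds the command line for starting a detached tmux session on a remote host over SSH; it does not run anything. The host name, session naming scheme, and worker command are hypothetical placeholders, and tracking via MC comments is not shown.

```python
import shlex


def build_dispatch_command(host: str, ticket_id: str, worker_cmd: list) -> list:
    """Build argv that starts a detached tmux session on `host` running the worker.

    The session name "worker-<ticket id>" is an illustrative convention.
    """
    session = f"worker-{ticket_id}"
    # tmux takes the worker as a single shell-command string.
    remote = ["tmux", "new-session", "-d", "-s", session, shlex.join(worker_cmd)]
    # ssh passes one command string to the remote shell, so quote it once more.
    return ["ssh", host, shlex.join(remote)]
```

For example, `build_dispatch_command("larry-host", "MC-4464", ["run-worker", "--ticket", "MC-4464"])` returns a list that can be passed to `subprocess.run`; naming the session after the ticket makes it easy to find the matching tmux window from an MC comment.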


Risks and Mitigation

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Larry workflow disruption | Medium | High | Verify Larry's current dispatch path before retiring mc_pickup.py; have Luci manually dispatch first few tickets |
| Ticket gets missed without auto-pickup | Low | High | no_agent watcher still polls every 5 min; adds comment/status on finding actionable tickets |
| Iris jobs incompatible with Control Room | Low | Medium | Iris already operates on different schedule/frequency; no change needed |
| Webhook layer fails, no polling fallback | Low | High | Keep no_agent cron at 15m interval as safety net until webhooks proven |
| Historical runtime data loss | Low | Low | Archive before dropping; 178K rows are mostly completed/failed sessions |
| Reversion pressure if issues arise | Medium | Medium | Keep deprecated code for 30 days; document rollback procedure |

Decision

Commit to Option 1: Full Control Room model.

The system is already mostly there. The migration is finishing work, not starting over. The cost savings, architectural clarity, and alignment with Elmar's intent make this the obvious choice.

Next action: Elmar to approve this recommendation, then execute Phase 1 immediately (zero downtime).