PRD: Mission Control Multi-Agent Orchestration Architecture

Owner: Elmar / Luci
Date: 2026-06-01
Status: Draft for independent council planning

1. Problem

We need a reliable architecture for orchestrating a multi-agent software/project workflow using Mission Control (MC), Hermes/Luci, and multiple CLI/LLM runtimes such as Claude Code, Codex CLI, Gemini CLI, Kimi/GLM, and Hermes profiles.

It is currently unclear where the “controller” lives, what should be persistent, what should be scripted, and how work should advance automatically through planning, implementation, review, QA, testing, and validation without Elmar having to ask for follow-up manually.

2. Background

Elmar previously used Lucienne’s older worker orchestration model with multiple agent profiles. That system worked well: when an agent finished a stage, Lucienne’s orchestration knew immediately and handed the work to the next stage, so Elmar never needed to ask “what is going on?” or “how far are you?”

Mission Control currently has tickets, comments, runtime sessions, tmux-backed interactive runtimes, scheduler/watchdog scripts, and multiple possible runtimes. However, recent Control Room discussions exposed uncertainty about the right architecture:

Should there be one global persistent controller?
Should each ticket have its own persistent controller?
Should “Planner / Builder / Reviewer / Tester / Validator” be actual persistent agents, ephemeral agents, or just workflow gates?
Should Hermes be the controller layer, a worker runtime, or both?
Can Claude Code / Codex / Gemini act as controllers, or should they stay workers?
How should automatic next-stage handoff work?
What parts should be deterministic scripts vs LLM judgment?

3. Current Assets / Constraints

Existing assets

Mission Control web app and SQLite DB.
Tickets, comments, statuses, runtime session ledger, scheduler/task run history.
tmux-backed runtime sessions.
CLI runtimes available or planned: Claude Code, Codex CLI, Gemini CLI, Kimi/GLM wrappers, Hermes profiles.
Telegram/MC UI as human control surfaces.
Watchdog/scheduler scripts for deterministic checks.
Existing requirement that workers should not talk directly to Elmar; the orchestrator should speak in one voice.

Constraints

Must preserve Claude Code, Codex, Gemini, Kimi/GLM, Hermes as selectable runtimes where appropriate.
Must avoid one Telegram bot token being polled by multiple workers.
Must avoid uncontrolled auto-dispatch from arbitrary ticket comments.
Must support interactive runtime sessions where follow-up can be sent back to the same worker.
Must not require 7 always-running agents per ticket if that is resource-prohibitive.
Must let Elmar comment on a ticket and have that comment reach the right controller/owner.
Must make stuck tickets visible and recoverable automatically.
Must support small tasks without excessive ceremony.
Must support serious coding tasks with planning, implementation, independent review, testing, validation, evidence, and final gating.
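
To make the "stuck tickets visible and recoverable" constraint concrete, here is a minimal sketch of a deterministic watchdog check. It is an illustration only, not a proposed design: the `TicketSnapshot` fields, the `stuck_reason` function, and the 30-minute `max_idle` default are all hypothetical and do not reflect the existing MC schema or scripts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TicketSnapshot:
    """Hypothetical view of one ticket as a watchdog script might read it."""
    ticket_id: int
    phase: str                # e.g. "build", "review", "done"
    last_activity: datetime   # last recorded evidence, comment, or heartbeat
    has_live_session: bool    # whether a runtime session is currently attached


def stuck_reason(
    t: TicketSnapshot,
    now: datetime,
    max_idle: timedelta = timedelta(minutes=30),  # assumed threshold
) -> Optional[str]:
    """Return a human-readable reason if the ticket looks stuck, else None."""
    if t.phase == "done":
        return None
    if now - t.last_activity <= max_idle:
        return None
    if t.has_live_session:
        return "live session silent past max_idle"
    return "no live session and idle past max_idle"
```

A check like this needs no LLM judgment: it only reads recorded state, so it can run from a scheduler and surface a reason that a controller or Elmar can act on.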

4. Goals

Define a clear mental model for:
    Mission Control
    Luci / Hermes
    per-ticket controllers, if any
    global supervisor, if any
    runtimes/workers
    workflow gates
    scripts/watchdogs
Define the lifecycle for a ticket from creation to Done.
Define how a completed stage automatically triggers the next stage.
Define when to use:
    direct controller work
    short-lived subagents
    persistent builder sessions
    ephemeral reviewers/testers
    deterministic scripts
Define how Hermes profiles and CLI runtimes should be used.
Define how to preserve interactive continuity with Claude/Codex/Gemini/etc. without overusing resources.
Define a safe incremental implementation plan.

5. Non-Goals

Do not design a system that requires every role to be a permanent always-on process.
Do not replace Claude Code/Codex/Gemini with Hermes-only workers.
Do not rely purely on LLM memory for workflow state.
Do not create a design where workers self-approve final Done.
Do not require Elmar to manually babysit ordinary stage handoffs.

6. Required User Experience

Ticket interaction

Elmar should be able to open/comment on a ticket and expect:

The comment reaches the correct controller/owner for that ticket.
If the ticket is active, the system knows the current phase and active worker.
If the ticket is stuck, the system knows why or can escalate.
Elmar can ask “what is going on?” and get a coherent answer from the ticket’s current controller/owner.
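
As an illustration of what a "coherent answer" could look like when it is built from recorded state rather than LLM memory, consider the sketch below. The `TicketStatus` record and `status_summary` function are hypothetical names introduced for this example; plan authors are free to model this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TicketStatus:
    """Hypothetical durable status record for one ticket."""
    ticket_id: int
    owner: str                            # controller/owner that receives comments
    phase: str                            # current workflow phase
    active_worker: Optional[str] = None   # runtime doing the work, if any
    blocked_reason: Optional[str] = None  # set when the ticket is stuck


def status_summary(s: TicketStatus) -> str:
    """Answer "what is going on?" from recorded state only."""
    parts = [f"Ticket {s.ticket_id} is in phase '{s.phase}' (owner: {s.owner})."]
    if s.active_worker:
        parts.append(f"Active worker: {s.active_worker}.")
    else:
        parts.append("No worker is currently active.")
    if s.blocked_reason:
        parts.append(f"Blocked: {s.blocked_reason}.")
    return " ".join(parts)
```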

Stage handoff

When a worker/stage finishes:

The system records structured evidence.
The system knows the next required gate.
The next stage starts automatically if it is a technical workflow step.
Elmar is only asked for genuine human/product/priority/taste/spend decisions.
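
The four expectations above can be sketched as a small deterministic handoff function. This is one possible reading, offered under assumptions: the stage order is taken from the plan → build → review → test → validate → done sequence named in this PRD, while the `Evidence` and `Ticket` types, the `advance` function, and the action strings (`start:`, `hold:`, `ask_human`, `close`) are hypothetical. What happens after a failed gate is deliberately left open: the sketch only holds the ticket in place.

```python
from dataclasses import dataclass, field
from typing import List

STAGES = ["plan", "build", "review", "test", "validate", "done"]


@dataclass
class Evidence:
    """Structured result a worker reports when its stage finishes."""
    stage: str
    passed: bool
    summary: str
    needs_human: bool = False  # genuine product/priority/taste/spend decision


@dataclass
class Ticket:
    ticket_id: int
    stage: str = "plan"
    evidence: List[Evidence] = field(default_factory=list)


def advance(ticket: Ticket, ev: Evidence) -> str:
    """Record evidence and return the next action for the orchestrator."""
    if ticket.stage == "done":
        raise ValueError("ticket is already done")
    if ev.stage != ticket.stage:
        raise ValueError(f"evidence for {ev.stage!r} but ticket is in {ticket.stage!r}")
    ticket.evidence.append(ev)        # always keep the record, pass or fail
    if ev.needs_human:
        return "ask_human"            # only case where Elmar is asked
    if not ev.passed:
        return f"hold:{ticket.stage}" # rework policy is out of scope here
    ticket.stage = STAGES[STAGES.index(ticket.stage) + 1]
    return "close" if ticket.stage == "done" else f"start:{ticket.stage}"
```

For example, `advance(Ticket(7), Evidence("plan", True, "plan approved"))` returns `"start:build"` without any human involvement, which is the behaviour this section asks for on ordinary technical steps.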

Resource usage

The system should avoid unnecessary live sessions. It should distinguish durable state from live processes.
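
One way to picture the split between durable state and live processes is two separate tables: one that outlives any session and one that only tracks what is running right now. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table names, columns, and helper functions are hypothetical and are not the existing Mission Control schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical example tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        -- Durable: survives session shutdown and restarts.
        CREATE TABLE ticket_state (
            ticket_id INTEGER PRIMARY KEY,
            phase     TEXT NOT NULL,
            owner     TEXT NOT NULL
        );
        -- Live: one row per running session, removed when the session ends.
        CREATE TABLE live_session (
            ticket_id INTEGER NOT NULL REFERENCES ticket_state(ticket_id),
            runtime   TEXT NOT NULL,
            tmux_name TEXT NOT NULL
        );
        """
    )
    return db


def end_session(db: sqlite3.Connection, ticket_id: int) -> None:
    """Drop the live rows only; the durable ticket state is left untouched."""
    db.execute("DELETE FROM live_session WHERE ticket_id = ?", (ticket_id,))


def describe(db: sqlite3.Connection, ticket_id: int) -> dict:
    """Read the durable phase/owner plus the count of live sessions."""
    phase, owner = db.execute(
        "SELECT phase, owner FROM ticket_state WHERE ticket_id = ?", (ticket_id,)
    ).fetchone()
    (live,) = db.execute(
        "SELECT COUNT(*) FROM live_session WHERE ticket_id = ?", (ticket_id,)
    ).fetchone()
    return {"phase": phase, "owner": owner, "live_sessions": live}
```

With this separation, a ticket can remain in phase `build` with zero live sessions, and a new session can later be attached without losing workflow state.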

7. Questions for Independent Plan Authors

Please propose a fresh architecture from scratch. Do not assume a specific answer.

Your plan should answer:

What is the controller?
Is there one controller globally, one per ticket, or both?
Which parts are scripts/state machines vs LLM agents?
Where does Hermes fit?
Where do Claude Code, Codex CLI, Gemini CLI, Kimi/GLM fit?
What should be persistent vs ephemeral?
How does a ticket advance from plan → build → review → test → validate → done?
How does Elmar interact with the system?
How are stuck tickets detected and recovered?
How would you implement this incrementally with least risk?

8. Expected Output From Council Member

Return a plan with:

Executive recommendation.
Architecture diagram in text form.
Component responsibilities.
Ticket lifecycle.
Runtime/session policy.
Deterministic script vs LLM decision boundary.
Implementation phases.
Risks/tradeoffs.
What you would not build.
Comparison to common multi-agent patterns if relevant.