
Mission Control Worker Rearchitecture — tmux + Worktrees + xterm.js

Date: 2026-04-20
Author: Luci
Audience: Elmar (decision), Lucienne (sync)
Status: Plan, awaiting Elmar approval (A/B/C from Telegram)
Trigger: Elmar question 2026-04-20T04:20Z: "does our MC chat need a rethink, is tmux better, how do we compare to vibe-kanban?"


TL;DR

Replace the current "spawn claude -p per message, bake message into prompt" pattern with one long-lived claude --resume session per ticket, running inside its own tmux session and its own git worktree, and streamed to the browser via pipe-pane → tail -F → SSE → xterm.js. Keep the existing MC tables; add three columns and one sidecar log per ticket. ~4-5 days of focused work (the six phases below sum to 4.5 days), shippable in phases.

This is the right shape for a single-operator system on one Linux box. We are deliberately not porting to vibe-kanban's Rust/portable-pty/WebSocket architecture — that's optimised for cross-platform distribution we don't need.


What we have today

(Full audit in conversation transcript 2026-04-20. Key facts:)

What vibe-kanban does (for context)

BloopAI/vibe-kanban — Rust + React, 25k stars. Each card = git worktree + portable-pty PTY + WebSocket + xterm.js + SQLite (Workspace/Session/ExecutionProcess tables). Cross-platform shipped product (Mac/Win/Linux, npx-installable).

We steal two ideas, we don't port the architecture:

  1. Worktree per ticket: git worktree add ../wt/ticket-N -b ticket-N. Parallel isolation, no working-tree conflicts between concurrent workers.
  2. Stream-json sidecar: claude --output-format stream-json written to a separate file for structured tool-call/cost/todo parsing, alongside the human-readable terminal pane.

Why tmux (not portable-pty)

| What we need | tmux gives us | portable-pty gives us |
| --- | --- | --- |
| Survives MC restart | ✅ Free | ❌ Need supervisor |
| Survives SSH disconnect | ✅ Free | ❌ Need daemon |
| tmux attach -t ticket-42 from SSH at 2am | Killer feature | ❌ No equivalent |
| TUI/permission-prompt rendering | ✅ Real PTY | ✅ Real PTY |
| Ship cost from current code | Low (shell-out) | High (Python PTY library + IPC) |
| Cross-platform (Mac/Windows) | ❌ Linux/macOS only | ✅ |
| Structured event model | ❌ Parse stream-json sidecar | ❌ Same |

We're on one Hetzner box. Cross-platform doesn't matter. SSH-attach debug at 2am does.


Architecture (target state)

┌─────────────────┐         ┌──────────────────────────────────┐
│  Browser        │         │  Hetzner (Luci)                  │
│  (xterm.js)     │◀── SSE ─│                                  │
│                 │         │  ┌───────────────────────────┐   │
│  Send box ──────┼── POST ─▶  │ Flask MC (port 3001)      │   │
└─────────────────┘         │  │  /api/ticket/N/input ─────┼──┐│
                            │  │  /api/ticket/N/stream ◀───┼─┐││
                            │  └───────────────────────────┘ │││
                            │                                │││
                            │  tmux server (system)          │││
                            │  ┌───────────────────────────┐ │││
                            │  │ session: ticket-42        │ │││
                            │  │  └─ claude --resume <uuid>│ │││
                            │  │      cwd: ../wt/ticket-42 │ │││
                            │  │      stdout ──────────────┼─┼┼┘
                            │  │  pipe-pane -o ▶ /var/log/ │ │└── tail -F
                            │  │     mc/ticket-42.log      │ └─ stream-json
                            │  │  --output-format          │    sidecar:
                            │  │     stream-json ▶ sidecar │    /var/log/mc/
                            │  └───────────────────────────┘    ticket-42.json
                            │                                  │
                            │  git worktrees:                  │
                            │  ../wt/ticket-42 (branch ticket-42) │
                            │  ../wt/ticket-43 (branch ticket-43) │
                            └──────────────────────────────────┘

Lifecycle of a ticket

  1. Create ticket N → MC runs git worktree add ../wt/ticket-N -b ticket-N, starts the worker with tmux new-session -d -s ticket-N -c ../wt/ticket-N 'claude --resume <uuid> --output-format stream-json | tee /var/log/mc/ticket-N.json', then runs tmux pipe-pane -t ticket-N -o 'cat >> /var/log/mc/ticket-N.log'.
  2. User opens ticket page → the SSE endpoint runs tail -F /var/log/mc/ticket-N.log, replays the last 10 KB, then streams new bytes. xterm.js renders them.
  3. User types message + Enter → POST to /api/ticket/N/input → MC runs tmux send-keys -t ticket-N -l "<message>"; tmux send-keys -t ticket-N Enter. Claude (already running) sees stdin, processes, streams response back into the same pane → file → SSE → browser.
  4. MC restarts → tmux session keeps running, claude keeps running, log file keeps growing. New MC reconnects on next browse.
  5. Server reboots → systemd restarts tmux-mc.service (a tmux user-server unit) which restores empty tmux state; MC's startup task replays per-ticket tmux new-session for any ticket with status=in_progress (worker resumes from --resume <uuid>).
  6. Ticket done → MC runs tmux kill-session -t ticket-N, archives sidecar logs, optionally git worktree remove.
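The shell-outs in the lifecycle above can be sketched as a few pure command builders. This is a minimal sketch, not the eventual mc_tmux module: the function names (spawn_commands, input_commands, done_commands, run_all) are hypothetical, and the paths and flags are copied from the steps above. Whether claude accepts --resume and --output-format stream-json together inside a pane is something the Phase 1 spike would need to confirm.

```python
import shlex
import subprocess
from pathlib import PurePosixPath

LOG_DIR = PurePosixPath("/var/log/mc")  # log location named in the plan
WT_ROOT = PurePosixPath("../wt")        # worktree root named in the plan


def spawn_commands(ticket_id: int, session_uuid: str) -> list[list[str]]:
    """Step 1: argv lists for worktree, tmux session, and pane logging."""
    name = f"ticket-{ticket_id}"
    worktree = str(WT_ROOT / name)
    # tmux hands this string to a shell, so the UUID is quoted defensively.
    worker = (
        f"claude --resume {shlex.quote(session_uuid)}"
        f" --output-format stream-json | tee {LOG_DIR}/{name}.json"
    )
    return [
        ["git", "worktree", "add", worktree, "-b", name],
        ["tmux", "new-session", "-d", "-s", name, "-c", worktree, worker],
        ["tmux", "pipe-pane", "-t", name, "-o", f"cat >> {LOG_DIR}/{name}.log"],
    ]


def input_commands(ticket_id: int, message: str) -> list[list[str]]:
    """Step 3: send the body literally (-l), then Enter as its own command."""
    name = f"ticket-{ticket_id}"
    return [
        ["tmux", "send-keys", "-t", name, "-l", message],
        ["tmux", "send-keys", "-t", name, "Enter"],
    ]


def done_commands(ticket_id: int, remove_worktree: bool = False) -> list[list[str]]:
    """Step 6: kill the session and, optionally, remove the worktree."""
    name = f"ticket-{ticket_id}"
    commands = [["tmux", "kill-session", "-t", name]]
    if remove_worktree:
        commands.append(["git", "worktree", "remove", str(WT_ROOT / name)])
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each argv without a shell; stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Because every command is an argv list passed to subprocess.run without shell=True, the user's message never passes through a shell on the MC side, which is what makes the "no shell escaping" approach to send-keys workable.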

Structured events (sidecar)

The tmux pane is for humans. The stream-json sidecar (/var/log/mc/ticket-N.json) is for MC. A small parser tails the JSON file and writes:

  - tool_use events → existing ticket_events table (we already have this from MC-747)
  - result events → mark turn complete, broadcast turn_end SSE
  - cost_usd deltas → ticket cost counter

This keeps the structured-event model we already built; we're just changing how the events arrive (file tail vs subprocess pipe).
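A rough sketch of that parser follows. It assumes the sidecar holds one JSON object per line with a top-level "type" field and an optional numeric "cost_usd" field, as described above; the real stream-json schema may nest these differently, so treat the field names as assumptions. TicketState is a hypothetical in-memory stand-in for the ticket_events table and the cost counter.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TicketState:
    """Hypothetical stand-in for the DB rows the real parser would write."""
    tool_events: list = field(default_factory=list)
    turns_completed: int = 0
    cost_usd: float = 0.0


def apply_line(state: TicketState, line: str) -> None:
    """Route one sidecar line according to its (assumed) "type" field."""
    line = line.strip()
    if not line:
        return
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return  # non-JSON noise in the file is skipped, not fatal
    if not isinstance(event, dict):
        return
    kind = event.get("type")
    if kind == "tool_use":
        state.tool_events.append(event)  # would become a ticket_events row
    elif kind == "result":
        state.turns_completed += 1       # would also broadcast turn_end
    cost = event.get("cost_usd")
    if isinstance(cost, (int, float)) and not isinstance(cost, bool):
        state.cost_usd += cost           # treated as a delta, per the plan


def feed(state: TicketState, pending: str, chunk: str) -> str:
    """Apply every complete line in pending + chunk; return the unfinished tail."""
    *complete, rest = (pending + chunk).split("\n")
    for line in complete:
        apply_line(state, line)
    return rest
```

The feed helper matters because a file tail can hand over half a line; the caller keeps the returned remainder and passes it back in with the next chunk.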


Data model changes

Add to tickets table:

  - tmux_session TEXT: tmux session name (usually ticket-{id}, NULL if not yet spawned)
  - worktree_path TEXT: absolute path to git worktree, NULL for tickets without code work
  - pane_log_path TEXT: absolute path to pipe-pane log file

Keep existing session_id (claude UUID), worker_pid (now points to claude process inside tmux), worker_started, status.

Drop after migration:

  - Nothing yet. Keep queued_messages for now: it becomes the buffer for messages typed while a previous turn is mid-stream (we send them when claude returns to prompt, not as fresh subprocesses).
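The column additions could be applied with an idempotent migration along these lines. This sketch assumes the MC database is SQLite and that the table is named tickets; the plan does not state the engine, so adapt the introspection query if it is something else. The function name add_tmux_columns is hypothetical.

```python
import sqlite3

# Column names and types taken from the list above.
NEW_COLUMNS = {
    "tmux_session": "TEXT",
    "worktree_path": "TEXT",
    "pane_log_path": "TEXT",
}


def add_tmux_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to tickets; return the names actually added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(tickets)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE tickets ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

New columns default to NULL, which matches the "NULL if not yet spawned" convention, and running the migration twice is a no-op.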


Implementation phases

Phase 1 — Spike (1 day)

Goal: Prove the loop end-to-end on one ticket, no UI changes.

Phase 2 — SSE + xterm.js for one ticket (1 day)

Goal: Browser sees live tmux pane.
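One possible shape for the stream endpoint's generator is sketched below. It is a pure-Python polling stand-in for tail -F (it does not follow log rotation), it replays the last 10 KB before streaming, and it closes the file in a finally block so a disconnected client does not leak a handle. Base64-encoding each chunk is an assumption made here so that raw terminal bytes fit inside a single SSE data line; the browser would decode before writing to xterm.js. The names stream_log, sse_event and should_stop are hypothetical.

```python
import base64
import os
import time
from typing import Callable, Iterator

REPLAY_BYTES = 10 * 1024  # "replays last 10 KB"


def sse_event(chunk: bytes) -> str:
    """Wrap raw pane bytes as one SSE message (base64 keeps it on one line)."""
    payload = base64.b64encode(chunk).decode("ascii")
    return f"data: {payload}\n\n"


def stream_log(
    path: str,
    poll: float = 0.25,
    should_stop: Callable[[], bool] = lambda: False,
) -> Iterator[str]:
    """Yield the tail of the pane log, then new bytes as they are appended."""
    f = open(path, "rb")
    try:
        size = os.fstat(f.fileno()).st_size
        f.seek(max(0, size - REPLAY_BYTES))  # start at most 10 KB from the end
        while True:
            chunk = f.read(65536)
            if chunk:
                yield sse_event(chunk)
            elif should_stop():
                return
            else:
                time.sleep(poll)  # nothing new yet; wait and re-check
    finally:
        f.close()  # runs when the generator is closed or garbage-collected
```

In Flask this generator could be returned as Response(stream_log(path), mimetype="text/event-stream"); the should_stop hook is where an idle timeout would plug in.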

Phase 3 — Worktree-per-ticket (0.5 day)

Goal: Code-work tickets get isolated branches.

Phase 4 — Reconnect-on-boot (0.5 day)

Goal: Server reboots don't orphan tickets.
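The startup task for this phase might look like the sketch below: list live tmux sessions, compare against tickets with status=in_progress, and return the session names that need a fresh tmux new-session. The ticket rows are assumed to be dicts with id, status and tmux_session keys, and the function names are hypothetical.

```python
import subprocess


def live_tmux_sessions() -> set[str]:
    """Names of running tmux sessions; empty if no tmux server is up."""
    proc = subprocess.run(
        ["tmux", "list-sessions", "-F", "#{session_name}"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:  # tmux exits non-zero when no server is running
        return set()
    return {line for line in proc.stdout.splitlines() if line}


def sessions_to_respawn(tickets: list[dict], live: set[str]) -> list[str]:
    """Session names for in_progress tickets that have no live tmux session."""
    missing = []
    for ticket in tickets:
        if ticket["status"] != "in_progress":
            continue
        # Fall back to the ticket-{id} convention when the column is NULL.
        name = ticket.get("tmux_session") or f"ticket-{ticket['id']}"
        if name not in live:
            missing.append(name)
    return missing
```

Keeping the comparison separate from the tmux call makes the decision logic testable without a tmux server, and the same function can run after an MC restart, where it should return an empty list because the sessions survived.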

Phase 5 — Migrate existing dispatchers (1 day)

Goal: mc_interactive.py and mc_pickup.py route through mc_tmux instead of spawning per-message subprocesses.

Phase 6 — Polish (0.5 day)

Total: ~4-5 days of focused work, shippable phase-by-phase.


Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| send-keys race on rapid bursts | Use -l literal flag, send body and Enter as separate commands, no shell escaping |
| ANSI escape parsing in browser | xterm.js handles all escape codes natively, no custom parser needed |
| tail -F SSE leaks file handles on disconnect | Close file in SSE generator's finally block; cap idle SSE clients at 5 min like the current WebSocket |
| Log files grow forever | Logrotate /var/log/mc/*.log daily, keep 7 days. Sidecar JSON archived to ~/workspace/reports/sessions/ on ticket close |
| tmux server crash kills all sessions | Run tmux as systemd user-service with Restart=always. Workers re-spawn on boot from Phase 4. |
| Two pollers on same Telegram bot (409 Conflict) | Unrelated; the existing --settings settings-worker.json gate still applies and tmux doesn't change this |
| Worktree disk usage | Cap at N=10 active worktrees; older ones archived. Monitor in heartbeat. |
| Migration breaks existing in-flight tickets | Phase 5 ships behind a per-ticket flag (use_tmux=true); roll forward gradually, leave old path as fallback for 1 week |

What we're explicitly not doing


Files that will change


Sources


Decision needed from Elmar

A) Approve full plan: I create MC tickets MC-NNN through MC-NNN+5 (one per phase) and start Phase 1 tonight
B) Approve Phase 1 only as a POC: ship the spike, you review, then commit to the rest
C) Adjust scope (e.g. drop worktrees, drop xterm.js for plain <pre>): tell me what to cut
D) Park: current MC is good enough