Harden ticket runtime harvest timeout and long-prompt recovery
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 18d ago
Ticket is done; runtime is closed. · profile codex · cwd /home/lucienne/workspace/mission-control · uptime 17d 23h · last activity 17d 20h ago
Description
MC-4236
Investigate and harden the runtime failure class exposed by MC-3919.
Observed MC-3919 failure evidence:
- Elmar added a follow-up comment requesting discrepancy review plus source snippets/screenshots.
- Claude runtime was not blocked by the task; it was stuck in TUI/stop-hook/harvest chrome: pane log showed repeated “Gitifying…” / “running stop hooks… 0/5 · 3m+” at ~100k tokens and no DONE/REVIEW/QUESTION or mc-coord signal.
- Background harvest timed out and app.py _start_ticket_runtime_harvest killed the runtime, then updated the ticket to status=blocked, pending_state=crashed, failure_reason=ticket_runtime_harvest_timeout.
- final_harvest_sweep preserved mostly terminal chrome, so MC could not infer a verdict.
- Controller reopened it, but pickup then hit a second failure: runtime/send returned “command too long”; the dispatcher tried to revert status but transition gate rejected the revert, leaving an idle/in_progress runtime until controller manually injected a shorter prompt.
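The second failure above (a rejected revert that strands the ticket) can be illustrated with a small transition gate. This is a minimal sketch, not the MC implementation: the `ALLOWED` table, the `Ticket` class, the `transition` and `dispatch` helpers, and the `force_reason` parameter are all hypothetical names chosen for illustration. It only shows the idea that a dispatcher-owned technical failure may bypass the gate when it records a reason.

```python
from dataclasses import dataclass, field

# Hypothetical transition table; the real MC gate rules are not shown in this ticket.
ALLOWED = {
    ("todo", "in_progress"),
    ("in_progress", "review"),
    ("in_progress", "blocked"),
    ("review", "done"),
}


class IllegalTransition(Exception):
    """Stands in for the gate's illegal_transition 422 response."""


@dataclass
class Ticket:
    status: str = "todo"
    history: list = field(default_factory=list)


def transition(ticket, new_status, *, actor, force_reason=None):
    """Apply a status change; only the dispatcher may force, and only with a reason."""
    legal = (ticket.status, new_status) in ALLOWED
    forced = actor == "dispatcher" and force_reason is not None
    if not (legal or forced):
        raise IllegalTransition(f"{ticket.status} -> {new_status}")
    ticket.history.append((ticket.status, new_status, actor, force_reason))
    ticket.status = new_status


def dispatch(ticket, send):
    """Claim the ticket, try to inject the prompt, and requeue on technical failure."""
    transition(ticket, "in_progress", actor="dispatcher")
    try:
        send()
    except RuntimeError as exc:
        # in_progress -> todo is not in ALLOWED, so an unforced revert would raise.
        transition(ticket, "todo", actor="dispatcher",
                   force_reason=f"failed_to_inject: {exc}")
        return False
    return True
```

Recording the reason in `history` keeps the forced rollback auditable, which is the property a human reviewer would need when the gate is bypassed.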
Root-cause hypothesis to verify before fixing:
1. Harvest timeout path treats a recoverable no-verdict/stuck TUI condition as a blocked ticket instead of a self-healing runtime retry/requeue.
2. The send path still has at least one long-prompt/argv limit or payload-size path that can return “command too long” despite mc_tmux.send_input having load-buffer fallback; dispatch rollback is not force-safe after claim transitions.
3. Initial prompts for document-review tickets include enough description/comment/transcript text to exceed safe interactive injection limits and should be compacted/staged.
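Hypotheses 2 and 3 both come down to keeping the prompt body out of the command line. The sketch below is an illustration under assumptions, not the code in mc_tmux.send_input: `plan_injection`, `SAFE_INLINE_BYTES`, the buffer name, and the 4096-byte threshold are hypothetical. It only builds the tmux argument lists (`send-keys -l` for short prompts, `load-buffer` plus `paste-buffer` for long ones) so the choice can be inspected without a running tmux server.

```python
import tempfile
from pathlib import Path

# Assumed threshold; the real safe limit depends on the tmux version and OS argv limits.
SAFE_INLINE_BYTES = 4096


def plan_injection(prompt, target, staging_dir=None):
    """Return the tmux commands (as argv lists) needed to inject `prompt` into `target`."""
    data = prompt.encode("utf-8")
    if len(data) <= SAFE_INLINE_BYTES:
        # Short prompt: pass it literally as a single argument.
        return [["tmux", "send-keys", "-t", target, "-l", "--", prompt]]

    # Long prompt: stage it in a file so no argument carries the body.
    if staging_dir is None:
        staging_dir = tempfile.mkdtemp(prefix="mc-prompt-")
    staged = Path(staging_dir) / "prompt.txt"
    staged.write_bytes(data)
    buffer_name = "mc-prompt"  # hypothetical buffer name
    return [
        ["tmux", "load-buffer", "-b", buffer_name, str(staged)],
        # -d deletes the buffer after pasting so stale prompts do not linger.
        ["tmux", "paste-buffer", "-d", "-b", buffer_name, "-t", target],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order. Returning a plan rather than executing it keeps the size decision testable in isolation.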
Acceptance criteria:
- Add focused regression tests around ticket_runtime/app harvest timeout behavior and long-prompt dispatch failure behavior.
- For timeout with no harvestable verdict and no provider/auth failure, do not leave ticket blocked on Elmar; park it in a Luci-owned recoverable state (todo or explicit runtime_recovery) with safe retry metadata, unless repeated attempts hit a configured threshold.
- Preserve useful transcript evidence but filter terminal-only chrome better; if sweep is only spinner/stop-hook chrome, record that clearly.
- Ensure runtime/send/pickup can recover from large initial prompts: use file/buffer staging or compact prompt fallback; never leave ticket in_progress with failed_to_inject command-too-long and no running turn.
- Make status rollback force-safe for dispatcher-owned technical failures so illegal_transition 422 does not strand the ticket.
- Add an operator comment pattern that names root cause, safe retry, and whether human input is needed.
- Validate against MC-3919-shaped fixture/log and a synthetic >32KB/large-comment prompt.
- Update docs/runtime-architecture-refresh.md if runtime lifecycle/status contract changes.
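The timeout and chrome-filtering criteria above could be sketched as two small functions. This is a hedged illustration, not the shipped app.py logic: `classify_sweep`, `on_harvest_timeout`, the chrome patterns, the default `max_attempts=3`, and the returned field names are assumptions. The chrome strings are taken from the MC-3919 pane log quoted in this ticket.

```python
import re

# Chrome seen in the MC-3919 pane log; a real filter would likely need more patterns.
CHROME_PATTERNS = [
    re.compile(r"Gitifying"),
    re.compile(r"running stop hooks"),
]
VERDICT_RE = re.compile(r"^(DONE|REVIEW|QUESTION):", re.MULTILINE)


def classify_sweep(text):
    """Label a final harvest sweep as verdict, empty, chrome_only, or transcript."""
    if VERDICT_RE.search(text):
        return "verdict"
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    if all(any(p.search(ln) for p in CHROME_PATTERNS) for ln in lines):
        return "chrome_only"
    return "transcript"


def on_harvest_timeout(sweep_text, attempts, max_attempts=3, provider_failure=False):
    """Decide where to park the ticket after a harvest timeout."""
    kind = classify_sweep(sweep_text)
    if kind == "verdict":
        return {"action": "apply_verdict", "sweep": kind}
    if provider_failure or attempts >= max_attempts:
        # Only escalate to a human after a provider/auth failure or repeated attempts.
        return {"action": "park", "status": "blocked", "sweep": kind,
                "needs_human": True}
    return {"action": "park", "status": "todo", "pending_state": "runtime_recovery",
            "sweep": kind, "retry_attempt": attempts + 1, "needs_human": False}
```

The `sweep` and `needs_human` fields map onto the operator comment pattern in the criteria: they state what the sweep contained and whether human input is required.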
Do not disrupt the currently running MC-3919 content work; this is a separate control-plane hardening ticket.
Expected check-in: 2026-05-26T12:23+02:00 — root-cause findings or first fix status.
Additional acceptance from MC-4228 UI bug: browser ticket chat must not show a false “Send not confirmed; retry” after durable persistence/injection; initial send must auto-start/attach runtime or clearly queue, and duplicate resends must be prevented.
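The duplicate-resend requirement can be pictured as an idempotent send keyed on a client-generated token. The sketch below is an assumption-laden illustration and not the MC chat API: `SendLedger`, `client_key`, and the response fields are hypothetical. It shows only the principle that a retry of an already persisted message returns the original message id instead of injecting a second copy.

```python
class SendLedger:
    """In-memory stand-in for durable message persistence (hypothetical)."""

    def __init__(self):
        self._seen = {}      # (ticket_id, client_key) -> message_id
        self._bodies = {}    # message_id -> body
        self._next_id = 1

    def submit(self, ticket_id, client_key, body):
        """Persist a message once; a retry with the same key is acknowledged, not re-sent."""
        key = (ticket_id, client_key)
        if key in self._seen:
            return {"message_id": self._seen[key], "state": "already_persisted",
                    "injected_again": False}
        message_id = self._next_id
        self._next_id += 1
        self._seen[key] = message_id
        self._bodies[message_id] = body
        # Acknowledge right after persistence; runtime start/attach can happen afterwards.
        return {"message_id": message_id, "state": "queued", "injected_again": False}

    def count(self):
        return len(self._bodies)
```

With this shape, the browser can show a "queued" or "starting runtime" state as soon as the first response arrives, and a later retry with the same `client_key` cannot create a second message.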
Activity
Details: Done · High · Luci
[failed_to_inject] runtime_busy: Ticket runtime is not ready for input (status=running). Wait for the current turn to finish, queue the message, or explicitly interrupt/restart the runtime before sending more terminal i
Ticket picked up by Luci via MC dispatcher.
MC-4236: Harden ticket runtime harvest timeout and long-prompt recovery
Work this ticket in the live tmux runtime. Use DONE:, REVIEW:, or QUESTION: when you need MC to reflect the next state.
Luci · 18d ago
MC-4236 shipped: harvest timeouts now recover/requeue with structured comments until retry limit, chrome-only sweeps are classified, large prompts stage safely and tmux send falls back to paste-buffer, dispatcher send/spawn failures force-safe requeue without stranding tickets. Commits pushed: mission-control 87d76d9, luci-workspace e7e97f50.
luci · 18d ago
Workflow action `review_ready` applied. Work ready for review.
api · 18d ago
Closed (orchestrator). Both commits on master + tests green:
- mission-control 87d76d9 (+764/-150): app.py harvest classification, ticket_runtime.py recovery path, mc_tmux.py paste-buffer fallback for large prompts, runtime-architecture-refresh.md updated. 42/42 tests pass (test_mc4236_runtime_recovery + test_mc3482_recovery_contract + test_mc_tmux).
- workspace e7e97f50 (+303/-12): mc_pickup.py dispatcher force-safe requeue, +130 lines test_mc4236_pickup_recovery, 3/3 pass.
ACs covered: harvest timeout recoverable-state requeue ✓, terminal-chrome classification ✓, paste-buffer fallback for command-too-long ✓, dispatcher force-safe requeue (no illegal-transition 422 stranding) ✓, root-cause/safe-retry operator comment pattern ✓, MC-3919-shaped fixture in tests ✓, runtime-architecture-refresh.md updated ✓.
Addendum AC (MC-4228 UI bug — false 'Send not confirmed' after durable injection): separate browser/chat UI fix; not in this commit set. Recommend follow-up ticket if Elmar wants the UI ack/state hardened.
UA sign-off (runbooks/ua-signoff-gate.md): n/a — backend hardening, no user-facing surface change.
Verdict: PASS
luci-controller · 18d ago
[created · 2026-05-26T11:37:57+02:00] Created from Elmar request after MC-3919 runtime failure. Investigate first; fix only after root cause is proven. Keep MC-3919 running separately.
luci-controller · 18d ago
[controller ledger · 2026-05-26T11:38:59+02:00] Expected check-in set to 2026-05-26T12:23+02:00 for root-cause findings / first fix status.
luci-controller · 18d ago
[controller addendum · 2026-05-26T12:14:43+02:00] MC ticket UI bug confirmed from MC-4228 screenshot/ledger: browser chat showed 'Send not confirmed; please retry' even though the first message was durably persisted and injected (ticket_messages 36004 at 10:08:48). User manually started session and resent, producing duplicate injected message 36005 at 10:09:32. Harden acceptance: runtime send must either ack/queue immediately after durable persistence with clear 'starting runtime/sending' state, or disable duplicate retry; first send should auto-start/attach runtime without requiring manual Start session, and false negative ack timeout must not ask the user to resend an already-injected message.