
MC-4236

Harden ticket runtime harvest timeout and long-prompt recovery
2026-06-13 08:48:32 SAST
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 18d ago
Ticket is done; runtime is closed. · profile codex · cwd /home/lucienne/workspace/mission-control · uptime 17d 23h · last activity 17d 20h ago

Description

Investigate and harden the runtime failure class exposed by MC-3919.

Observed MC-3919 failure evidence:
- Elmar added a follow-up comment requesting discrepancy review plus source snippets/screenshots.
- Claude runtime was not blocked by the task; it was stuck in TUI/stop-hook/harvest chrome: pane log showed repeated “Gitifying…” / “running stop hooks… 0/5 · 3m+” at ~100k tokens and no DONE/REVIEW/QUESTION or mc-coord signal.
- Background harvest timed out and app.py _start_ticket_runtime_harvest killed the runtime, then updated the ticket to status=blocked, pending_state=crashed, failure_reason=ticket_runtime_harvest_timeout.
- final_harvest_sweep preserved mostly terminal chrome, so MC could not infer a verdict.
- Controller reopened it, but pickup then hit a second failure: runtime/send returned “command too long”; the dispatcher tried to revert status but the transition gate rejected the revert, leaving an idle/in_progress runtime until the controller manually injected a shorter prompt.

Root-cause hypothesis to verify before fixing:
1. Harvest timeout path treats a recoverable no-verdict/stuck TUI condition as a blocked ticket instead of a self-healing runtime retry/requeue.
2. The send path still has at least one long-prompt/argv limit or payload-size path that can return “command too long” despite mc_tmux.send_input having load-buffer fallback; dispatch rollback is not force-safe after claim transitions.
3. Initial prompts for document-review tickets include enough description/comment/transcript text to exceed safe interactive injection limits and should be compacted/staged.

Acceptance criteria:
- Add focused regression tests around ticket_runtime/app harvest timeout behavior and long-prompt dispatch failure behavior.
- For timeout with no harvestable verdict and no provider/auth failure, do not leave ticket blocked on Elmar; park it in a Luci-owned recoverable state (todo or explicit runtime_recovery) with safe retry metadata, unless repeated attempts hit a configured threshold.
- Preserve useful transcript evidence but filter terminal-only chrome better; if sweep is only spinner/stop-hook chrome, record that clearly.
- Ensure runtime/send/pickup can recover from large initial prompts: use file/buffer staging or compact prompt fallback; never leave ticket in_progress with failed_to_inject command-too-long and no running turn.
- Make status rollback force-safe for dispatcher-owned technical failures so illegal_transition 422 does not strand the ticket.
- Add an operator comment pattern that names root cause, safe retry, and whether human input is needed.
- Validate against MC-3919-shaped fixture/log and a synthetic >32KB/large-comment prompt.
- Update docs/runtime-architecture-refresh.md if runtime lifecycle/status contract changes.

Do not disrupt the currently running MC-3919 content work; this is a separate control-plane hardening ticket.

Expected check-in: 2026-05-26T12:23+02:00, with root-cause findings or first fix status.

Additional acceptance from MC-4228 UI bug: browser ticket chat must not show a false “Send not confirmed; retry” after durable persistence/injection; initial send must auto-start/attach runtime or clearly queue, and duplicate resends must be prevented.
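
Illustrative sketch for the timeout and chrome-filter criteria above. This is not the existing app.py code. The names CHROME_PATTERNS, MAX_RECOVERY_ATTEMPTS, Disposition, is_chrome_only and classify_harvest_timeout are hypothetical, the threshold of 3 is an assumed value, and the chrome patterns are taken only from the two strings quoted from the MC-3919 pane log. The function assumes the caller has already established that the sweep contains no harvestable verdict.

```python
import re
from dataclasses import dataclass

# Hypothetical chrome patterns, based only on the strings quoted from MC-3919.
CHROME_PATTERNS = [
    re.compile(r"^Gitifying"),
    re.compile(r"^running stop hooks"),
]
MAX_RECOVERY_ATTEMPTS = 3  # assumed value; the ticket only says "configured threshold"


@dataclass
class Disposition:
    status: str          # "todo" (recoverable) or "blocked"
    owner: str           # who the ticket is parked on
    failure_reason: str
    chrome_only: bool    # recorded so the operator comment can state it clearly
    needs_human: bool


def is_chrome_only(sweep_lines):
    """True when the sweep is non-empty and every non-blank line is known chrome."""
    content = [line.strip() for line in sweep_lines if line.strip()]
    return bool(content) and all(
        any(pattern.search(line) for pattern in CHROME_PATTERNS) for line in content
    )


def classify_harvest_timeout(sweep_lines, attempts, provider_failure=False):
    """Decide where to park a ticket after a harvest timeout with no verdict."""
    chrome_only = is_chrome_only(sweep_lines)
    reason = "ticket_runtime_harvest_timeout"
    if not provider_failure and attempts < MAX_RECOVERY_ATTEMPTS:
        # Recoverable: keep it Luci-owned and retryable, not blocked on a human.
        return Disposition("todo", "Luci", reason, chrome_only, needs_human=False)
    # Provider/auth failure or retry budget exhausted: escalate to a human.
    return Disposition("blocked", "Elmar", reason, chrome_only, needs_human=True)
```

With an MC-3919-shaped sweep (only the “Gitifying…” and “running stop hooks…” lines) and attempts=0, this returns a Luci-owned todo disposition with chrome_only set, which could feed the operator comment pattern. Whether the recoverable state should be todo or an explicit runtime_recovery status is left open, as in the criteria.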

Activity

done
No activity yet