Fix MC operator/pickup reliability — outside review findings
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 5d ago
Ticket is done; runtime is closed. · profile claude_opus_1m_medium
Description
MC-4831
Based on external Hermes review (2026-05-24): /home/lucienne/workspace/reports/mc-operator-outside-review-2026-05-24.md
Control plane is alive but not cleanly trustworthy. Priority fixes:
1. **Single-flight pickup lock** — `needs-input-pickup` + `ticket-pickup` both run every minute and compete. Add global lock or merge into one dispatcher.
2. **Make needs-input-pickup truly needs-input-only** — currently it calls `dispatch()` and handles all todo tickets, not just requeued needs_input. Either fix or remove.
3. **Fix MC-4122 policy mismatch** — pickup sees `lucienne`-assigned tickets as runnable, but claim policy rejects them. Logs misleadingly say "already claimed." Fix assignee/campaign-owner guard and logging.
4. **Fix worker-count fail-open** — `active_workers_by_db()` returns 0 on failure, meaning "no workers active" → dispatch proceeds with full slots. Should fail closed or return MAX_WORKERS.
5. **Orchestrator inbox needs durable action proof** — inbox items marked `processed` after delivery to luci-persistent, not after per-ticket decision/action. Split states: pending → delivered → acted. Require structured result before done.
6. **Move semantic gates out of operator** — operator directly changes ticket status to done/in_review based on keywords. Should create orchestrator inbox items instead, not directly close work tickets.
7. **Structured completion proof** — audit still uses broad keywords (verified/fixed/implemented). Require mc-coord signals, attempt id, commit hash, test/deploy evidence.
8. **Operator idempotency** — operator tickets can stay suppressed 4h after done while condition persists. Add recurrence tracking or reopen prior ticket.
9. **Stop operator from self-modifying tasks** — `luci_operator.py` re-enables `ticket-pickup` when backlog exists. Add explicit maintenance policy flag.
10. **Centralize ticket mutations** — operator + pickup do direct SQLite writes bypassing API lifecycle. Centralize status transitions through CAS-semantic API calls.
Full findings + evidence in report.
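Items 1 and 4 above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code that landed: the lock path, the `workers` table, its `state` column, and the `MAX_WORKERS` value are all hypothetical, and `fcntl.flock` is Unix-only.

```python
import fcntl
import os
import sqlite3
import tempfile
from contextlib import contextmanager

MAX_WORKERS = 4  # hypothetical slot limit, not taken from the real config


@contextmanager
def single_flight(lock_path):
    """Yield True if this caller holds the pickup lock, False if someone else does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: a competing pickup run fails fast.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def active_workers_fail_closed(db_path):
    """Count active workers; on any DB error report MAX_WORKERS (no free slots)."""
    try:
        # mode=ro makes a missing database an error instead of creating it.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            row = conn.execute(
                "SELECT COUNT(*) FROM workers WHERE state = 'active'"
            ).fetchone()
            return int(row[0])
        finally:
            conn.close()
    except sqlite3.Error:
        return MAX_WORKERS


def free_slots(db_path):
    """Slots available for dispatch; zero whenever the count cannot be trusted."""
    return max(0, MAX_WORKERS - active_workers_fail_closed(db_path))
```

A dispatcher would wrap its whole cycle in `with single_flight(path) as owner:` and return immediately when `owner` is false, so two per-minute tasks can coexist without competing for the same tickets.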
Acceptance:
- One pickup owner with lock
- `needs-input-pickup` handles only needs_input requeues
- Policy-skip logging distinct from claim conflict
- Worker counting fails closed
- Inbox items only marked processed after durable action
- Operator observes/alerts, orchestrator decides semantic gates
- Completion requires structured evidence
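The inbox and completion-proof criteria could be enforced with a compare-and-set transition along these lines. This is a sketch only: the `inbox` table, its columns, and the required proof keys are hypothetical names chosen for illustration, and the check covers key presence rather than the validity of each value.

```python
import json
import sqlite3

# Hypothetical proof fields, loosely following item 7 (attempt id, commit, evidence).
REQUIRED_PROOF_KEYS = {"attempt_id", "commit", "evidence"}
NEXT_STATE = {"pending": "delivered", "delivered": "acted"}


def advance(conn, item_id, expected, proof=None):
    """Move an inbox item from `expected` to its next state, compare-and-set style.

    Returns True only if the row was still in `expected` when the update ran,
    so a stale or concurrent writer gets False instead of overwriting state.
    """
    target = NEXT_STATE.get(expected)
    if target is None:
        raise ValueError(f"no transition out of {expected!r}")
    if target == "acted":
        # Only checks that the keys exist, not that the values are meaningful.
        missing = REQUIRED_PROOF_KEYS - set(proof or {})
        if missing:
            raise ValueError(f"missing proof fields: {sorted(missing)}")
    cur = conn.execute(
        "UPDATE inbox SET state = ?, proof = ? WHERE id = ? AND state = ?",
        (target, json.dumps(proof) if proof else None, item_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the `WHERE` clause names the expected state, delivery to luci-persistent can only move an item to `delivered`; reaching `acted` needs a second call that carries the structured result.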
Activity
Details: Done · High · Luci
MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
[visibility-only controller] Routed this Luci-owned technical control-plane ticket to internal Hermes Kanban card t_f55dfd05. Initial claudebuilder launch failed immediately on Anthropic extra-usage/quota, so I reclaimed/reassigned the same card. A codexbuilder retry launched from the wrong cwd and was safety-stopped before code work. I retargeted the same card to a pre-created isolated dir/worktree and dispatched fallback profile default; verified active worker PID 2158707 is running in `/home/lucienne/workspace/state/kanban-worktrees/mc-4831-operator-pickup-reliability`. MC remains visibility-only/waiting on `kanban_active:t_f55dfd05`; no MC runtime, pickup, send, claim, harvest, switch, or retry endpoint was used.
luci-board-manager · 5d ago
[visibility-only controller gate] I rejected the previous t_f55dfd05 completion as insufficient: it left dirty WIP in the isolated worktree and the controller narrow test failed (`python3.12 -m pytest tests/test_luci_operator_visibility_only.py tests/test_visibility_only_mode.py tests/test_task_definition_editor.py -q`: 1 failed, 28 passed). Routed a precise continuation through internal Kanban card t_49cbff3f (codexbuilder) in `/home/lucienne/workspace/state/kanban-worktrees/mc-4831-operator-pickup-reliability`; verified worker PID 2177597 cwd matches that isolated worktree. MC remains visibility-only/waiting on `kanban_active:t_49cbff3f`. No MC runtime/pickup endpoints were used.
luci-board-manager · 5d ago
[visibility-only controller gate] Independent review rejected internal Kanban card t_49cbff3f before landing: `api_update_task_definition()` used the regex/string `_parse_task_enabled_from_content(content)` for the visibility-only re-enable guard, so YAML typed truthy values such as `enabled: !!bool true` could still re-enable retired pickup tasks under YAML parsing. I commented the exact blocker on t_49cbff3f, unblocked it for codexbuilder continuation, and kept MC-4831 waiting on `kanban_active:t_49cbff3f`. Narrow tests before review were green (29/29), but this safety-boundary bypass blocks landing. No MC runtime/pickup endpoints were used.
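The class of bypass described in this review can be illustrated with a small sketch. Both functions below are hypothetical stand-ins: `naive_is_enabled` is not the real `_parse_task_enabled_from_content`, and `may_enable` is one conservative approach (treat anything other than a plain `false` as potentially enabling), not necessarily the fix that landed. Treating a missing `enabled` key as potentially enabling is also an assumption.

```python
import re

# Hypothetical naive check: recognises only a bare `true` / `false` literal.
NAIVE_ENABLED = re.compile(r"^enabled:\s*(true|false)\s*$", re.MULTILINE)
# Captures whatever follows a top-level `enabled:` key on the same line.
ENABLED_LINE = re.compile(r"^enabled:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def naive_is_enabled(content):
    """String-level check that misses tagged values such as `!!bool true`."""
    m = NAIVE_ENABLED.search(content)
    return bool(m) and m.group(1) == "true"


def may_enable(content):
    """Fail closed: True unless every `enabled:` line is a plain `false`.

    A missing key, a tagged value, or a trailing comment all count as
    potentially enabling, so the guard rejects them.
    """
    values = ENABLED_LINE.findall(content)
    if not values:
        return True
    return any(v.lower() != "false" for v in values)
```

With the content `enabled: !!bool true`, the naive check reports the task as not enabled, while the fail-closed guard flags it for rejection.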
luci-board-manager · 5d ago
[visibility-only controller gate] MC-4831 passed and is live. After rejecting the first handoff for the YAML-frontmatter truthy bypass, I verified the fix on `origin/kanban/mc-4831-operator-pickup-reliability` (`6c5169ee133cf102dd71a31f30feebdeac0cc375`), reran `PATH=/tmp/mc4831-py312/bin:$PATH python3.12 -m pytest tests/test_luci_operator_visibility_only.py tests/test_visibility_only_mode.py tests/test_task_definition_editor.py -q` (30 passed + 3 subtests), received independent review PASS, cherry-picked the branch onto current `origin/master` as `b43b952` + `3a45297070dbccef5d5b93ef31e8f11ffed10665`, pushed `origin/master`, fast-forwarded the live checkout, ran `graphify update .`, restarted `luci-dashboard.service`, and smoked live `http://127.0.0.1:3001/api/health`, `/api/v1/tickets?limit=1`, `/`, and `/board` successfully. No MC runtime/pickup/start/send/claim/harvest/switch/retry endpoints were used.