Context: Elmar asked whether Hermes Luci can review the existing Mission Control (MC) operator/pickup setup from outside MC before adding another scheduled task. This review was performed from the Hermes Telegram session, not by the MC ticket worker. No files, DB rows, tasks, or services were modified during the review. The one exception is a clarifying comment added to the already-cancelled MC-4140, stating that no duplicate scheduled gatekeeper should be built.
The control plane is alive, but not yet trustworthy enough to answer “is it working?” with a clean yes.
Working evidence exists: scheduler, dashboard, ccgram, ticket-pickup, needs-input-pickup, luci-operator, orchestrator inbox drain, and runtime/ticket claims are all running and have recent successful runs.
The main problem is that several green paths are only green mechanically. They can skip, churn, race, or mark queue items processed before a durable ticket decision happens. This creates the exact uncertainty Elmar noticed.
luci-dashboard.service and ccgram.service are active.
127.0.0.1:3001 and 100.118.207.3:3001.
luci-operator is enabled and runs every 30 minutes via tasks/luci-operator.md:
cd /home/lucienne/workspace/mission-control && python3 luci_operator.py --allow-dev-loop
claude_opus_1m_high
luci-operator: completed at 21:30, duration ~49s
ticket-pickup: completed at 21:42, duration ~0.33s
needs-input-pickup: completed at 21:42, duration ~0.32s
orchestrator-board-sweep: completed at 21:20
triage-untriaged: completed every minute with clean no-op outputs

needs-input-pickup is not actually needs-input-only

tasks/needs-input-pickup.md says it is a fast pickup for needs_input tickets.
But the code path calls normal dispatch() and dispatch_larry(), which means it also scans normal todo tickets. So every minute we effectively run two pickup dispatchers:
ticket-pickup
needs-input-pickup
CAS prevents both from claiming the same ticket, but it does not eliminate noisy races or slot-accounting risk.
Fix direction: make --needs-input-only truly only process requeued needs-input work, or remove/rename it and use one dispatcher.
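A minimal sketch of what a strict needs-input-only filter could look like. The `Ticket` shape, the `requeued_from_needs_input` flag, and `select_for_dispatch` are hypothetical names used for illustration; the real internals of dispatch() and dispatch_larry() were not part of this review.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical ticket shape; the real MC schema is not shown in this review.
    key: str
    status: str
    requeued_from_needs_input: bool = False


def select_for_dispatch(tickets, needs_input_only=False):
    """Return the tickets one dispatcher run should consider.

    With needs_input_only=True, plain todo tickets are left to ticket-pickup,
    so the two scheduled tasks no longer scan the same set.
    """
    todo = [t for t in tickets if t.status == "todo"]
    if needs_input_only:
        return [t for t in todo if t.requeued_from_needs_input]
    return todo
```

With a filter like this, the fast path would only ever see requeued work, and the normal path would stay the single owner of ordinary todo tickets.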
MC-4122 is a live green-but-noisy no-op loop

Current symptom:
Found 1 MC worker todo ticket(s) then Skipping MC-4122 — already claimed by another dispatcher
ticket-pickup and needs-input-pickup
Actual root cause appears to be policy mismatch, not “already claimed”:
MC-4122 is todo, assigned to lucienne, with campaign_owner=lucienne
mc_pickup.py includes lucienne in runnable worker assignees and maps it to luci
models.claim_ticket() rejects the claim due to campaign-owner/assignee guard
Fix direction: exclude controller-parked / campaign_owner=lucienne tickets from pickup, or make the rejection reason explicit and non-noisy.
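One way to make the rejection reason explicit is to separate "parked by policy" from "claimed by someone else" before logging. This is a sketch only: `SkipReason`, `skip_reason`, and the dict-shaped ticket are hypothetical, and treating campaign_owner=lucienne as "controller-parked" is taken from the finding above rather than from the actual guard in models.claim_ticket().

```python
from enum import Enum


class SkipReason(Enum):
    ALREADY_CLAIMED = "already claimed by another dispatcher"
    CONTROLLER_PARKED = "parked: campaign owner is the controller"


CONTROLLER = "lucienne"  # assumption based on the MC-4122 finding


def skip_reason(ticket, claimed_by=None):
    """Return why a ticket is skipped, or None if pickup may try to claim it."""
    # Policy check first, so a parked ticket is never reported as a claim race.
    if ticket.get("campaign_owner") == CONTROLLER:
        return SkipReason.CONTROLLER_PARKED
    if claimed_by is not None:
        return SkipReason.ALREADY_CLAIMED
    return None
```

A parked ticket could then be filtered out before the "Found N ticket(s)" count, or logged once at a lower level instead of every minute.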
mc_pickup.py runnable assignees include:
luci, lucienne, tessa, scott, atlas, council
luci_operator.py runnable assignees include:
luci, larry, tessa
This difference is one reason MC-4122 looks runnable to one path and parked/non-runnable to another.
Fix direction: define one canonical runnable-owner policy used by operator, pickup, triage, and UI.
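The drift between the two lists can be made visible with a small set comparison. The two sets below are copied from the findings above; `policy_drift` is a hypothetical helper, and which names belong in the canonical policy is still an open decision.

```python
# Assignee lists as reported in this review.
PICKUP_ASSIGNEES = {"luci", "lucienne", "tessa", "scott", "atlas", "council"}
OPERATOR_ASSIGNEES = {"luci", "larry", "tessa"}


def policy_drift(pickup, operator):
    """Return the names each path treats as runnable that the other does not."""
    return {
        "only_pickup": sorted(pickup - operator),
        "only_operator": sorted(operator - pickup),
    }


drift = policy_drift(PICKUP_ASSIGNEES, OPERATOR_ASSIGNEES)
```

A check like this could run as a test once both modules import the same shared constant, so any future divergence fails loudly instead of surfacing as a confusing skip.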
ticket-pickup and needs-input-pickup run every minute. luci_operator.py can also run pickup directly via _run_pickup_once().
The operator-run pickup command does not set MC_POOL_ENABLED=1, unlike the scheduled pickup task. That creates different runtime behavior depending on who triggered pickup.
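A sketch of how both triggers could share one environment builder. `pickup_env` is a hypothetical helper, and whether the pool should be enabled for operator-run pickup at all is an assumption that needs a decision; the point is only that the setting comes from one place.

```python
import os


def pickup_env(base=None):
    """Build the environment for any pickup invocation, scheduled or operator-run."""
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base is None else base)
    env["MC_POOL_ENABLED"] = "1"
    return env
```

An operator-side call would then pass `env=pickup_env()` to whatever subprocess call launches pickup, matching what the scheduled task sets.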
Fix direction:
active_workers_by_db() appears fail-open

The code review found that if worker counting fails badly, the fallback returns 0 active workers. That means “full capacity available,” not “stop dispatch.”
The comment reportedly says “return 0 (don’t dispatch),” but operationally 0 means dispatch more.
Fix direction: on worker-count failure, return MAX_WORKERS or block dispatch with a loud error.
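A fail-closed version could look roughly like this. The wrapper name, the `count_workers` callable, and the `MAX_WORKERS` value are placeholders; the real active_workers_by_db() signature and the real cap were not inspected here.

```python
import logging

MAX_WORKERS = 4  # placeholder value; the real cap lives in the MC config

log = logging.getLogger("mc_pickup_sketch")


def active_workers_fail_closed(count_workers, max_workers=MAX_WORKERS):
    """Return the active worker count, or max_workers if counting fails."""
    try:
        return count_workers()
    except Exception:
        # Loud error, and report "full" so no new work is dispatched.
        log.exception("worker count failed; blocking dispatch")
        return max_workers


def free_slots(active, max_workers=MAX_WORKERS):
    """Number of workers that may still be started."""
    return max(0, max_workers - active)
```

With this shape, a counting failure yields zero free slots, which is what the existing comment apparently intended.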
Current design risk:
REVIEW: or QUESTION:
This directly explains why “the task says it ran” does not prove that MC actually made the right gate decision.
Fix direction: split inbox state into pending -> delivered -> acted/failed, and mark complete only after a structured per-ticket action is written.
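The proposed state split can be sketched as a small transition table. `InboxState`, `ALLOWED`, and `advance` are hypothetical names, and the shape of the per-ticket action record is an assumption; only the pending -> delivered -> acted/failed ordering comes from the fix direction above.

```python
from enum import Enum


class InboxState(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    ACTED = "acted"
    FAILED = "failed"


# Legal transitions: an item cannot jump from pending straight to acted.
ALLOWED = {
    InboxState.PENDING: {InboxState.DELIVERED},
    InboxState.DELIVERED: {InboxState.ACTED, InboxState.FAILED},
    InboxState.ACTED: set(),
    InboxState.FAILED: set(),
}


def advance(current, target, action_record=None):
    """Move an inbox item to target, enforcing the transition rules."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    # "acted" is only valid once a structured per-ticket action exists.
    if target is InboxState.ACTED and not action_record:
        raise ValueError("acted requires a structured per-ticket action record")
    return target
```

The key property is that "processed" can no longer be recorded before a durable decision has been written for the ticket.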
luci_operator.py audits done tickets using broad prose signals such as “verified,” “fixed,” “resolved,” “implemented,” “updated,” “created,” and similar words.
A weak closure can pass this audit as long as the right words appear. The result is false confidence, because no commits, tests, live routes, screenshots, or artifacts are actually verified.
Fix direction: require structured completion evidence for gates:
mc-auto-review can complete while skipping review

A recent mc-auto-review run reportedly completed cleanly while skipping because the diff was ~41,969 lines, above its 5,000-line cap.
That is a green-but-no-review outcome.
Fix direction: make skip due to oversized diff a warning/incident, not a silent successful review.
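A sketch of how a skipped review could be reported as its own outcome instead of a plain success. `ReviewOutcome`, `classify_review`, and the exit-code convention are hypothetical; only the 5,000-line cap and the ~41,969-line diff come from the finding above.

```python
from enum import Enum


class ReviewOutcome(Enum):
    REVIEWED = "reviewed"
    SKIPPED_OVERSIZED = "skipped_oversized"


MAX_DIFF_LINES = 5000  # cap reported for mc-auto-review


def classify_review(diff_lines, max_lines=MAX_DIFF_LINES):
    """Classify a run by whether a review could actually happen."""
    if diff_lines > max_lines:
        return ReviewOutcome.SKIPPED_OVERSIZED
    return ReviewOutcome.REVIEWED


def exit_code(outcome):
    """Non-zero for anything that is not a real review (assumed convention)."""
    return 0 if outcome is ReviewOutcome.REVIEWED else 2
```

A scheduler that keys its green/red status on the exit code would then show the oversized-diff run as a warning rather than a clean pass.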
luci_operator.py suppresses a new operator ticket if an existing matching ticket was closed recently. That can leave persistent dirty-repo or infra conditions acknowledged as “ticket exists” even when the ticket is already done.
Fix direction: if the condition persists after closure, reopen/comment/update the same incident or create a recurrence, rather than suppressing silently.
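The recurrence rule can be written as a small decision function. `incident_action` and its string results are hypothetical, and the status values "open" and "done" are assumptions about how the existing ticket state would be read.

```python
def incident_action(condition_active, existing_status):
    """Decide what to do about an operator incident.

    existing_status is None (no matching ticket), "open", or "done".
    """
    if not condition_active:
        return "none"
    if existing_status is None:
        return "create"
    if existing_status == "done":
        # The condition outlived the closure: surface it again.
        return "reopen"
    # An open ticket already tracks the condition; no duplicate needed.
    return "suppress"
```

Under this rule, suppression only applies while a ticket is still open, so a closed ticket can no longer hide a persistent dirty-repo or infra condition.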
mc_pickup.log appears to receive both direct JSONL writes and diag logger writes for the same events.
Fix direction: use one primary event log or ensure downstream analysis dedupes correctly.
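If both writers stay, downstream analysis could dedupe on a stable event key. This is a sketch under assumptions: the field names `ts`, `event`, and `ticket` are hypothetical, since the actual mc_pickup.log record layout was not inspected.

```python
import json


def dedupe_events(lines, key_fields=("ts", "event", "ticket")):
    """Parse JSONL lines and keep the first record for each event key."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON diag lines are ignored in this sketch
        if not isinstance(record, dict):
            continue
        # Serialize each field so unhashable values can still form a key.
        key = tuple(json.dumps(record.get(f), sort_keys=True) for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Deduping at read time is a workaround; a single primary writer would remove the ambiguity at the source.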
needs-input-only behavior.
lucienne / campaign-owner / MC-4122 policy mismatch and misleading “already claimed” logging.

MC-4140 should remain closed as a duplicate. The existing luci-operator is already the 30-minute Opus scheduled task.
The next ticket, if created, should not be “build a recurring gatekeeper.” It should be: “Fix current MC operator/pickup reliability based on outside review,” with the above findings as acceptance criteria.