MC operator / pickup outside review — 2026-05-24

Context: Elmar asked whether Hermes Luci could review the existing Mission Control operator/pickup setup from outside MC before adding another scheduled task. This review was performed from the Hermes Telegram session, not by the MC ticket worker. No files, DB rows, tasks, or services were modified during the review. The one exception is a clarifying comment added to the already-cancelled MC-4140, stating that no duplicate scheduled gatekeeper should be built.

Bottom line

The control plane is alive, but not yet trustworthy enough to answer “is it working?” with a clean yes.

Working evidence exists: scheduler, dashboard, ccgram, ticket-pickup, needs-input-pickup, luci-operator, orchestrator inbox drain, and runtime/ticket claims are all running and have recent successful runs.

The main problem is that several green paths are only green mechanically. They can skip, churn, race, or mark queue items processed before a durable ticket decision happens. This creates the exact uncertainty Elmar noticed.

Evidence that it is working

Main issues found

1. needs-input-pickup is not actually needs-input-only

tasks/needs-input-pickup.md says it is a fast pickup for needs_input tickets.

But the code path calls normal dispatch() and dispatch_larry(), which means it also scans normal todo tickets. So every minute we effectively run two pickup dispatchers:

CAS prevents both from claiming the same ticket, but it does not eliminate noisy races or slot-accounting risk.

Fix direction: make --needs-input-only truly only process requeued needs-input work, or remove/rename it and use one dispatcher.
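
A minimal sketch of what a strict needs-input-only selection could look like. The Ticket shape, the status strings, and the requeued flag are assumptions made for illustration; they are not taken from mc_pickup.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    # Hypothetical ticket shape; the real schema is not shown in this review.
    id: str
    status: str      # assumed values: "todo", "needs_input"
    requeued: bool   # assumed flag: needs-input work that was requeued


def select_candidates(tickets, needs_input_only):
    """Return the tickets one pickup pass is allowed to consider."""
    if needs_input_only:
        # Strict mode: never fall through to the normal todo scan.
        return [t for t in tickets if t.status == "needs_input" and t.requeued]
    return [t for t in tickets if t.status == "todo"]
```

With this split, the fast path and the normal path never compete for the same ticket, so CAS stops being the only thing that separates them.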

2. MC-4122 is a live green-but-noisy no-op loop

Current symptom:

Actual root cause appears to be policy mismatch, not “already claimed”:

Fix direction: exclude controller-parked / campaign_owner=lucienne tickets from pickup, or make the rejection reason explicit and non-noisy.

3. Operator and pickup disagree on runnable owners

mc_pickup.py runnable assignees include:

luci_operator.py runnable assignees include:

This difference is one reason MC-4122 looks runnable to one path and parked/non-runnable to another.

Fix direction: define one canonical runnable-owner policy used by operator, pickup, triage, and UI.
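
One way to express a single shared policy is a small module that every caller imports. This is a sketch only: the assignee names are placeholders because the real lists are not reproduced here, and the function name and reason strings are hypothetical.

```python
# Placeholder names; substitute the real runnable assignees.
RUNNABLE_ASSIGNEES = frozenset({"agent-a", "agent-b"})

# Campaign owners whose tickets are controller-parked and must not be picked up.
PARKED_CAMPAIGN_OWNERS = frozenset({"lucienne"})


def runnable_decision(assignee, campaign_owner=None):
    """Return (is_runnable, reason) so every caller logs the same explanation."""
    if campaign_owner in PARKED_CAMPAIGN_OWNERS:
        return (False, "parked_by_campaign_owner")
    if assignee not in RUNNABLE_ASSIGNEES:
        return (False, "assignee_not_runnable")
    return (True, "runnable")
```

Returning an explicit reason alongside the boolean would also address the misleading "already claimed" message, since the rejection cause is named at the point of decision.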

4. Pickup concurrency is duplicated

ticket-pickup and needs-input-pickup run every minute. luci_operator.py can also run pickup directly via _run_pickup_once().

The operator-run pickup command does not set MC_POOL_ENABLED=1, unlike the scheduled pickup task. That creates different runtime behavior depending on who triggered pickup.

Fix direction:

5. active_workers_by_db() appears fail-open

The code review found that if worker counting fails badly, the fallback returns 0 active workers. That means “full capacity available,” not “stop dispatch.”

The comment reportedly says “return 0 (don’t dispatch),” but operationally 0 means dispatch more.

Fix direction: on worker-count failure, return MAX_WORKERS or block dispatch with a loud error.
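
The fail-closed behaviour can be sketched as follows. The MAX_WORKERS value, the logger name, and the count_workers callable are assumptions; the real active_workers_by_db() signature is not shown in this review.

```python
import logging

MAX_WORKERS = 4  # hypothetical capacity, for illustration only

log = logging.getLogger("mc_pickup_sketch")


def active_workers_fail_closed(count_workers, max_workers=MAX_WORKERS):
    """Return the active worker count, or max_workers if counting fails."""
    try:
        return count_workers()
    except Exception:
        # Loud error, and report "full" so that no new work is dispatched.
        log.exception("worker count failed; blocking dispatch")
        return max_workers


def free_slots(count_workers, max_workers=MAX_WORKERS):
    """Number of workers that may still be started in this pass."""
    active = active_workers_fail_closed(count_workers, max_workers)
    return max(0, max_workers - active)
```

Here a counting failure yields zero free slots, which matches the intent of the "don't dispatch" comment.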

6. Orchestrator inbox is processed on delivery, not on durable decision

Current design risk:

This directly explains why “the task says it ran” does not prove that MC actually made the right gate decision.

Fix direction: split inbox state into pending -> delivered -> acted/failed, and mark complete only after a structured per-ticket action is written.
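
The proposed states can be sketched as a small transition table. The enum, the advance() helper, and the action_record argument are hypothetical names; only the state names come from the fix direction above.

```python
from enum import Enum


class InboxState(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    ACTED = "acted"
    FAILED = "failed"


# Delivery alone never completes an item; only acted/failed are terminal.
ALLOWED = {
    InboxState.PENDING: {InboxState.DELIVERED},
    InboxState.DELIVERED: {InboxState.ACTED, InboxState.FAILED},
    InboxState.ACTED: set(),
    InboxState.FAILED: set(),
}


def advance(current, target, action_record=None):
    """Validate a state change; 'acted' requires a structured action record."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    if target is InboxState.ACTED and not action_record:
        raise ValueError("acted requires a structured per-ticket action record")
    return target
```

Under this model an item that was merely delivered stays visible as unfinished work until a per-ticket action is written.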

7. Completion proof is too keyword/prose driven

luci_operator.py audits done tickets using broad prose signals such as “verified,” “fixed,” “resolved,” “implemented,” “updated,” “created,” and similar words.

A weak closure can pass the audit as long as the right words appear, which creates false confidence without verifying commits, tests, live routes, screenshots, or artifacts.

Fix direction: require structured completion evidence for gates:

8. mc-auto-review can complete while skipping review

A recent mc-auto-review run reportedly completed cleanly while skipping because the diff was ~41,969 lines, above its 5,000-line cap.

That is a green-but-no-review outcome.

Fix direction: make skip due to oversized diff a warning/incident, not a silent successful review.
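
A sketch of separating "reviewed" from "skipped" in the run result. The enum and function names are hypothetical; the 5,000-line cap is the one cited above.

```python
from enum import Enum

MAX_DIFF_LINES = 5000  # cap cited in this review


class ReviewOutcome(Enum):
    REVIEWED = "reviewed"
    SKIPPED_OVERSIZED = "skipped_oversized"


def classify_review(diff_lines, max_lines=MAX_DIFF_LINES):
    """Decide whether a diff is small enough to be reviewed."""
    if diff_lines > max_lines:
        return ReviewOutcome.SKIPPED_OVERSIZED
    return ReviewOutcome.REVIEWED


def run_status(outcome):
    """Map the outcome to a run status; a skip must never read as 'ok'."""
    return "ok" if outcome is ReviewOutcome.REVIEWED else "warning"
```

With the 41,969-line diff mentioned above, this would report a warning instead of a clean completion.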

9. Operator tickets can be suppressed while the condition persists

luci_operator.py suppresses a new operator ticket if an existing matching ticket was closed recently. That can leave persistent dirty-repo or infra conditions acknowledged as “ticket exists” even when the ticket is already done.

Fix direction: if the condition persists after closure, reopen/comment/update the same incident or create a recurrence, rather than suppressing silently.
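
The decision could be driven by whether the condition is still present rather than by whether a ticket exists. A minimal sketch, assuming hypothetical status values and action names:

```python
def decide_incident_action(existing_status, condition_present):
    """Pick an action for an operator incident.

    existing_status: None (no matching ticket), "open", or "done".
    These values are assumptions, not the real MC statuses.
    """
    if not condition_present:
        return "none"
    if existing_status is None:
        return "create"
    if existing_status == "open":
        return "comment"
    # The ticket was closed but the condition is still there: never suppress.
    return "reopen"
```

The key property is that a recently closed ticket no longer hides a condition that is still observable.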

10. Logs are noisy/duplicated

mc_pickup.log appears to receive both direct JSONL writes and diag logger writes for the same events.

Fix direction: use one primary event log or ensure downstream analysis dedupes correctly.
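
If downstream dedupe is chosen, it might look like the sketch below. It assumes both writers emit JSON objects sharing the fields ts, event, and ticket; those field names are hypothetical, and the actual mc_pickup.log format may differ.

```python
import json


def dedupe_events(lines, key_fields=("ts", "event", "ticket")):
    """Return parsed events with exact duplicates (by key_fields) removed."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines are ignored in this sketch
        if not isinstance(record, dict):
            continue
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

A single primary event log remains the simpler option, since dedupe depends on both writers agreeing on the key fields.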

Recommended implementation sequence

  1. Fix pickup single-flight and needs-input-only behavior.
  2. Fix the lucienne / campaign-owner / MC-4122 policy mismatch and misleading “already claimed” logging.
  3. Make worker-count failure dispatch-blocking, not fail-open.
  4. Stop operator-triggered pickup from using a different environment/path.
  5. Change orchestrator inbox processed semantics to require durable per-ticket action.
  6. Move semantic gate/close decisions out of the broad operator path and into one orchestrator-owned structured action path.
  7. Replace keyword/prose done-audit proof with structured evidence requirements.
  8. Turn green-but-skipped review outcomes into explicit warning/incident states.
  9. Add durable idempotency keys for operator/watchdog incidents.

Recommendation about MC-4140

MC-4140 should remain closed as a duplicate. The existing luci-operator is already the 30-minute Opus scheduled task.

The next ticket, if created, should not be “build a recurring gatekeeper.” It should be: “Fix current MC operator/pickup reliability based on outside review,” with the above findings as acceptance criteria.