MC operator / pickup outside review

Context: Elmar asked whether Hermes Luci could review the existing Mission Control (MC) operator/pickup setup from outside MC before another scheduled task is added. The review was performed from the Hermes Telegram session, not by the MC ticket worker. No files, DB rows, tasks, or services were modified during the review. The one exception is a clarifying comment added to the already-cancelled MC-4140, stating that no duplicate scheduled gatekeeper should be built.

Bottom line

The control plane is alive, but not yet trustworthy enough to answer “is it working?” with a clean yes.

Working evidence exists: scheduler, dashboard, ccgram, ticket-pickup, needs-input-pickup, luci-operator, orchestrator inbox drain, and runtime/ticket claims are all running and have recent successful runs.

The main problem is that several green paths are only green mechanically. They can skip, churn, race, or mark queue items processed before a durable ticket decision happens. This creates the exact uncertainty Elmar noticed.

Evidence that it is working

luci-dashboard.service and ccgram.service are active.
Mission Control is reachable on both 127.0.0.1:3001 and 100.118.207.3:3001.
luci-operator is enabled and runs every 30 minutes via tasks/luci-operator.md:
command: cd /home/lucienne/workspace/mission-control && python3 luci_operator.py --allow-dev-loop
runtime profile: claude_opus_1m_high
Recent task-runs show successful execution:
luci-operator: completed at 21:30, duration ~49s
ticket-pickup: completed at 21:42, duration ~0.33s
needs-input-pickup: completed at 21:42, duration ~0.32s
orchestrator-board-sweep: completed at 21:20
triage-untriaged: completed every minute with clean no-op outputs
The operator log records actual checks:
disk/memory snapshots
active lane snapshots
blocked lane classification
done audit summary
recent task failure detection
dirty repo detection
Pickup does successfully claim real tickets when allowed, and claim CAS prevents double-claiming the same ticket.

Main issues found

1. `needs-input-pickup` is not actually needs-input-only

tasks/needs-input-pickup.md says it is a fast pickup for needs_input tickets.

However, the code path calls the normal dispatch() and dispatch_larry(), so it also scans normal todo tickets. In effect, two pickup dispatchers run every minute:

ticket-pickup
needs-input-pickup

CAS prevents both from claiming the same ticket, but it does not eliminate noisy races or slot-accounting risk.

Fix direction: make --needs-input-only truly only process requeued needs-input work, or remove/rename it and use one dispatcher.
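
A minimal sketch of the strict variant, under stated assumptions: only the --needs-input-only flag name comes from the review. The Ticket shape, the requeued_from marker, and select_candidates() are hypothetical stand-ins for whatever mc_pickup.py actually uses to recognise requeued needs-input work.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    status: str
    # Hypothetical marker for work that came back from needs_input.
    requeued_from: Optional[str] = None


def select_candidates(tickets: List[Ticket], needs_input_only: bool) -> List[Ticket]:
    """Return the tickets this dispatcher run is allowed to consider."""
    todo = [t for t in tickets if t.status == "todo"]
    if needs_input_only:
        # Strict mode: never fall through to the normal todo scan.
        return [t for t in todo if t.requeued_from == "needs_input"]
    return todo


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="pickup dispatcher (sketch)")
    parser.add_argument("--needs-input-only", action="store_true")
    return parser.parse_args(argv)
```

The point of the design is that the flag narrows the candidate set before any dispatch call, so the fast task cannot race the normal task for ordinary todo tickets.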

2. `MC-4122` is a live green-but-noisy no-op loop

Current symptom:

repeated pickup output: Found 1 MC worker todo ticket(s) then Skipping MC-4122 — already claimed by another dispatcher
this happened hundreds of times across ticket-pickup and needs-input-pickup

The actual root cause appears to be a policy mismatch, not “already claimed”:

MC-4122 is todo, assigned to lucienne, with campaign_owner=lucienne
mc_pickup.py includes lucienne in runnable worker assignees and maps it to luci
models.claim_ticket() rejects the claim due to campaign-owner/assignee guard
the dispatcher reports that rejection as “already claimed by another dispatcher”

Fix direction: exclude controller-parked / campaign_owner=lucienne tickets from pickup, or make the rejection reason explicit and non-noisy.
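
To illustrate the "explicit rejection reason" option, here is a sketch in which the claim step returns a reason instead of a bare success/failure. Everything in it is an assumption for illustration: ClaimOutcome, classify_claim(), and the exact guard condition are hypothetical, and the real guard in models.claim_ticket() may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClaimOutcome(Enum):
    CLAIMED = "claimed"
    ALREADY_CLAIMED = "already_claimed"
    OWNER_GUARD = "owner_guard"


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    assignee: str
    campaign_owner: Optional[str] = None
    claimed_by: Optional[str] = None


def classify_claim(ticket: Ticket) -> ClaimOutcome:
    """Hypothetical stand-in for a claim call that reports why it refused."""
    if ticket.claimed_by is not None:
        return ClaimOutcome.ALREADY_CLAIMED
    # Assumed guard: a ticket assigned to its own campaign owner is controller-parked.
    if ticket.campaign_owner is not None and ticket.campaign_owner == ticket.assignee:
        return ClaimOutcome.OWNER_GUARD
    return ClaimOutcome.CLAIMED


def skip_message(ticket: Ticket, outcome: ClaimOutcome) -> str:
    """Build a log line that names the real reason for the skip."""
    if outcome is ClaimOutcome.OWNER_GUARD:
        return f"Skipping {ticket.ticket_id}: parked by campaign-owner guard"
    if outcome is ClaimOutcome.ALREADY_CLAIMED:
        return f"Skipping {ticket.ticket_id}: already claimed by {ticket.claimed_by}"
    return f"Claimed {ticket.ticket_id}"
```

With a distinct outcome for the guard, the dispatcher could also log the parked case once rather than every minute.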

3. Operator and pickup disagree on runnable owners

mc_pickup.py runnable assignees include:

luci, lucienne, tessa, scott, atlas, council

luci_operator.py runnable assignees include:

luci, larry, tessa

This difference is one reason MC-4122 looks runnable to one path and parked/non-runnable to another.

Fix direction: define one canonical runnable-owner policy used by operator, pickup, triage, and UI.
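
Until one policy exists, the drift can at least be made visible. The sketch below uses the two assignee lists quoted above. The constant names and policy_drift() are hypothetical, and which set should become canonical is deliberately left open.

```python
from typing import Dict, FrozenSet, List

# Assignee lists as quoted in this review.
PICKUP_RUNNABLE: FrozenSet[str] = frozenset(
    {"luci", "lucienne", "tessa", "scott", "atlas", "council"}
)
OPERATOR_RUNNABLE: FrozenSet[str] = frozenset({"luci", "larry", "tessa"})


def policy_drift(pickup: FrozenSet[str], operator: FrozenSet[str]) -> Dict[str, List[str]]:
    """Report owners that only one of the two paths treats as runnable."""
    return {
        "only_pickup": sorted(pickup - operator),
        "only_operator": sorted(operator - pickup),
    }


if __name__ == "__main__":
    print(policy_drift(PICKUP_RUNNABLE, OPERATOR_RUNNABLE))
```

A check like this could run as a test that fails while the two lists differ, which turns the mismatch into a tracked defect rather than a surprise.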

4. Pickup concurrency is duplicated

ticket-pickup and needs-input-pickup run every minute. luci_operator.py can also run pickup directly via _run_pickup_once().

The operator-run pickup command does not set MC_POOL_ENABLED=1, unlike the scheduled pickup task. That creates different runtime behavior depending on who triggered pickup.

Fix direction:

add a global single-flight pickup lock across all paths, and/or
move slot reservation into a DB/API transaction, and
stop the operator from spawning a separate pickup loop unless it uses the exact same locked path/environment.
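
One possible shape for the single-flight lock is an advisory file lock taken without blocking, so a second dispatcher exits immediately instead of queueing. This is a sketch, not the existing implementation: single_flight() and the lock path are hypothetical, and fcntl.flock is POSIX-only.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_flight(lock_path: str) -> Iterator[bool]:
    """Yield True if this caller holds the lock, False if someone else does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fail fast instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        try:
            yield acquired
        finally:
            if acquired:
                fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Every pickup entry point (ticket-pickup, needs-input-pickup, and the operator's _run_pickup_once()) would wrap its dispatch in the same lock path and return early when the yielded value is False. A file lock only covers one host; slot reservation in a DB transaction remains the stronger option.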

5. `active_workers_by_db()` appears fail-open

The code review found that if worker counting fails badly, the fallback returns 0 active workers. To the dispatcher, that means “full capacity available,” not “stop dispatch.”

The comment reportedly says “return 0 (don’t dispatch),” but operationally a count of 0 active workers means dispatch more.

Fix direction: on worker-count failure, return MAX_WORKERS or block dispatch with a loud error.
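
A fail-closed version can be sketched as follows. The function names and the MAX_WORKERS value are placeholders, and count_active stands in for whatever query active_workers_by_db() performs.

```python
import logging
from typing import Callable

log = logging.getLogger("pickup")
MAX_WORKERS = 4  # placeholder; the real cap lives in the pickup configuration


def active_workers_or_full(count_active: Callable[[], int], max_workers: int = MAX_WORKERS) -> int:
    """Return the active worker count, or report a full pool if counting fails."""
    try:
        return count_active()
    except Exception:
        # Fail closed: an unknown count is treated as no capacity.
        log.exception("worker count failed; blocking dispatch for this run")
        return max_workers


def free_slots(count_active: Callable[[], int], max_workers: int = MAX_WORKERS) -> int:
    """Slots available for dispatch; never negative."""
    return max(0, max_workers - active_workers_or_full(count_active, max_workers))
```

Returning the cap rather than 0 makes the failure path and the "pool is full" path behave identically, and the logged exception keeps the failure loud.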

6. Orchestrator inbox is processed on delivery, not on durable decision

Current design risk:

worker emits REVIEW: or QUESTION:
ticket/inbox item is created
pickup drains digest to persistent Luci
inbox item is marked processed after delivery/harvest attempt
if Luci ignores/misfires/fails to act, the queue item is no longer pending

This directly explains why “the task says it ran” does not prove that MC actually made the right gate decision.

Fix direction: split inbox state into pending -> delivered -> acted/failed, and mark complete only after a structured per-ticket action is written.
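
The proposed split can be sketched as a small state machine. The state names follow the fix direction above. The transition table, the retry edge from failed back to pending, and the action_id parameter are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class InboxState(Enum):
    PENDING = "pending"
    DELIVERED = "delivered"
    ACTED = "acted"
    FAILED = "failed"


ALLOWED = {
    InboxState.PENDING: {InboxState.DELIVERED},
    InboxState.DELIVERED: {InboxState.ACTED, InboxState.FAILED},
    InboxState.FAILED: {InboxState.PENDING},  # assumed retry edge
    InboxState.ACTED: set(),
}


def advance(current: InboxState, target: InboxState, action_id: Optional[str] = None) -> InboxState:
    """Validate a transition; 'acted' requires a recorded per-ticket action."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    if target is InboxState.ACTED and not action_id:
        raise ValueError("cannot mark acted without a structured action id")
    return target


def still_open(state: InboxState) -> bool:
    """Delivered items stay visible until a decision is durably recorded."""
    return state is not InboxState.ACTED
```

Under this model, delivery alone never removes an item from the queue; only a written action does.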

7. Completion proof is too keyword/prose driven

luci_operator.py audits done tickets using broad prose signals such as “verified,” “fixed,” “resolved,” “implemented,” “updated,” “created,” and similar words.

That lets a weak closure pass the audit whenever the right words appear, and it creates false confidence because nothing verifies commits, tests, live routes, screenshots, or artifacts.

Fix direction: require structured completion evidence for gates:

commit hash / branch / pushed verification
test command and result
live API/route check
artifact path / screenshot for UI
mc-coord DONE/REVIEW signal with nonce / attempt id
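
The evidence list above could be captured as a structured record that the done audit checks field by field. This is a sketch under assumptions: the field names, and the choice of which items are mandatory, are hypothetical and would need to match the actual gate policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CompletionEvidence:
    commit_hash: Optional[str] = None
    pushed: bool = False
    test_command: Optional[str] = None
    test_passed: Optional[bool] = None
    route_check: Optional[str] = None
    artifact_path: Optional[str] = None
    coord_nonce: Optional[str] = None


def missing_evidence(ev: CompletionEvidence, touches_ui: bool = False) -> List[str]:
    """List the evidence items a closure still lacks (empty list means pass)."""
    missing = []
    if not ev.commit_hash or not ev.pushed:
        missing.append("pushed commit")
    if not ev.test_command or ev.test_passed is not True:
        missing.append("passing test run")
    if not ev.coord_nonce:
        missing.append("mc-coord signal nonce")
    # Assumed rule: only UI work needs a screenshot or artifact.
    if touches_ui and not ev.artifact_path:
        missing.append("UI artifact")
    return missing
```

A closure comment containing “verified” and “fixed” but no record like this would then fail the audit with a concrete list of what is absent.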

8. `mc-auto-review` can complete while skipping review

A recent mc-auto-review run reportedly completed cleanly while skipping because the diff was ~41,969 lines, above its 5,000-line cap.

That is a green-but-no-review outcome.

Fix direction: make skip due to oversized diff a warning/incident, not a silent successful review.
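
A sketch of how the skip could surface as a distinct outcome. The 5,000-line cap is the figure quoted above. ReviewStatus, classify_review(), and the non-zero exit code convention are assumptions, since the review does not say how the scheduler grades task results.

```python
from enum import Enum

MAX_DIFF_LINES = 5000  # cap quoted in this review


class ReviewStatus(Enum):
    REVIEWED = "reviewed"
    SKIPPED_OVERSIZED = "skipped_oversized"


def classify_review(diff_lines: int, max_lines: int = MAX_DIFF_LINES) -> ReviewStatus:
    """Decide whether a diff is reviewed or skipped for size."""
    if diff_lines > max_lines:
        return ReviewStatus.SKIPPED_OVERSIZED
    return ReviewStatus.REVIEWED


def exit_code(status: ReviewStatus) -> int:
    """Assumed convention: non-zero so the run is not recorded as a clean success."""
    return 0 if status is ReviewStatus.REVIEWED else 2
```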

9. Operator tickets can be suppressed while the condition persists

luci_operator.py suppresses a new operator ticket if an existing matching ticket was closed recently. That can leave a persistent dirty-repo or infra condition treated as “ticket exists” even though that ticket is already done.

Fix direction: if the condition persists after closure, reopen/comment/update the same incident or create a recurrence, rather than suppressing silently.
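
The recurrence rule can be sketched as a pure decision function plus a stable incident key. All names here are hypothetical, including the status strings; the key derivation is one simple option, not a description of what luci_operator.py does today.

```python
import hashlib
from typing import Optional

# Hypothetical status names for tickets that are still being worked.
OPEN_STATUSES = frozenset({"todo", "in_progress", "needs_input"})


def incident_key(kind: str, subject: str) -> str:
    """Stable key so repeated detections map to the same incident."""
    raw = f"{kind}\x1f{subject}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def incident_action(condition_active: bool, existing_status: Optional[str]) -> str:
    """Choose what to do with the incident ticket for this detection."""
    if not condition_active:
        return "none"
    if existing_status is None:
        return "create"
    if existing_status in OPEN_STATUSES:
        return "comment"
    # Ticket is closed but the condition is still present: do not suppress.
    return "reopen"
```

The key difference from the current behaviour is the last branch: a recently closed ticket no longer silences a condition that is still observable.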

10. Logs are noisy/duplicated

mc_pickup.log appears to receive both direct JSONL writes and diag logger writes for the same events.

Fix direction: use one primary event log or ensure downstream analysis dedupes correctly.

Recommended implementation sequence

Fix pickup single-flight and needs-input-only behavior.
Fix the lucienne / campaign-owner / MC-4122 policy mismatch and misleading “already claimed” logging.
Make worker-count failure dispatch-blocking, not fail-open.
Stop operator-triggered pickup from using a different environment/path.
Change orchestrator inbox processed semantics to require durable per-ticket action.
Move semantic gate/close decisions out of the broad operator path and into one orchestrator-owned structured action path.
Replace keyword/prose done-audit proof with structured evidence requirements.
Turn green-but-skipped review outcomes into explicit warning/incident states.
Add durable idempotency keys for operator/watchdog incidents.

Recommendation about MC-4140

MC-4140 should remain closed as a duplicate. The existing luci-operator is already the 30-minute Opus scheduled task.

The next ticket, if created, should not be “build a recurring gatekeeper.” It should be: “Fix current MC operator/pickup reliability based on outside review,” with the above findings as acceptance criteria.

MC operator / pickup outside review — 2026-05-24