
MC orchestration scheduled-task simplification review — 2026-05-28

Verdict

Alive, but over-layered. The MC orchestration scheduled tasks are mostly firing and the core path is functioning, but the control plane has accumulated overlapping minute loops, watchdogs, cleanup jobs, and meta-tuning tasks. The simplification target should be: one minute-level intake/dispatch loop, one 15-minute operator loop, one external Hermes watchdog, and a small set of true cleanup/service monitors.

Evidence gathered read-only:

What each task does and simplification verdict

External Hermes cron

Operator / control-plane decision loop

Intake / dispatch

Reapers / cleanup

Persistent Luci

Watchdogs / sweeps

Main overlaps

  1. Three every-minute intake loops: ticket-pickup, needs-input-pickup, triage-untriaged.
  2. Multiple stale-review/needs-input triagers: ticket-pickup, orchestrator-board-sweep, and luci-operator.
  3. Duplicate tmux cleanup: ticket-pickup has reaping; reap-zombie-workers also reaps.
  4. Duplicate scheduler monitors: Hermes external watchdog, scheduler-watchdog, and cron-watchdog all inspect task freshness/failures.
  5. Persistent-session health split: persistent-luci-watchdog, persistent-luci-branch-guard, rotate-luci-session, plus systemd-watchdog all touch parts of persistent Luci health.
  6. Meta-operator tuning creates human-facing noise: nightly luci-operator-tuner filed an Elmar ticket.
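One way to make these overlaps easy to re-check is to write them down as data and ask which tasks show up in more than one overlap group. The sketch below transcribes items 1 to 5 from the list above; the group labels are shorthand invented for this example, and `tasks_in_multiple_groups` is a hypothetical helper, not part of MC.

```python
from collections import defaultdict

# Overlap groups transcribed from items 1-5 above.
# The group labels are shorthand chosen for this sketch.
OVERLAPS = {
    "minute intake": [
        "ticket-pickup", "needs-input-pickup", "triage-untriaged",
    ],
    "stale review/needs-input triage": [
        "ticket-pickup", "orchestrator-board-sweep", "luci-operator",
    ],
    "tmux cleanup": [
        "ticket-pickup", "reap-zombie-workers",
    ],
    "scheduler monitoring": [
        "Hermes external watchdog", "scheduler-watchdog", "cron-watchdog",
    ],
    "persistent Luci health": [
        "persistent-luci-watchdog", "persistent-luci-branch-guard",
        "rotate-luci-session", "systemd-watchdog",
    ],
}


def tasks_in_multiple_groups(overlaps):
    """Return {task: [groups]} for tasks that appear in more than one group."""
    groups_by_task = defaultdict(list)
    for group, tasks in overlaps.items():
        for task in tasks:
            groups_by_task[task].append(group)
    return {task: groups for task, groups in groups_by_task.items() if len(groups) > 1}


multi = tasks_in_multiple_groups(OVERLAPS)
```

With the groups as listed, only ticket-pickup spans several groups (intake, stale triage, and tmux cleanup), which is consistent with treating it as the natural home for the consolidated minute loop.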

Recommended simplified shape

Keep as core layers

  1. External sentinel
    • Keep: Hermes MC control-plane watchdog (quiet, 15m).
    • Purpose: outside-scheduler truth source; quiet on health; alert only on genuine human/action issues.

  2. Single minute intake/dispatch
    • Keep/expand: ticket-pickup.
    • Fold into it or trigger from it:
      • needs-input-pickup
      • triage-untriaged
      • lightweight stale needs_input/in_review routing
      • lightweight idle tmux reaping

  3. Single 15-minute operator
    • Keep: luci-operator.
    • Purpose: durable decisions, gate/reject review, technical follow-up tickets, dependency cleanup, scheduler/runtimes summary.
    • Should not compete with the minute dispatcher.

  4. Cleanup lane
    • Keep: queue-reaper, worktree-reaper.
    • Decide whether reap-zombie-workers remains independent or becomes less frequent.

  5. Service/session lane
    • Keep: persistent-luci-watchdog, rotate-luci-session, systemd-watchdog.
    • Merge: persistent-luci-branch-guard into persistent watchdog.

  6. Scheduler-ticket creator
    • Keep: scheduler-watchdog for durable MC tickets.
    • Retire/demote: cron-watchdog, unless it has unique overdue coverage.
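The "single minute intake/dispatch" layer could look roughly like the sketch below: one tick takes one shared lock, runs the folded-in steps in a fixed order, and isolates failures so a broken step does not starve the rest. This is an illustration under assumptions, not the existing ticket-pickup code: `run_intake_tick`, the stub step functions, and the in-process `threading.Lock` (standing in for whatever pickup lock MC actually uses) are all hypothetical.

```python
import threading
from typing import Callable, Dict, List, Tuple

# A step is a (name, callable) pair; the callable returns how many items it handled.
Step = Tuple[str, Callable[[], int]]


def run_intake_tick(steps: List[Step], lock: threading.Lock) -> Dict[str, object]:
    """Run every intake step once, in order, under a single shared lock."""
    # If the previous tick is still running, skip this tick instead of queueing.
    if not lock.acquire(blocking=False):
        return {"skipped": True, "results": {}}
    results: Dict[str, object] = {}
    try:
        for name, step in steps:
            try:
                results[name] = step()
            except Exception as exc:
                # Record the failure and continue with the remaining steps.
                results[name] = f"error: {exc}"
    finally:
        lock.release()
    return {"skipped": False, "results": results}


# Hypothetical stubs standing in for the real pickup, needs-input, and triage logic.
def pickup_tickets() -> int:
    return 0


def pickup_needs_input() -> int:
    return 0


def triage_untriaged() -> int:
    return 0


pickup_lock = threading.Lock()
summary = run_intake_tick(
    [
        ("ticket-pickup", pickup_tickets),
        ("needs-input-pickup", pickup_needs_input),
        ("triage-untriaged", triage_untriaged),
    ],
    pickup_lock,
)
```

Running the steps sequentially in one tick removes the need for three separate every-minute schedules while keeping each step's result visible in a single summary.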

Concrete simplification candidates, in order

  1. Retire or event-trigger needs-input-pickup. Normal ticket-pickup already runs every minute and uses the same pickup lock.
  2. Fold triage-untriaged into the single intake loop or slow it to every 5 minutes. It usually processes zero.
  3. Merge persistent-luci-branch-guard into persistent-luci-watchdog. Same domain, tiny check.
  4. Retire cron-watchdog after confirming Hermes external watchdog + scheduler-watchdog cover the same overdue conditions. Avoid duplicate scheduler monitors.
  5. Fold orchestrator-board-sweep into luci-operator or keep only as a subroutine. Same board-hygiene domain.
  6. Reduce reap-zombie-workers to hourly or make it an emergency fallback only. Pickup already does minute-level session reaping.
  7. Change luci-operator-tuner from nightly Elmar-facing tickets to weekly/anomaly-triggered Luci-owned reports. This cuts meta-noise.
  8. Retire mc-orchestrator-inbox-cleanup if insert-time auto-expiry is confirmed. It is legacy direct SQL.
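Candidate 4 hinges on confirming that cron-watchdog has no unique overdue coverage. A minimal sketch of that check is a set difference between the conditions the retiring monitor detects and the union of what the remaining monitors detect. The condition names below are invented placeholders for illustration only; the real lists would have to come from reading each watchdog's checks.

```python
from typing import Dict, Iterable, Set


def unique_coverage(candidate: Iterable[str], others: Dict[str, Iterable[str]]) -> Set[str]:
    """Return conditions that only the candidate monitor detects."""
    covered_elsewhere: Set[str] = set()
    for conditions in others.values():
        covered_elsewhere.update(conditions)
    return set(candidate) - covered_elsewhere


# Placeholder condition names (assumptions, not taken from the actual watchdogs).
cron_watchdog = {"task-overdue", "task-failed", "task-never-ran"}
remaining = {
    "Hermes external watchdog": {"task-overdue", "scheduler-down"},
    "scheduler-watchdog": {"task-failed"},
}

gaps = unique_coverage(cron_watchdog, remaining)
```

An empty result would support retiring the monitor; any remaining entry (here the placeholder `task-never-ran`) is coverage that would need to move into one of the kept watchdogs first.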

Suggested end-state task list

Everything else should either be retired, merged, or downgraded to lower cadence.

Current “is it working?” answer