MC-4942 — Scheduler fix wave: 35 verified findings, 12 work units (2026-06-10 review)

State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 2d ago

Ticket is done; runtime is closed. · profile claude_opus_1m_medium

Description

Fix wave from the 2026-06-10 multi-agent scheduled-tasks review (reports/orchestration-audit/2026-06-10/scheduled-tasks-review.md — 35 verified items: 6 broken, 13 fragile, 16 simplify). Executed interactively with Elmar in the persistent session, one work unit at a time, with per-task redundancy checks against the new orchestration model (board-manager single dispatch path, MC-4913 close gate, orchestration-guard). 12 work units: U1 exit-75+warning UI (DEADLINE Sat 06-13), U2 redundant-task triage (Elmar decisions), U3 lock-lifecycle hardening, U4 /runs forensics, U5 cost telemetry, U6 task-detail page, U7 schedule display, U8 serial-tick stalls, U9 scheduler.py dup excision, U10 /tasks UX, U11 logging hygiene, U12 metadata honesty. Effort: broken+fragile ~4-5 dev-days, simplify ~3 more. Expected check-in: per-unit, same session(s).

Activity

Details — Done · High · Luci ▼

Source api (human)

Project Mission Control

Created 2d ago

Updated 2d ago

Advanced / Operator evidence

Evidence

Ticket is done; runtime is closed. MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.

luci-board-manager 2d ago

[visibility-only] Controller consumed blocked review-only Kanban card t_7518114e and routed authorized landing/deploy gate to internal Kanban card t_f5fedf23. Scope is MC-4942 U1 warning/exit-75 scheduler slot accounting only; worker uses isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u1-warning-slot. MC remains ledger-only; no MC pickup/runtime endpoints used.

luci-board-manager 2d ago

[visibility-only] U1 scheduler-slot fix from t_f5fedf23 was independently verified: origin/master for /home/lucienne/workspace is 9e5af5e72f931a6ab452a844e725f35a0ac2a671, focused SchedulerWarningExitTests passed (3 tests OK), py_compile passed, and live scheduler.py contains the warning terminal-status filter. Remaining U1 warning-status UI/badge work has been routed to internal Kanban child card t_2342d4e1 in isolated Mission Control worktree /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u1-warning-ui (pid 1104544). MC remains ledger-only; no MC runtime/pickup endpoints used.

api 2d ago

U1 DONE + deploy-verified (2026-06-10 23:42). Saturday refire flood defused. Engine: scheduler.py last_run_time counts status='warning' as slot-consuming. Behaviorally verified (warning row at Sat 04:00 -> not due 04:01, due 06:00). Committed to workspace master. Scheduler reads file fresh per tick -> already live. UI: kanban worker commit ec7b904 landed (amber badge-warning CSS, legend entry, Problems bucketing + warning breakdown chip, fail-reason preview for warning rows, 2 tests). Live-served, verified. PROCESS NOTE: the engine edit was made INLINE by the board-manager cron tick at 23:03 — violates the router-only HARD DELEGATION BUDGET injected earlier today (prompt rule alone insufficient; fix was correct, but enforcement needs the toolset/post-tick layer). Logged for D follow-up. The UI piece it correctly delegated via kanban cards (t_2342d4e1 worker + t_07676304 landing gate) — half-compliance. Next: U2 redundant-task triage with Elmar.
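The slot rule in the U1 note above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in scheduler.py: the names `SLOT_CONSUMING`, `latest_slot`, and `is_due` are hypothetical, and the ticket only confirms that status 'warning' now counts as slot-consuming ('success' is assumed).

```python
from datetime import datetime

# Assumption: statuses that mark a schedule slot as already used.
# The ticket confirms 'warning'; 'success' is assumed for illustration.
SLOT_CONSUMING = {"success", "warning"}


def latest_slot(now, slots):
    """Return the most recent scheduled slot at or before `now`, or None."""
    past = [s for s in slots if s <= now]
    return max(past) if past else None


def is_due(now, slots, runs):
    """True when the latest slot has no slot-consuming run at or after it.

    `runs` is a list of (started_at, status) tuples.
    """
    slot = latest_slot(now, slots)
    if slot is None:
        return False
    consumed = [t for t, status in runs if status in SLOT_CONSUMING]
    last_run = max(consumed) if consumed else None
    return last_run is None or last_run < slot


# Saturday 2026-06-13 with slots at 04:00 and 06:00, as in the verification note.
slots = [datetime(2026, 6, 13, 4, 0), datetime(2026, 6, 13, 6, 0)]
runs = [(datetime(2026, 6, 13, 4, 0), "warning")]
```

With the warning row recorded at 04:00, `is_due` is false at 04:01 and true at 06:00. If 'warning' were left out of the set, the 04:00 slot would look unconsumed and the task would refire on every tick until 06:00.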

luci-board-manager 2d ago

[visibility-only] Controller normalized stale Kanban state after internal card t_07676304 completed and U1 was already deploy-verified. MC-4942 remains waiting for U2 redundant-task triage decisions with Elmar before releasing the next implementation unit. No MC pickup/runtime endpoints were used; internal Kanban currently has no active MC-4942 worker.

luci-board-manager 2d ago

[visibility-only] Luci board manager routed the first bounded unit for this scheduler-fix wave to internal Kanban mc-internal card t_2ca56d84. Scope: U1 urgent exit-75/warning slot-consumption fix only, in isolated top-level worktree /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u1-warning-slot using codexbuilder. MC remains ledger-only; no MC pickup/runtime endpoints were used. Next gate: controller reviews the Kanban handoff/commit/test evidence, then releases the next unit.

luci-board-manager 2d ago

[visibility-only] Parent Kanban card t_2ca56d84 produced a review-required scheduler-slot fix handoff (commit 9e5af5e72f931a6ab452a844e725f35a0ac2a671). The controller consumed that handoff, marked the parent card complete, and routed independent review/landing to internal Kanban card t_7518114e. Verified t_7518114e is running under codexbuilder in isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u1-warning-slot (pid 1074960). MC remains ledger-only; no MC pickup/runtime endpoints were used.

luci-board-manager 2d ago

[visibility-only] Controller consumed the review-required handoff from internal Kanban parent t_2342d4e1 and routed independent review/landing/deploy verification to child card t_07676304. Verified t_07676304 is running under codexbuilder in isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u1-warning-ui with PID 1115793. MC remains ledger-only; no MC pickup/runtime endpoints were used. No human input needed unless the child finds a product/spend/credential blocker.

api 2d ago

U2 DONE (2026-06-11 07:35, Elmar-approved all 5). 121 -> 116 live tasks. RETIRED (archived to tasks/_disabled_archive/mc-4942-u2-retired/, sha256 manifest + restore): mc-health-check-hourly (no-op empty command), mc-wal-checkpoint (weaker dup of per-tick TRUNCATE), mc-orchestrator-inbox-cleanup (impossible target rows), probe-runtime-profiles (dup of daily runtime-health-probe), cleanup-stale-worktrees (0 removals/33d, superseded). CONSOLIDATED REAPER: worktree-reaper now DAILY 04:30, covers mission-control repo incl. state/kanban-worktrees (the gap that needed yesterday's manual 31GB cleanup); terminal+clean worktrees safe even when cherry-pick-ahead (commits preserved on branch); pool guard scoped to canonical pool-{0,1,2} only so nested-garbage copies get reaped. Profile lie fixed (claude_glm -> direct_no_llm). pool-snapshot-purge + pool-remote-prune KEPT (distinct jobs). LIVE RESULT from first run: 48 worktrees -> 14 (all legitimately manual: canonical pool, active tickets, dirty trees), 6.4GB -> 1.9GB. Scheduler sanity: 116 root tasks, 0 disabled-in-root, profile lint clean. Wiki inventory updated. Workspace commit af546f7d.
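The "sha256 manifest + restore" retirement described above could look roughly like this. It is a sketch only: the function names, the `manifest.json` filename, and the manifest layout are assumptions, not the format actually used under tasks/_disabled_archive/.

```python
import hashlib
import json
import shutil
from pathlib import Path


def retire_task(task_file: Path, archive_dir: Path) -> dict:
    """Move a task file into the archive and record its sha256 in a manifest."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(task_file.read_bytes()).hexdigest()
    shutil.move(str(task_file), str(archive_dir / task_file.name))

    # Hypothetical manifest layout: {filename: {sha256, original path}}.
    manifest_path = archive_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[task_file.name] = {"sha256": digest, "original": str(task_file)}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest[task_file.name]


def restore_task(name: str, archive_dir: Path) -> Path:
    """Move an archived task back after verifying its checksum."""
    manifest = json.loads((archive_dir / "manifest.json").read_text())
    entry = manifest[name]
    archived = archive_dir / name
    if hashlib.sha256(archived.read_bytes()).hexdigest() != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {name}")
    original = Path(entry["original"])
    shutil.move(str(archived), str(original))
    return original
```

Recording the checksum at retirement time means a later restore can refuse a file that was edited or truncated while archived.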

api 2d ago

U3 DONE + deploy-verified (2026-06-11 07:58). Lock lifecycle hardened — one policy, five actors aligned. Scheduler half (workspace 4d5bbf9 + deflake commit): reconcile uses release_lock_if_owned (Jun-6 double-run class closed); kill_stale verifies /proc cmdline identity before SIGTERM (scheduler token, task id, or luci-bg-<run_id> wrapper token — agent's sound extension preventing MC-3640-style orphans), PermissionError never strands a lock; comment-skip closes its own lastrowid not MAX(id); single _parse_run_ts naive=UTC policy at all three parse sites; heartbeat clean_stale_locks keeps live-PID locks regardless of age (the reattributed Jun-6 root cause). 19 tests, deflaked (fork->exec /proc window), 8/8 stable runs. Live next tick — verified ticks flowing normally (26 runs/10min = expected cadence). /api/run half (MC repo bd7c7ef): stale threshold = task timeout+60 (was flat 120s double-run hazard); detached Popen dispatch, immediate {ok,started} (no 600s SIGKILL orphaning, no false timeout stamps); unknown->404, disabled->400; response shape backward-compatible. 8 new tests + regression suite green. Deployed (luci-dashboard restarted); LIVE verified: POST dispatched, run row completed, repeat-POST safe. NEW FINDING logged during verify: task_runs.started_at stores ISO-T+02:00 strings; any SQL comparing them to sqlite datetime('now') (space-separated UTC) string-compares garbage — my own monitoring query tripped it. Sweep app.py/models.py for datetime('now') comparisons against started_at — fold into U4/U9. Follow-up noted: release_lock_if_owned internal naive=SAST parse is a 4th tz site (benign today, both writers emit aware) — fold into U9.
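The NEW FINDING above (ISO-T+02:00 strings compared against space-separated UTC strings) can be reproduced with an in-memory SQLite database. The timestamps below are made-up examples, and a fixed string stands in for `datetime('now')` so the result is repeatable; wrapping the column in `datetime()` is one possible fix, not necessarily the one applied in the sweep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Shape of task_runs.started_at per the ticket: ISO-8601 with 'T' and +02:00.
started_at = "2026-06-11T00:00:01+02:00"  # the instant 2026-06-10 22:00:01 UTC
# Shape of datetime('now'): space-separated UTC (fixed value for repeatability).
now_utc = "2026-06-10 23:00:00"

# Raw text comparison: the local date prefix "06-11" sorts after "06-10",
# so the run appears to start after "now" although it started an hour before.
raw = conn.execute("SELECT ? > ?", (started_at, now_utc)).fetchone()[0]

# datetime() applies the offset and emits the same space-separated UTC shape.
normalised = conn.execute("SELECT datetime(?)", (started_at,)).fetchone()[0]
fixed = conn.execute("SELECT datetime(?) > ?", (started_at, now_utc)).fetchone()[0]

print(raw, normalised, fixed)
```

Here `raw` is 1 (wrong), `normalised` is '2026-06-10 22:00:01', and `fixed` is 0 (correct).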

api 2d ago

U4 DONE + deploy-verified (2026-06-11 08:40). /runs is now a usable forensics surface. Landed (agent commit 57fb9dd + orchestrator count fix): server-side status/task filters + pagination (tabs are real queries — the Jun-10 canary failure is findable, not buried under poller noise); honest tab counts (Tessa caught a window mismatch: badge counted 7-day cutoff while lists showed all retained rows — aligned, regression-tested); failure previews extract the actual error (last [FAIL]/Error/Traceback line, else tail — no more progress chatter masking the cause); mobile failed rows show a visible red reason line; every relative time carries absolute SAST (hover + visible secondary). BONUS from the datetime('now') sweep: 10 real string-comparison bugs fixed across models.py/app.py — including ticket-create dedup whose "10-second window" actually matched ANY same-UTC-day ticket (app.py:8872). 4 false-positives checked and documented. Tessa segment: all 4 original findings verified fixed; 1 Major (count mismatch) found + fixed + re-verified live (badge 5 = DB 5); 1 Minor noted (radio-brief task fails without emitting an error line — task-side, preview fallback behaves correctly). Tests: 22 (U4 file) + touched suites green; 2 pre-existing failures verified identical on clean master. Deployed, restarted, live-verified.
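The failure-preview rule described above (last [FAIL]/Error/Traceback line, else the tail) can be sketched as below. The function name, the `tail_lines` and `max_len` parameters, and the " | " joiner are assumptions for illustration; only the marker list comes from the ticket.

```python
import re

# Markers named in the ticket; matching anywhere in the line is an assumption.
ERROR_MARKER = re.compile(r"\[FAIL\]|Error|Traceback")


def failure_preview(log_text: str, tail_lines: int = 3, max_len: int = 200) -> str:
    """Return the last line carrying an error marker, else the log tail."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    if not lines:
        return ""
    # Scan from the end so later progress chatter cannot mask the cause.
    for line in reversed(lines):
        if ERROR_MARKER.search(line):
            return line[:max_len]
    # No marker found: fall back to the last few lines.
    return " | ".join(lines[-tail_lines:])[:max_len]
```

A log that ends in progress output but contains `ValueError: bad cron` earlier yields that error line; a log with no marker at all yields its last three lines joined together, which matches the fallback behaviour noted for the radio-brief task.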

luci-board-manager 2d ago

[visibility-only] Controller reconciled MC-4942 after U2, U3, and U4 were already recorded as DONE/deploy-verified while the row still showed the stale U2 human-decision blocker. Routed the next bounded unit, U5 cost telemetry, to internal Kanban mc-internal card t_f890daa8 under codexbuilder. Verified the worker is running in isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u5-cost-telemetry-root (with a separate prepared MC app worktree for app/UI changes). MC remains ledger-only; no MC pickup/runtime endpoints were used. No human input is needed unless the worker finds a product/spend/credential blocker.

luci-board-manager 2d ago

[visibility-only] Controller rejected internal Kanban parent t_f890daa8 as a read-only false stop: it completed with no changed files/tests despite implementation authorization for U5 cost telemetry. Routed implementation continuation to mc-internal card t_5b213a49 under codexbuilder and verified it is running in isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u5-cost-telemetry-root (pid 1597435). MC remains ledger-only; no MC pickup/runtime endpoints were used. No human input needed unless the worker finds a genuine product/spend/credential blocker.

api 2d ago

U5a DONE + deploy-verified (2026-06-11 09:15). Cost telemetry resurrected after 11 dead days. Landed (agent 9cf5be7 + spend-ingest task): scripts/spend_ingest.py — Lane 1 hermes state.db ended sessions (actual/estimated cost, read-only, watermark), Lane 2 ~/.claude transcripts incremental (per-file byte-offset watermark, message-id dedup; full backfill 64s, increment 2.3s). spend_tracker: per-lane accrual (api real-$ / subscription est-equivalent / unknown), freshness, additive schema (MC-4557 guardrail consumers safe — verified shape backward-compat). /cost: per-lane ledger + "Ledger through <day>"; uncovered days say no-data, never $0. Home chip: real-$ budget + sub-equiv split. append_cost_event except-pass now logs. Daily scheduler task spend-ingest (04:20, direct_python, lint clean). THE NUMBERS (June 1-11 backfill): API-billed real $0.49 · subscription token use $6,415 est-equivalent (peak Jun 4 $1,594/day; Jun 10 $554+$73) · unknown $6.51. The old surface showed $0.03. 48 tests green; deployed; live-verified (/cost lanes + freshness, chip sub-equiv). U5b parked (follow-up): provider-statement reconciliation; re-lane GLM/Kimi if they're flat plans not per-token; crashed hermes sessions (no ended_at) never ingested; scheduler.py _record_task_cost_event cleanup folds into U9. Agent transparency note: one render smoke hit live mc.db read-path during dev (no writes observed) — flagged, acceptable.
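The Lane 2 mechanism above (per-file byte-offset watermark plus message-id dedup) can be sketched as follows. This is not scripts/spend_ingest.py: it assumes the transcript is JSON Lines with an `id` field per record, and the function name and return shape are hypothetical.

```python
import json
from pathlib import Path


def ingest_increment(path: Path, offset: int, seen_ids: set) -> tuple:
    """Read complete JSONL lines appended after `offset`.

    Returns (new_records, new_offset). `seen_ids` is updated in place.
    """
    records = []
    with path.open("rb") as fh:
        fh.seek(offset)
        while True:
            line_start = fh.tell()
            line = fh.readline()
            if not line:
                break
            if not line.endswith(b"\n"):
                # Partial line still being written: rewind and retry next pass.
                fh.seek(line_start)
                break
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a corrupt line but still advance past it
            if not isinstance(rec, dict):
                continue
            msg_id = rec.get("id")
            if msg_id is None or msg_id in seen_ids:
                continue  # duplicate or unidentifiable message
            seen_ids.add(msg_id)
            records.append(rec)
        new_offset = fh.tell()
    return records, new_offset
```

Persisting `new_offset` per file is what makes the incremental pass cheap compared with a full backfill: only bytes written since the last run are parsed.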

api 2d ago

U6+U7+U8 DONE + deploy-verified (2026-06-11 09:50). Parallel batch, three agents, disjoint territories. U6 (task detail, aaee4fe): mobile card collapse — scrollWidth 375 at 375px (was 792); stats relabeled Runs(7d)/Success(7d) with retention tooltips; description renders markdown (mirrors /md-view trust model); failure preview + abs times in run history. Tessa: APPROVED (cross-checked stats vs mc.db exact). U7 (schedule humanizer, 34b0f7f): comma-hour dow mangle fixed ('Sat at 04:00, 06:00' live); 14 labels improved, 0 Custom remaining on /tasks (Tessa full-scan of 117 rows: zero raw crons); numeric sort_time; 61 unchanged shapes verified identical; 57 tests. U8 (tick stalls, 96af9c5): 12 long tasks backgrounded with per-task p95 evidence (memory-extractor 994s was the biggest stall source; life-manager-scan was blinding scheduler-watchdog 3min EVERY hour); deliberate keep-foreground list documented (retry semantics, notify_on:success, same-tick ordering deps — bg path skips those); ~50 high-timeout/low-p95 tasks left alone. bg failure notification VERIFIED working pre-sweep. 18 bg tasks total now. Tessa minors logged: #8 server-local paths on detail page (-> U11), #9 FAILED stat conflates timeout (-> U10 taxonomy). U9+ flags accumulated: reconcile notify_to topic routing lost on bg failure (stub task dict); dream-cycle/b4i remain the foreground ceiling; release_lock_if_owned 4th tz site; scheduler _record_task_cost_event cleanup.
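The U7 comma-hour case ('Sat at 04:00, 06:00') can be illustrated with a small humanizer. This is a sketch, not the landed code: the cron string `0 4,6 * * 6` is an assumed example, the "Daily at" label is an assumption, and unlike the real humanizer (which left zero raw crons on /tasks) this one falls back to the raw expression for shapes it does not handle.

```python
DOW = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def humanize(cron: str) -> str:
    """Humanize 'M H * * DOW' where H may be a comma list; else return input."""
    parts = cron.split()
    if len(parts) != 5:
        return cron
    minute, hours, dom, month, dow = parts
    if dom != "*" or month != "*" or not minute.isdigit():
        return cron
    hour_list = hours.split(",")
    if not all(h.isdigit() for h in hour_list):
        return cron
    # Format every hour in the list, not just the first one.
    times = ", ".join(f"{int(h):02d}:{int(minute):02d}" for h in hour_list)
    if dow == "*":
        return f"Daily at {times}"
    if dow.isdigit() and int(dow) <= 7:
        return f"{DOW[int(dow) % 7]} at {times}"  # cron allows both 0 and 7 for Sunday
    return cron


print(humanize("0 4,6 * * 6"))
```

The printed label is `Sat at 04:00, 06:00`.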

codexbuilder 2d ago

MC-4942 U5 controller gate complete.
Reviewed/landed:
- Root scheduler original d5b64646659f0579490a646da9abf11b4aef7bf6 re-landed on current root origin/master as cff4b662a34cf499be83144ec9420d00837a23dc.
- MC app original 17980681230cb39bfb0efd3d2d1a8d7bf2fda3fb had acceptance gaps; fixed in cc6d265 then re-landed on current mission-control origin/master as 1df6ba4 + 54cb08fa858bdaaf1c60126eccb26f3da7c04f0f.
Acceptance evidence:
- Root scheduler no longer fabricates missing telemetry as authoritative cost_usd=0.0; unknowns are omitted or cost_source='unknown', and cost POST/session update failures surface via logs/stderr telemetry.
- /cost labels totals as known cost, counts unknown turns, warns that missing scheduler/runtime usage is not $0, includes recent gap evidence, and preserves U5a spend-ledger lane semantics.
- Adjacent runtime metadata now says known cost plus unknown-cost turn counts instead of plain cost $0.0000 for unknown-only sessions.
Tests/evidence:
- Root isolated + live selected files: 6 focused unittest checks OK.
- MC isolated + live selected files: tests/test_mc4491_spend_tracker.py + tests/test_mc4942_cost_telemetry.py OK, 18 passed.
- py_compile app.py/models.py/spend_tracker.py OK; Jinja cost.html loads OK.
- Live deploy: selected reviewed files copied into dirty live checkouts after target-file dirt check; cmp verified live selected files match landing worktrees. Mission Control gunicorn HUP reload done; /api/health healthy; /cost?days=7 renders and contains Telemetry incomplete, Recent gap evidence, and Spend ledger.
No MC pickup/runtime/send/harvest endpoints were used. graphify update was attempted but the installed launcher is broken (ModuleNotFoundError: graphify).
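The acceptance rule above (never record missing telemetry as cost_usd=0.0; report known cost plus a count of unknown-cost turns) can be sketched like this. The function names and the 'reported' source value are hypothetical; only `cost_usd` and `cost_source='unknown'` appear in the ticket.

```python
from typing import Optional


def cost_event(session_id: str, cost_usd: Optional[float]) -> dict:
    """Build a cost event; a missing cost is marked unknown, never zero."""
    event = {"session_id": session_id}
    if cost_usd is None:
        event["cost_source"] = "unknown"  # omit cost_usd entirely
    else:
        event["cost_usd"] = cost_usd
        event["cost_source"] = "reported"  # assumed label for a known cost
    return event


def summarize(events: list) -> dict:
    """Total only known costs and count the turns whose cost is unknown."""
    known = [e["cost_usd"] for e in events if "cost_usd" in e]
    unknown = sum(1 for e in events if e.get("cost_source") == "unknown")
    return {"known_cost_usd": round(sum(known), 4), "unknown_turns": unknown}


events = [cost_event("a", 0.25), cost_event("b", None), cost_event("c", 0.0)]
summary = summarize(events)
```

A genuinely free turn (session "c") still carries an explicit 0.0, so it stays distinguishable from a turn whose cost was never reported (session "b").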

api 2d ago

CLOSED done + verified. ALL 12 WORK UNITS COMPLETE (2026-06-10 22:19 -> 2026-06-11 11:40). Final state of the 35 verified review findings: 6/6 broken FIXED, 13/13 fragile FIXED, 14/16 simplify DONE (2 true minors remain, logged below).
- U1 exit-75 slot + warning UI (Sat flood defused, behaviorally verified).
- U2 5 tasks retired + consolidated daily worktree-reaper covering kanban-worktrees (48->14 worktrees, 6.4->1.9GB; Elmar-approved).
- U3 lock lifecycle: one policy, 5 actors (Jun-6 double-run class closed; 27 tests).
- U4 /runs forensics: server-side filters/pagination/honest counts, tail-based error previews, mobile fail lines, abs times + 10 datetime('now') string-compare bugs fixed incl. ticket dedup matching any-same-day.
- U5a cost telemetry: hermes + CLI-transcript ingest, per-lane honest /cost + chip ($0.03 lie -> $6,415 MTD sub-equiv visible), daily spend-ingest task; ledger untracked from git after a reset clobbered it once.
- U6 task detail: 375px mobile (was 792), Runs(7d) labels, markdown descriptions.
- U7 schedule humanizer: 0 raw crons on /tasks (Tessa full-scan).
- U8 12 long tasks backgrounded with p95 evidence (watchdog blindness ended); deliberate keep-foreground list.
- U9 scheduler.py -173 lines (11 shadowed defs excised, AST-guarded), timestamped tick logs, bg notify_to routing restored, fresh-DB token DDL fixed.
- U10 /tasks UX: cross-tab search, plain language, one status vocabulary, in-scroll pagination, -22% DOM (recovered after spend-limit kill: dead agent's staged work verified+landed).
- U11 run-log page is a real page (?raw=1 contract kept), no server paths, daily copytruncate log rotation (verified against live O_APPEND fds).
- U12 35 task profiles made honest + lint inverse check + graveyard comments.
GATES: board canary 13/13 consoleErrors=0; Tessa final verdict APPROVED FOR ELMAR (segments: runs, task-detail, schedule labels, tasks-UX all approved; her stale carry-overs #3/#8 re-verified fixed live).
TRUE REMAINDERS (minor, to MC-4912 roadmap): radio-daily-brief fails without emitting an error line (task-side); task-detail FAILED stat aggregates timeout+failed. Plus U5b (provider-statement reconciliation) parked. PROCESS NOTES: one agent killed by monthly spend limit mid-U10 (staged work recovered, zero loss); board-manager landed cost-telemetry commits concurrently during the wave (aligned via origin, ledger rebuilt); controller inline-edit violation (Jun-10 23:03) still open as D-enforcement follow-up.
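The U9 item mentions shadowed definitions guarded by an AST check. One way such a guard could work is sketched below; the function name and the sample module are hypothetical, and the actual guard used for scheduler.py is not shown in this ticket.

```python
import ast
from collections import defaultdict


def shadowed_defs(source: str) -> dict:
    """Map each top-level function defined more than once to its line numbers."""
    tree = ast.parse(source)
    lines = defaultdict(list)
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines[node.name].append(node.lineno)
    return {name: nums for name, nums in lines.items() if len(nums) > 1}


# Made-up module in which the second tick() silently replaces the first.
sample = '''
def tick():
    return 1

def helper():
    return 2

def tick():
    return 3
'''

print(shadowed_defs(sample))
```

For the sample this reports `tick` at lines 2 and 8. Run as a test against the real module, an empty result would confirm that no duplicate definitions have crept back in.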

luci-board-manager 2d ago

[visibility-only] Controller consumed blocked internal Kanban handoff t_5b213a49 as review-required implementation evidence (root commit d5b64646659f0579490a646da9abf11b4aef7bf6; MC app commit 17980681230cb39bfb0efd3d2d1a8d7bf2fda3fb) and routed the controller gate/landing lane to mc-internal card t_a0002677. claudebuilder could not start because Anthropic returned HTTP 400 extra-usage quota, so the same card was reclaimed/reassigned to codexbuilder. Dispatch verified running pid 1664550 with cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4942-u5-cost-telemetry-root. MC remains visibility-only; no MC pickup/runtime endpoints were used. No human input needed unless the gate finds a genuine product/spend/credential blocker.
