Ticket is done; runtime is closed.·profile claude_opus_1m_medium
Description
MC-4882
## RE-SCOPED 2026-06-09 (Lucienne): manifest approach — the warmer didn't work
The in-process TTL cache (MC-4873) + background warmer (first MC-4882 attempt, commit 6e19e1a) does NOT reliably kill the cold scan: verified live `/reports` = 20.8s after 90s idle (warmer doesn't keep the gthread workers' caches warm; -w2, no --preload). DO NOT just retry the thread. Replace with a DURABLE SHARED MANIFEST so a full 340-file scan NEVER happens in the request path.
## Design
- Manifest file: `PKA-Outputs/_state/reports-index.json` (Luci path `/home/lucienne/gdrive/PKA-Outputs/_state/reports-index.json`) listing every report: {rel, title, date, type, mtime, base}.
- MC `/reports` LISTING reads the manifest (fast JSON) instead of scanning + reading ~340 files. It's a FILE shared by all workers → worker-count-proof (the thing that broke the in-proc cache).
- A `rebuild_reports_manifest()` scans both bases (cloud + ~/workspace/reports), writes JSON ATOMICALLY (tmp + os.replace). Triggered OUT of the request path: a scheduled refresh (every ~5-10 min) and/or after a report is written. The expensive scan runs in the builder, never in a user request.
- Listing falls back to the current live scan if the manifest is missing/unreadable (so it NEVER breaks). May keep the short TTL cache as a thin extra layer or remove it.
## Constraints
- NON-BREAKING: manifest absent/corrupt → fall back to live scan (current behavior). Atomic manifest write (tmp+rename); tolerant read.
- Don't change report SERVING. Preserve listing content + order (date desc) + full count.
- Do NOT deploy live yourself — produce review-required evidence; Lucienne (controller) guarded-deploys.
## Verification (before review-required)
- `/reports` cold (FRESH worker / after >TTL idle) < 2s — manifest read, NO 340-file scan (the actual goal that the warmer failed)
- A newly-added report appears in the listing within the refresh window
- Listing count + order unchanged vs live scan; a report serves 200; board / loads
- app.py parses; tests pass
## Terminal-state report REQUIRED: STATUS, 1-line, commit SHA, literal /reports cold-timing(seconds) + counts, follow-ups. Don't deploy live.
Activity
done
INTERACTIVE
Luci is working...
Details —
Done
· Medium
· Luci
▼
SState
Done
Closed
PPeople
TTiming / Details▼
api (human)
Mission Control
4d ago
4d ago
Advanced / Operator evidence
RRouting owner
ROperator console
Ticket is done; runtime is closed.MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
WWorkflow
Start Dev Review + QA ▾
Select phases to include:
Agents
Review Gates
Decision
WAT routing: choose an agent, review gate, or decision. Buttons use the live runtime when one is attached.
REVIEW (Lucienne) — VERIFICATION FAILED, reopen recommended. MC-4882 warmer (commit 6e19e1a) landed+deployed by controller, BUT independent live test shows it does NOT fix the cold load: TTL=90s, warmer interval=75s, warms-immediately-then-every-75s (logic looks right), -w2 gthread, NO --preload. Test: GET /reports t0=0.0s (warm), sleep 95s (past TTL), GET /reports t+95s=20.8s (COLD scan again). So the warmer is not keeping the live workers' caches warm in practice — likely the import-time daemon thread isn't reliably running per gthread worker, OR warm_once() populates a different cache key/instance than the read path. Needs debug. Alternative per the code TODO: replace in-proc TTL cache with a durable shared manifest/index (_state/reports-index.json) so all workers share it and no scan is in the request path. No rollback needed (harmless — behaves as pre-4882). Lucienne to decide fix path with Elmar.
Lucienne4d ago
## RE-SCOPED 2026-06-09 (Lucienne): manifest approach — the warmer didn't work
The in-process TTL cache (MC-4873) + background warmer (first MC-4882 attempt, commit 6e19e1a) does NOT reliably kill the cold scan: verified live `/reports` = 20.8s after 90s idle (warmer doesn't keep the gthread workers' caches warm; -w2, no --preload). DO NOT just retry the thread. Replace with a DURABLE SHARED MANIFEST so a full 340-file scan NEVER happens in the request path.
## Design
- Manifest file: `PKA-Outputs/_state/reports-index.json` (Luci path `/home/lucienne/gdrive/PKA-Outputs/_state/reports-index.json`) listing every report: {rel, title, date, type, mtime, base}.
- MC `/reports` LISTING reads the manifest (fast JSON) instead of scanning + reading ~340 files. It's a FILE shared by all workers → worker-count-proof (the thing that broke the in-proc cache).
- A `rebuild_reports_manifest()` scans both bases (cloud + ~/workspace/reports), writes JSON ATOMICALLY (tmp + os.replace). Triggered OUT of the request path: a scheduled refresh (every ~5-10 min) and/or after a report is written. The expensive scan runs in the builder, never in a user request.
- Listing falls back to the current live scan if the manifest is missing/unreadable (so it NEVER breaks). May keep the short TTL cache as a thin extra layer or remove it.
## Constraints
- NON-BREAKING: manifest absent/corrupt → fall back to live scan (current behavior). Atomic manifest write (tmp+rename); tolerant read.
- Don't change report SERVING. Preserve listing content + order (date desc) + full count.
- Do NOT deploy live yourself — produce review-required evidence; Lucienne (controller) guarded-deploys.
## Verification (before review-required)
- `/reports` cold (FRESH worker / after >TTL idle) < 2s — manifest read, NO 340-file scan (the actual goal that the warmer failed)
- A newly-added report appears in the listing within
luci-board-manager4d ago
Visibility-only controller routed this Luci-owned implementation ticket to internal Kanban `t_b6e967cb`. I created isolated worktree `/home/lucienne/workspace/state/kanban-worktrees/mc-4882-reports-cache-warmer` from Mission Control origin/master, then started with `claudebuilder`; claudebuilder immediately hit non-retryable Anthropic extra-usage HTTP 400, so I reclaimed the same card and rerouted to `codexbuilder`. Verified codexbuilder run 195 pid 3253607 is alive in that isolated worktree. MC stays `waiting` / `kanban_active:t_b6e967cb`. No MC pickup/runtime endpoint was called.
luci-board-manager4d ago
Controller gate passed and deployed MC-4882. Evidence: internal Kanban `t_b6e967cb` commit `6e19e1aa8c965fbe4090fe0b4d0874f7abee73a5` was independently reviewed, fast-forwarded to Mission Control `origin/master`, focused reports tests passed (`tests/test_mc4867_reports_cloud.py` => 12 passed), `app.py` AST parsed, `graphify update .` completed, gunicorn was reloaded with HUP after systemctl restart required interactive auth, and live local smoke on port 3001 passed: `/api/health` 200, `/` 200, `/reports` 200, then six `/reports` repeats at ~0.011-0.019s. No MC runtime/pickup/send/harvest endpoint was used.
luci-board-manager4d ago
[visibility-only controller] Re-scoped cold `/reports` fix is now routed to internal Hermes Kanban card `t_a3b65425` on board `mc-internal` for the durable shared manifest/index approach. The previous warmer evidence was rejected by independent cold-idle timing (~20.8s after TTL), so this worker is replacing request-path scans with an atomic shared manifest and fallback-to-scan behavior. Verified codexbuilder run 201 is alive in isolated worktree `/home/lucienne/workspace/state/kanban-worktrees/mc-4882-reports-manifest-20260609134156` (pid 3363233 cwd matches). MC remains visibility-only: no MC runtime/pickup/send/harvest endpoint was used.
luci-board-manager4d ago
[visibility-only controller] Cleared stale terminal `done_sha` from the rejected warmer attempt; MC-4882 is open again and now blocked only by internal Kanban card `t_a3b65425` for the durable manifest fix.
luci-board-manager4d ago
[visibility-only controller] Gate passed and deployed MC-4882 durable reports manifest fix. Internal Kanban `t_a3b65425` worker commit `881d716` was cherry-picked to Mission Control `master` as `3762cf9`, pushed to origin/master, and deployed by restarting `luci-dashboard.service` via controlled gunicorn/systemd restart after systemctl restart required interactive auth. Evidence: focused tests `tests/test_mc4867_reports_cloud.py tests/test_breakglass_api_probe_window.py` passed (18/18), manifest build wrote `/home/lucienne/gdrive/PKA-Outputs/_state/reports-index.json` with 303 reports, `graphify update .` completed, live `/api/health` 200, live `/api/reports` 200 in 0.3588s with 303 reports from a fresh restarted service, and live `/reports` 200 in 0.0495s. No MC runtime/pickup/send/harvest endpoint was used.