MC-5015

Benchmark Codex and agy for scheduled-task routing
2026-06-13 07:35:01 SAST
State: Done
Next Action: Closed
Owner: Luci
Runtime: Closed
Age: 23h ago
Ticket is done; runtime is closed. · profile claude_opus_1m_high

Description

Follow-up to MC-5006. Get real behavior benchmarks for Codex and agy on the scheduled-task provider eval suite, especially Life Manager triage, Life Manager digest, memory extraction, ops/watchdog summarization, and code-review council lite.

Why:
- MC-5006 produced useful GLM/MiniMax/Kimi results, but Codex and agy were blocked before behavior scoring.
- Codex failed with OpenAI Responses 401 missing bearer/basic auth.
- agy failed with interactive Google Antigravity OAuth timeout.
- We need Codex + agy data before making long-term scheduled-task routing decisions.

Scope:
1. Fix or document non-interactive Codex auth for the local eval environment.
2. Fix or document non-interactive agy auth, or adapt the eval to the same direct extractor path currently used by memory-extractor tasks.
3. Rerun scripts/provider_behavior_eval.py on Codex and agy across the five MC-5006 scenarios.
4. Update reports/provider_eval with Codex/agy results and a revised routing recommendation.
5. Keep Anthropic baseline optional; do not spend Anthropic unless explicitly approved.
6. Do not change production scheduled-task routing from this ticket; recommendations only.

Acceptance:
- Codex has either scored benchmark rows for all five scenarios, or a precise verified blocker with command/output evidence.
- agy has either scored benchmark rows for all five scenarios, or a precise verified blocker with command/output evidence.
- Revised report clearly says whether Codex/agy are default / fallback / avoid for triage, digest, memory extraction, ops/watchdog, and code-review-lite.
- Focused eval tests pass.
- No production runtime_profile/task routing changes are made.

Expected check-in: today.
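
The per-provider acceptance rule (scored rows for all five scenarios, or a verified blocker with evidence) could be checked mechanically. The sketch below is only an illustration under stated assumptions: the scenario keys, the `ProviderResult` type, and `acceptance_status` are hypothetical names and are not taken from scripts/provider_behavior_eval.py, whose actual output format is not shown in this ticket.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical scenario keys for the five MC-5006 scenarios named in the ticket.
SCENARIOS = (
    "life_manager_triage",
    "life_manager_digest",
    "memory_extraction",
    "ops_watchdog_summarization",
    "code_review_council_lite",
)


@dataclass
class ProviderResult:
    """Hypothetical container for one provider's eval outcome."""
    name: str
    scores: Dict[str, float] = field(default_factory=dict)  # scenario -> score
    blocker: Optional[str] = None  # command/output evidence, if blocked


def acceptance_status(result: ProviderResult) -> str:
    """Return 'scored', 'blocked', or an 'incomplete: ...' message."""
    missing = [s for s in SCENARIOS if s not in result.scores]
    if not missing:
        return "scored"
    if result.blocker:
        return "blocked"
    return "incomplete: missing " + ", ".join(missing)


if __name__ == "__main__":
    # Placeholder values for illustration only; these are not real results.
    complete = ProviderResult("example_a", {s: 0.0 for s in SCENARIOS})
    blocked = ProviderResult("example_b", blocker="401 missing bearer/basic auth")
    partial = ProviderResult("example_c", {"memory_extraction": 0.0})
    for r in (complete, blocked, partial):
        print(r.name, "->", acceptance_status(r))
```

A provider with partial scores and no documented blocker is reported as incomplete, which matches the ticket's requirement that each provider end in exactly one of the two accepted states.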

Activity

No activity yet