MC-5030

Benchmark cheaper GPT tiers for scheduled-task routing
2026-06-13 07:36:40 SAST
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 21h ago
Ticket is done; runtime is closed. · profile claude_opus_1m_high

Description

Follow-up to the MC-5006/MC-5015 scheduled-task provider eval and Elmar's question about GPT 5.5 vs 5.4/5.3 cost tiers.

Goal: extend the scheduled-task benchmark matrix with cheaper OpenAI/OpenRouter GPT tiers so that premium Codex/Claude are reserved only where needed.

Current OpenRouter prices observed from /api/v1/models on 2026-06-12:
- openai/gpt-5.5: $5.00 / 1M input, $30.00 / 1M output
- openai/gpt-5.4: $2.50 / 1M input, $15.00 / 1M output
- openai/gpt-5.3-codex and openai/gpt-5.3-chat: $1.75 / 1M input, $14.00 / 1M output
- openai/gpt-5.4-mini: $0.75 / 1M input, $4.50 / 1M output
- openai/gpt-5.4-nano: $0.20 / 1M input, $1.25 / 1M output
- minimax/minimax-m3: $0.30 / 1M input, $1.20 / 1M output

Scope:
1. Add benchmark adapters for the OpenRouter/OpenAI API models above where credentials are available.
2. Run the same five scheduled-task fixtures: Life Manager triage, digest, memory extraction, ops/watchdog, code-review lite.
3. Record exact model id, provider route, latency, parse rate, hard/soft score, and approximate per-run cost.
4. Compare against MiniMax M3, GLM 5.1, Kimi K2.6, Codex gpt-5.5 xhigh, and agy current results.
5. Produce a routing recommendation: where 5.4/5.3/mini/nano are good enough, where MiniMax remains cheaper/better, and where premium Codex/Claude should be kept.
6. No production scheduled-task routing changes from this ticket; recommendations only.

Acceptance:
- Report and machine JSON under reports/provider_eval/.
- Focused tests pass.
- Cost table includes per-1M prices and a sample scheduled-task cost for 20k input + 2k output.
- Explicitly say whether any lower GPT tier should replace MiniMax/GLM/Kimi for scheduled tasks.
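The sample scheduled-task cost asked for in the acceptance criteria (20k input + 2k output) follows directly from the per-1M prices listed in the description. The snippet below is a minimal sketch of that arithmetic only, not the benchmark's actual code: the names `PRICES_PER_1M` and `run_cost` are hypothetical, and the prices are copied from the 2026-06-12 observation above, so they may be stale.

```python
# Illustrative sketch: names here are hypothetical, not from the benchmark repo.
# Prices are USD per 1M tokens as (input, output), copied from the ticket
# description (observed on OpenRouter on 2026-06-12).
PRICES_PER_1M = {
    "openai/gpt-5.5": (5.00, 30.00),
    "openai/gpt-5.4": (2.50, 15.00),
    "openai/gpt-5.3-codex": (1.75, 14.00),
    "openai/gpt-5.3-chat": (1.75, 14.00),
    "openai/gpt-5.4-mini": (0.75, 4.50),
    "openai/gpt-5.4-nano": (0.20, 1.25),
    "minimax/minimax-m3": (0.30, 1.20),
}


def run_cost(model, input_tokens=20_000, output_tokens=2_000):
    """Approximate USD cost of one run at the listed per-1M prices."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Print the sample cost table, cheapest model first.
for model in sorted(PRICES_PER_1M, key=run_cost):
    in_price, out_price = PRICES_PER_1M[model]
    print(f"{model:22s} in ${in_price:5.2f}  out ${out_price:5.2f}  "
          f"sample run ${run_cost(model):.4f}")
```

At these prices the sample run costs about $0.16 on gpt-5.5, $0.08 on gpt-5.4, $0.063 on either gpt-5.3 variant, $0.024 on gpt-5.4-mini, $0.0084 on MiniMax M3, and $0.0065 on gpt-5.4-nano. This covers token pricing only; it says nothing about parse rate or score, which the benchmark has to measure.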
