Benchmark cheaper GPT tiers for scheduled-task routing
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 21h ago
Ticket is done; runtime is closed. · profile claude_opus_1m_high
Description
MC-5030
Follow-up to the MC-5006/MC-5015 scheduled-task provider eval and Elmar's question about GPT 5.5 vs 5.4/5.3 cost tiers.
Goal: extend the scheduled-task benchmark matrix with cheaper OpenAI/OpenRouter GPT tiers so premium Codex/Claude are reserved only where needed.
Current OpenRouter prices observed from /api/v1/models on 2026-06-12:
- openai/gpt-5.5: $5.00 / 1M input, $30.00 / 1M output
- openai/gpt-5.4: $2.50 / 1M input, $15.00 / 1M output
- openai/gpt-5.3-codex and openai/gpt-5.3-chat: $1.75 / 1M input, $14.00 / 1M output
- openai/gpt-5.4-mini: $0.75 / 1M input, $4.50 / 1M output
- openai/gpt-5.4-nano: $0.20 / 1M input, $1.25 / 1M output
- minimax/minimax-m3: $0.30 / 1M input, $1.20 / 1M output
Scope:
1. Add benchmark adapters for OpenRouter/OpenAI API models above where credentials are available.
2. Run the same five scheduled-task fixtures: Life Manager triage, digest, memory extraction, ops/watchdog, code-review lite.
3. Record exact model id, provider route, latency, parse rate, hard/soft score, and approximate per-run cost.
4. Compare against MiniMax M3, GLM 5.1, Kimi K2.6, Codex gpt-5.5 xhigh, and agy current results.
5. Produce routing recommendation: where 5.4/5.3/mini/nano are good enough, where MiniMax remains cheaper/better, and where premium Codex/Claude should be kept.
6. No production scheduled-task routing changes from this ticket; recommendations only.
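A minimal sketch of what one recorded row from scope item 3 could look like, so the machine JSON stays comparable across providers. The class name `EvalRecord`, its field names, and the helper `to_machine_json` are hypothetical and not taken from the repository; the values in the usage note are placeholders, not benchmark results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    """One benchmark row per (model, fixture). Field names are assumptions."""

    model_id: str           # exact model id, e.g. "openai/gpt-5.4-mini"
    provider_route: str     # e.g. "openrouter" or "openai-api"
    fixture: str            # one of the five scheduled-task fixtures
    latency_s: float        # wall-clock latency in seconds
    parse_rate: float       # fraction of runs whose output parsed, 0.0 to 1.0
    hard_score: float
    soft_score: float
    approx_cost_usd: float  # approximate per-run cost


def to_machine_json(records):
    """Serialize records to a stable, diff-friendly JSON string."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)
```

Usage, with placeholder numbers only: `to_machine_json([EvalRecord("openai/gpt-5.4-mini", "openrouter", "digest", 0.0, 0.0, 0.0, 0.0, 0.0)])` returns a JSON array with one object per record, which could then be written under reports/provider_eval/.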
Acceptance:
- Report and machine JSON under reports/provider_eval/.
- Focused tests pass.
- Cost table includes per-1M and sample scheduled-task cost for 20k input + 2k output.
- Explicitly say whether any GPT lower tier should replace MiniMax/GLM/Kimi for scheduled tasks.
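The sample scheduled-task cost in the acceptance criteria is plain arithmetic over the per-1M prices listed above. The sketch below works it out for 20k input + 2k output tokens. The prices are copied from the description; the names `PRICES_USD_PER_1M`, `run_cost_usd`, and `sample_cost_table` are hypothetical. It covers direct API/OpenRouter marginal cost only, not subscription quota burn.

```python
# (input, output) USD per 1M tokens, as observed from OpenRouter /api/v1/models.
PRICES_USD_PER_1M = {
    "openai/gpt-5.5": (5.00, 30.00),
    "openai/gpt-5.4": (2.50, 15.00),
    "openai/gpt-5.3-codex": (1.75, 14.00),
    "openai/gpt-5.3-chat": (1.75, 14.00),
    "openai/gpt-5.4-mini": (0.75, 4.50),
    "openai/gpt-5.4-nano": (0.20, 1.25),
    "minimax/minimax-m3": (0.30, 1.20),
}


def run_cost_usd(input_tokens, output_tokens, price_in, price_out):
    """Marginal cost of one run, given per-1M-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def sample_cost_table(input_tokens=20_000, output_tokens=2_000):
    """Per-run cost for every listed model at the sample task size."""
    return {
        model: round(run_cost_usd(input_tokens, output_tokens, p_in, p_out), 4)
        for model, (p_in, p_out) in PRICES_USD_PER_1M.items()
    }


if __name__ == "__main__":
    # Print cheapest first.
    for model, cost in sorted(sample_cost_table().items(), key=lambda kv: kv[1]):
        print(f"{model:24s} ${cost:.4f} per run")
```

At this task size, openai/gpt-5.5 comes to $0.16 per run, openai/gpt-5.4 to $0.08, and openai/gpt-5.4-mini to $0.024. openai/gpt-5.4-nano (about $0.0065) is slightly cheaper than minimax/minimax-m3 (about $0.0084) on marginal token cost alone.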
Activity
Done · High · Luci
telegram (human) · Mission Control · 21h ago · 18h ago
MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
Created from Elmar's follow-up: compare GPT-5.5 vs cheaper GPT-5.4/5.3/mini/nano tiers and decide whether to add them to the scheduled-task routing benchmark before changing task profiles.
luci · 20h ago
Added requirement from Elmar: do not compare only API $/token. Include subscription-token / subscription-quota usage as a first-class metric versus GPT-5.5, especially for Codex/Claude-style subscription-backed routes. Output should separate: (a) direct API/OpenRouter marginal token cost, (b) subscription token/usage burn, (c) effective scheduled-task cost under our actual quota/budget constraints.
luci-board-manager · 20h ago
[visibility-only board-manager] This Luci-owned benchmark ticket surfaced while the mc-internal Kanban board is integrity-corrupt and recovery is parked under MC-5025. I am not launching MC pickup/runtime or creating a new internal Kanban card while the workbench DB is corrupt. Parked this ticket behind MC-5025; release/reroute after the Kanban DB is repaired or a controller selects a verified non-Kanban fallback.
luci-board-manager · 19h ago
[visibility-only controller] Cleared stale dependency on MC-5025: mc-internal Kanban DB repair is now done/verified. MC-5030 is not being dispatched in this tick because MC-5037 is the current active codexbuilder lane; parked as backlog for the next controller release. No MC runtime/pickup/send/harvest endpoint was used.
luci-board-manager · 18h ago
[visibility-only] Released MC-5030 now that MC-5037 is closed and mc-internal has capacity. Routed the benchmark outside MC to internal Kanban card t_3a8c9cbf (codexbuilder) in isolated worktree `/home/lucienne/workspace/_mc_internal_worktrees/MC-5030-gpt-tier-benchmark`; verified worker pid=3113432 cwd matches that worktree. MC is mirrored as waiting/kanban_active while the external worker produces report/JSON/test/commit evidence. No MC runtime/pickup/send/harvest endpoint was used.
luci-board-manager · 18h ago
[visibility-only gate] Closed MC-5030 after independent controller gate. Consumed mc-internal card t_3a8c9cbf: worker produced report/JSON/raw MiniMax outputs under reports/provider_eval/ and source branch origin/kb/MC-5030-gpt-tier-benchmark at abf316b39f05895eda7da36c18918dc554dd8fda. Controller verified artifacts, report content, JSON parse, `python3 -m pytest tests/test_provider_behavior_eval.py -q` => 4 passed, `git diff --check` passed, then cherry-picked/pushed the report artifacts to origin/master as 6bc71b14b3d954619b3cd47b50ef1aac84564ec8. No production scheduled-task routing changes were made; recommendation says leave routing unchanged until live GPT keys are available and fixture-scored. No MC runtime/pickup/send/harvest endpoint was used.