MC-5030 — Benchmark cheaper GPT tiers for scheduled-task routing


State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 21h ago


Ticket is done; runtime is closed. · profile claude_opus_1m_high

Description


Follow-up to MC-5006/MC-5015 scheduled-task provider eval and Elmar question about GPT 5.5 vs 5.4/5.3 cost tiers.

Goal: extend the scheduled-task benchmark matrix with cheaper OpenAI/OpenRouter GPT tiers so premium Codex/Claude are reserved only where needed.

Current OpenRouter prices observed from /api/v1/models on 2026-06-12:
- openai/gpt-5.5: $5.00 / 1M input, $30.00 / 1M output
- openai/gpt-5.4: $2.50 / 1M input, $15.00 / 1M output
- openai/gpt-5.3-codex and openai/gpt-5.3-chat: $1.75 / 1M input, $14.00 / 1M output
- openai/gpt-5.4-mini: $0.75 / 1M input, $4.50 / 1M output
- openai/gpt-5.4-nano: $0.20 / 1M input, $1.25 / 1M output
- minimax/minimax-m3: $0.30 / 1M input, $1.20 / 1M output

Scope:
1. Add benchmark adapters for OpenRouter/OpenAI API models above where credentials are available.
2. Run the same five scheduled-task fixtures: Life Manager triage, digest, memory extraction, ops/watchdog, code-review lite.
3. Record exact model id, provider route, latency, parse rate, hard/soft score, and approximate per-run cost.
4. Compare against MiniMax M3, GLM 5.1, Kimi K2.6, Codex gpt-5.5 xhigh, and agy current results.
5. Produce routing recommendation: where 5.4/5.3/mini/nano are good enough, where MiniMax remains cheaper/better, and where premium Codex/Claude should be kept.
6. No production scheduled-task routing changes from this ticket; recommendations only.

Acceptance:
- Report and machine JSON under reports/provider_eval/.
- Focused tests pass.
- Cost table includes per-1M and sample scheduled-task cost for 20k input + 2k output.
- Explicitly say whether any GPT lower tier should replace MiniMax/GLM/Kimi for scheduled tasks.
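The sample scheduled-task cost asked for in Acceptance (20k input + 2k output) follows directly from the per-1M prices listed above. A minimal sketch of that arithmetic, assuming plain list prices with no caching, batching, or subscription discounts; the names `PRICES_PER_1M` and `run_cost` are illustrative and not taken from the benchmark code:

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the price list in this ticket.
PRICES_PER_1M = {
    "openai/gpt-5.5": ("5.00", "30.00"),
    "openai/gpt-5.4": ("2.50", "15.00"),
    "openai/gpt-5.3-codex": ("1.75", "14.00"),
    "openai/gpt-5.3-chat": ("1.75", "14.00"),
    "openai/gpt-5.4-mini": ("0.75", "4.50"),
    "openai/gpt-5.4-nano": ("0.20", "1.25"),
    "minimax/minimax-m3": ("0.30", "1.20"),
}


def run_cost(model, input_tokens=20_000, output_tokens=2_000):
    """Approximate USD cost of one run at list price (hypothetical helper)."""
    price_in, price_out = (Decimal(p) for p in PRICES_PER_1M[model])
    # Decimal keeps the small per-run amounts exact instead of float-rounded.
    return (input_tokens * price_in + output_tokens * price_out) / Decimal(1_000_000)


if __name__ == "__main__":
    for model in PRICES_PER_1M:
        print(f"{model:24s} ${run_cost(model):.4f}")
```

At these prices one 20k/2k run works out to $0.16 on openai/gpt-5.5, $0.08 on openai/gpt-5.4, $0.0065 on openai/gpt-5.4-nano, and $0.0084 on minimax/minimax-m3. This covers only the direct API view; it says nothing about quality or subscription usage.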

Activity


Details: Done · High · Luci


Source: telegram (human)
Project: Mission Control
Created: 21h ago
Updated: 18h ago


Evidence

Ticket is done; runtime is closed. MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.


luci 21h ago

Created from Elmar follow-up: compare GPT-5.5 vs cheaper GPT-5.4/5.3/mini/nano tiers and decide whether to add them to the scheduled-task routing benchmark before changing task profiles.

luci 20h ago

Added requirement from Elmar: do not compare only API $/token. Include subscription-token / subscription-quota usage as a first-class metric versus GPT-5.5, especially for Codex/Claude-style subscription-backed routes. Output should separate: (a) direct API/OpenRouter marginal token cost, (b) subscription token/usage burn, (c) effective scheduled-task cost under our actual quota/budget constraints.
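One way to keep the three cost views (a), (b), (c) separate in the machine JSON is to give each its own field next to the metrics from Scope item 3. The sketch below is only an illustration: the class and field names are hypothetical, the unit for subscription burn is left open because it depends on the plan, and no formula for effective cost is implied since the ticket does not define one.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TierResult:
    """One benchmark row per model and fixture (hypothetical schema)."""

    model_id: str
    provider_route: str
    latency_s: float
    parse_rate: float
    hard_score: float
    soft_score: float
    # (a) direct API/OpenRouter marginal token cost for the run, in USD.
    api_cost_usd: Optional[float] = None
    # (b) subscription token/usage burn; unit depends on the plan, None if not applicable.
    subscription_burn: Optional[float] = None
    # (c) effective cost under the actual quota/budget constraints, in USD.
    effective_cost_usd: Optional[float] = None


def to_json(result: TierResult) -> str:
    """Serialize a row; unset cost views stay explicit as null."""
    return json.dumps(asdict(result), sort_keys=True)
```

Leaving an unmeasured view as null rather than 0 keeps "not measured" distinguishable from "free" when the rows are compared against GPT-5.5.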

luci-board-manager 20h ago

[visibility-only board-manager] This Luci-owned benchmark ticket surfaced while the mc-internal Kanban board is integrity-corrupt and recovery is parked under MC-5025. I am not launching MC pickup/runtime or creating a new internal Kanban card while the workbench DB is corrupt. Parked this ticket behind MC-5025; release/reroute after the Kanban DB is repaired or a controller selects a verified non-Kanban fallback.

luci-board-manager 19h ago

[visibility-only controller] Cleared stale dependency on MC-5025: mc-internal Kanban DB repair is now done/verified. MC-5030 is not being dispatched in this tick because MC-5037 is the current active codexbuilder lane; parked as backlog for the next controller release. No MC runtime/pickup/send/harvest endpoint was used.

luci-board-manager 18h ago

[visibility-only] Released MC-5030 now that MC-5037 is closed and mc-internal has capacity. Routed the benchmark outside MC to internal Kanban card t_3a8c9cbf (codexbuilder) in isolated worktree `/home/lucienne/workspace/_mc_internal_worktrees/MC-5030-gpt-tier-benchmark`; verified worker pid=3113432 cwd matches that worktree. MC is mirrored as waiting/kanban_active while the external worker produces report/JSON/test/commit evidence. No MC runtime/pickup/send/harvest endpoint was used.

luci-board-manager 18h ago

[visibility-only gate] Closed MC-5030 after independent controller gate. Consumed mc-internal card t_3a8c9cbf: worker produced report/JSON/raw MiniMax outputs under reports/provider_eval/ and source branch origin/kb/MC-5030-gpt-tier-benchmark at abf316b39f05895eda7da36c18918dc554dd8fda. Controller verified artifacts, report content, JSON parse, `python3 -m pytest tests/test_provider_behavior_eval.py -q` => 4 passed, `git diff --check` passed, then cherry-picked/pushed the report artifacts to origin/master as 6bc71b14b3d954619b3cd47b50ef1aac84564ec8. No production scheduled-task routing changes were made; recommendation says leave routing unchanged until live GPT keys are available and fixture-scored. No MC runtime/pickup/send/harvest endpoint was used.
