MC-5006 — Evaluate non-Claude providers for scheduled tasks



State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 1d ago


Ticket is done; runtime is closed. · profile claude_opus_1m_high

Description


Build and run a small, repeatable model-routing eval for Luci scheduled-task workloads.

Goal: Compare GLM, MiniMax, Kimi, Codex, and agy on the types of scheduled tasks Luci actually runs, so Elmar can decide long-term routing without relying on gut feel. Keep Anthropic as a baseline only where quota allows; do not consume Anthropic unnecessarily before the F1 window finishes.

Context:
- Existing smoke test only checks provider liveness: scripts/provider_smoke_test.py.
- Need behavior/quality eval, not just PONG.
- User specifically wants scheduled-task types, not generic benchmarks.
- Memory extractor already uses agy via PKA_EXTRACTION_PROVIDER=agy; include that task class but do not mislabel it as Claude.

Candidate task classes / scenarios:
1. Life Manager triage scan: classify incoming email/WhatsApp snippets into urgent, FYI, action, ignore; score precision/recall against a hand-labeled fixture.
2. Life Manager digest: produce a concise Telegram-ready digest from fixture events; score completeness, brevity, actionability, no hallucinated items.
3. Memory extraction: extract durable memories from short session transcripts; score keep/drop correctness and no stale/task-progress memories.
4. Support intake / ticket routing: turn inbound support text into ticket/ignore/escalate decisions; score routing and required fields.
5. Scheduled ops/watchdog summarization: read synthetic logs/task history and decide alert vs silent; score false positives/negatives.
6. Code-review council lite: review a small fixture diff; score whether known seeded bugs are caught.
7. Research/digest synthesis: summarize a source pack into a short brief; score citation grounding and decision usefulness.

Provider runners to support:
- GLM via claude-provider-env.sh glm / claude CLI profile where valid.
- MiniMax via claude-provider-env.sh minimax MiniMax-M3.
- Kimi via Kimi direct helper where appropriate, and optionally Claude Code harness if known-good.
- Codex via CLI/API runner for coding/review scenarios.
- agy via agy -p for extraction/digest-style scenarios.
- Optional Anthropic baseline after F1 safety window or on a tiny sample only.

Eval design:
- Use fixtures committed under a repo-appropriate path, e.g. tests/provider_eval/fixtures/.
- For each scenario, store: prompt, input fixture, gold labels or rubric, max latency, allowed output schema.
- Capture: output, latency, exit status, approximate cost/quota label, parse success.
- Score hard assertions first: JSON/schema validity, required fields, no invented IDs, exact labels where possible.
- Score soft rubrics second, preferably with a non-tested judge or human review summary.
- Produce a Markdown report with per-provider recommendation by task class: default / acceptable fallback / avoid.

Acceptance:
- Harness can run a small eval suite locally on Luci without altering live scheduled tasks.
- Run at least one representative scenario from each of: triage, digest, memory extraction, ops/watchdog, code review.
- Produce a report with a routing matrix for GLM, MiniMax, Kimi, Codex, agy, and any Anthropic baseline used.
- Do not run expensive Anthropic baselines before F1 prediction safety unless explicitly approved.
- Link report in ticket comments and recommend any runtime_profile changes separately; do not auto-change production tasks from eval results without review.
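The "score hard assertions first" step of the eval design could look roughly like the sketch below. It is an illustration only, not the harness that landed on the branch: the `Scenario` fields, the `check_hard_assertions` helper, the fixture IDs, and the assumed output shape (a JSON object with an `items` list of `id`/`label` pairs) are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical per-scenario record; field names are illustrative."""
    name: str
    allowed_labels: set
    gold: dict                  # fixture item id -> expected label
    max_latency_s: float = 60.0


def check_hard_assertions(scenario, raw_output, latency_s):
    """Score one provider output against hard assertions only.

    Assumes the provider was asked to return JSON shaped like
    {"items": [{"id": "...", "label": "..."}]}.
    """
    result = {
        "parse_ok": False,
        "invented_ids": [],
        "bad_labels": [],
        "correct": 0,
        "total": len(scenario.gold),
        "within_latency": latency_s <= scenario.max_latency_s,
    }
    try:
        data = json.loads(raw_output)
        predicted = {item["id"]: item["label"] for item in data["items"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        return result  # unparseable or mis-shaped output fails all later checks

    result["parse_ok"] = True
    # IDs the model returned that do not exist in the fixture.
    result["invented_ids"] = [i for i in predicted if i not in scenario.gold]
    # Labels outside the allowed schema.
    result["bad_labels"] = [
        i for i, label in predicted.items()
        if not isinstance(label, str) or label not in scenario.allowed_labels
    ]
    # Exact-label matches against the hand-labeled gold set.
    result["correct"] = sum(
        1 for i, expected in scenario.gold.items() if predicted.get(i) == expected
    )
    return result


# Hypothetical two-item triage fixture and a provider output with one invented ID.
triage = Scenario(
    name="life_manager_triage",
    allowed_labels={"urgent", "FYI", "action", "ignore"},
    gold={"msg-1": "urgent", "msg-2": "ignore"},
)
output = '{"items": [{"id": "msg-1", "label": "urgent"}, {"id": "msg-9", "label": "ignore"}]}'
report = check_hard_assertions(triage, output, latency_s=3.2)
print(report)
```

Keeping these checks purely mechanical means a liveness-only reply such as PONG fails at the parse step, before any soft rubric or judge is involved.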

Activity


Details: Done · High · Luci


Source api (human)

Project Mission Control

Created 1d ago

Updated 18h ago


Evidence

Ticket is done; runtime is closed. MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.


system 1d ago

[visibility-only] Queued message recorded for Hermes Luci. MC did not claim the ticket or spawn a runtime.

luci-board-manager 1d ago

Visibility-only routing update: routed this Luci-owned technical ticket to internal Kanban card t_8bca8b6d on board mc-internal, assigned to codexbuilder, workspace /home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval. Verified card status=running, run=298, pid=2574826, cwd=/home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval. MC remains visibility-only; no MC runtime/pickup/send/harvest endpoints were used.

luci-board-manager 1d ago

Visibility-only controller routed blocked internal handoff t_8bca8b6d to follow-up Kanban card t_f0a64871 on board mc-internal (assignee=default, workspace=/home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval). MC remains visibility-only; no MC runtime/pickup/send/harvest endpoints were used.

luci-board-manager 1d ago

Visibility-only controller gate complete: internal Kanban t_f0a64871 reviewed and landed the non-Claude provider evaluation artifacts on branch kb/MC-5006-provider-eval at a0a3a8277d2baf2b1ac2287018cd6fc7411c3c48. Verified branch exists on origin, required report/script/test files are present, focused pytest passed (4 passed), and simulated non-mutating eval produced 25/25 ok rows for glm/minimax/kimi/codex/agy. No production routing/profile changes and no MC runtime/pickup/send/harvest endpoints were used.
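A simulated, non-mutating pass over the provider and scenario grid might be sketched as follows. This assumes the 25 rows are five providers times the five scenario classes from the acceptance criteria, which the comment does not state explicitly; `simulated_run` and the scenario keys are hypothetical stand-ins, not the landed script.

```python
from itertools import product

PROVIDERS = ["glm", "minimax", "kimi", "codex", "agy"]
# Scenario classes named in the acceptance criteria (key names are illustrative).
SCENARIOS = ["triage", "digest", "memory_extraction", "ops_watchdog", "code_review"]


def simulated_run(provider, scenario):
    """Stand-in for a real provider call: returns a canned row, touches nothing."""
    return {"provider": provider, "scenario": scenario, "status": "ok", "simulated": True}


# One row per (provider, scenario) pair; no network, CLI, or scheduler access.
rows = [simulated_run(p, s) for p, s in product(PROVIDERS, SCENARIOS)]
ok = sum(r["status"] == "ok" for r in rows)
print(f"{ok}/{len(rows)} ok rows")
```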

luci 18h ago

Reran Kimi benchmark after K2.7 Code update.

Run: reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.md
JSON: reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.json
Model identity: requested/resolved `kimi-for-coding (K2.7 Code)`.

Result: 4/5 scenarios passed, parse 5/5, hard score 66/68, average latency 16.1s.
- PASS Life Manager triage: 14/14
- PASS Life Manager digest: 16/16
- PASS Memory extraction: 13/13
- FAIL Ops/watchdog summary: 11/13; swapped severities (`run-2` expected high got medium; `run-3` expected medium got high)
- PASS Code-review-lite: 12/12

Conclusion unchanged for alerting: Kimi K2.7 is fine for triage/digest/memory/code-review-lite style static scheduled tasks, but still not safe as the deterministic ops/watchdog alert severity gate without prompt/rubric tightening.
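The totals in this rerun follow directly from the per-scenario hard scores. The sketch below reproduces that arithmetic and the severity mismatch; the figures are copied from the comment, while the tuple layout and the rule that a scenario passes only at full marks are assumptions made for illustration.

```python
# Per-scenario hard scores from the rerun comment: (name, earned, possible).
scenarios = [
    ("Life Manager triage", 14, 14),
    ("Life Manager digest", 16, 16),
    ("Memory extraction", 13, 13),
    ("Ops/watchdog summary", 11, 13),
    ("Code-review-lite", 12, 12),
]

earned = sum(e for _, e, _ in scenarios)
possible = sum(p for _, _, p in scenarios)
# Assumed pass rule: every hard assertion in the scenario must hold.
passed = [name for name, e, p in scenarios if e == p]

print(f"hard score {earned}/{possible}, {len(passed)}/{len(scenarios)} scenarios passed")

# Severity labels for the two watchdog runs called out in the comment.
expected = {"run-2": "high", "run-3": "medium"}
got = {"run-2": "medium", "run-3": "high"}
mismatches = {run: (expected[run], got[run]) for run in expected if expected[run] != got[run]}
print(mismatches)
```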

luci 18h ago

Stable copy saved under /home/lucienne/workspace/reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.md with raw outputs under /home/lucienne/workspace/reports/provider_eval/runs/MC-5006-kimi-k27-20260612T112417Z/.
