Build and run a small, repeatable model-routing eval for Luci scheduled-task workloads.
State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 1d ago
Ticket is done; runtime is closed. · profile claude_opus_1m_high
Description
MC-5006
Build and run a small, repeatable model-routing eval for Luci scheduled-task workloads.
Goal:
Compare GLM, MiniMax, Kimi, Codex, and agy on the types of scheduled tasks Luci actually runs, so Elmar can decide long-term routing without relying on gut feel. Keep Anthropic as a baseline only where quota allows; do not consume Anthropic unnecessarily before the F1 window finishes.
Context:
- Existing smoke test only checks provider liveness: scripts/provider_smoke_test.py.
- Need behavior/quality eval, not just PONG.
- User specifically wants scheduled-task types, not generic benchmarks.
- Memory extractor already uses agy via PKA_EXTRACTION_PROVIDER=agy; include that task class but do not mislabel it as Claude.
Candidate task classes / scenarios:
1. Life Manager triage scan: classify incoming email/WhatsApp snippets into urgent, FYI, action, ignore; score precision/recall against a hand-labeled fixture.
2. Life Manager digest: produce a concise Telegram-ready digest from fixture events; score completeness, brevity, actionability, no hallucinated items.
3. Memory extraction: extract durable memories from short session transcripts; score keep/drop correctness and no stale/task-progress memories.
4. Support intake / ticket routing: turn inbound support text into ticket/ignore/escalate decisions; score routing and required fields.
5. Scheduled ops/watchdog summarization: read synthetic logs/task history and decide alert vs silent; score false positives/negatives.
6. Code-review council lite: review a small fixture diff; score whether known seeded bugs are caught.
7. Research/digest synthesis: summarize a source pack into a short brief; score citation grounding and decision usefulness.
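The precision/recall scoring named for the triage scenario could be sketched as below. This is a minimal illustration, not the harness's actual scorer: the function name, the fixture values, and the exact label spellings are assumptions, and a real fixture would be loaded from the committed gold labels.

```python
from collections import Counter

# Label set taken from the triage scenario description; spelling is assumed.
LABELS = ("urgent", "FYI", "action", "ignore")


def per_label_precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong for this item
            fn[g] += 1  # gold label was missed for this item
    scores = {}
    for label in LABELS:
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical hand-labeled fixture and model output (illustrative only).
gold = ["urgent", "FYI", "action", "ignore", "urgent"]
predicted = ["urgent", "FYI", "urgent", "ignore", "action"]
scores = per_label_precision_recall(gold, predicted)
```

Reporting per label rather than one aggregate accuracy keeps a missed `urgent` item visible even when the easy `ignore` class dominates the fixture.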
Provider runners to support:
- GLM via claude-provider-env.sh glm / claude CLI profile where valid.
- MiniMax via claude-provider-env.sh minimax MiniMax-M3.
- Kimi via Kimi direct helper where appropriate, and optionally Claude Code harness if known-good.
- Codex via CLI/API runner for coding/review scenarios.
- agy via agy -p for extraction/digest-style scenarios.
- Optional Anthropic baseline after F1 safety window or on a tiny sample only.
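A provider runner that treats every provider as an opaque command line might look like the sketch below. It assumes the prompt is passed on stdin, which may not match how each CLI above actually takes input; `RunResult` and `run_provider` are hypothetical names, and the command used here is a stand-in Python one-liner, not a real provider invocation.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    provider: str
    output: str
    latency_s: float
    exit_status: Optional[int]  # None when the command timed out
    timed_out: bool


def run_provider(provider, argv, prompt, timeout_s=60.0):
    """Run one provider command and capture output, latency, and exit status."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv, input=prompt, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return RunResult(provider, "", time.monotonic() - start, None, True)
    return RunResult(
        provider, proc.stdout, time.monotonic() - start, proc.returncode, False
    )


# Stand-in command so the sketch runs without any provider installed.
stand_in_argv = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
result = run_provider("stand-in", stand_in_argv, "ping")
```

Keeping the per-provider command as a plain argv list means the runner itself never needs to know provider-specific flags; those stay in one mapping that can be reviewed separately.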
Eval design:
- Use fixtures committed under a repo-appropriate path, e.g. tests/provider_eval/fixtures/.
- For each scenario, store: prompt, input fixture, gold labels or rubric, max latency, allowed output schema.
- Capture: output, latency, exit status, approximate cost/quota label, parse success.
- Score hard assertions first: JSON/schema validity, required fields, no invented IDs, exact labels where possible.
- Score soft rubrics second, preferably with a non-tested judge or human review summary.
- Produce a Markdown report with per-provider recommendation by task class: default / acceptable fallback / avoid.
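The "hard assertions first" step could be sketched as follows. The output schema (an object with an `items` list whose entries carry `id` and `label`) is an assumption for illustration, as are the function name and the one-point-per-check weighting; the real scenarios define their own allowed schema.

```python
import json


def score_hard_assertions(raw_output, required_fields, allowed_ids, allowed_labels):
    """Return (passed, total, failures) for one raw model output."""
    total = 1  # one point for parsing into the assumed top-level shape
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0, total, ["output is not valid JSON"]
    if not isinstance(data, dict) or not isinstance(data.get("items"), list):
        return 0, total, ["expected an object with an 'items' list"]
    passed = 1
    failures = []
    for item in data["items"]:
        total += 3  # required fields, known ID, exact label
        if not isinstance(item, dict):
            failures.append("item is not an object")
            continue
        missing = [field for field in required_fields if field not in item]
        if missing:
            failures.append(f"missing fields: {missing}")
        else:
            passed += 1
        item_id = item.get("id")
        if isinstance(item_id, str) and item_id in allowed_ids:
            passed += 1
        else:
            failures.append(f"invented id: {item_id!r}")
        label = item.get("label")
        if isinstance(label, str) and label in allowed_labels:
            passed += 1
        else:
            failures.append(f"unknown label: {label!r}")
    return passed, total, failures


# Hypothetical output in which the model invents an ID absent from the fixture.
raw = json.dumps({"items": [{"id": "msg-1", "label": "urgent"},
                            {"id": "msg-9", "label": "action"}]})
passed, total, failures = score_hard_assertions(
    raw, ("id", "label"), {"msg-1", "msg-2"}, {"urgent", "FYI", "action", "ignore"}
)
```

Because these checks are deterministic, they can gate whether an output is even sent on to the soft-rubric judge.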
Acceptance:
- Harness can run a small eval suite locally on Luci without altering live scheduled tasks.
- Run at least one representative scenario from each of: triage, digest, memory extraction, ops/watchdog, code review.
- Produce a report with a routing matrix for GLM, MiniMax, Kimi, Codex, agy, and any Anthropic baseline used.
- Do not run expensive Anthropic baselines before F1 prediction safety unless explicitly approved.
- Link report in ticket comments and recommend any runtime_profile changes separately; do not auto-change production tasks from eval results without review.
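One possible way to render the routing matrix as a Markdown table is sketched below. The provider and task-class names come from this ticket; `render_routing_matrix` is a hypothetical helper, and the two ratings in the example are placeholders to show the layout, not eval results.

```python
PROVIDERS = ("GLM", "MiniMax", "Kimi", "Codex", "agy")
TASK_CLASSES = ("triage", "digest", "memory extraction", "ops/watchdog", "code review")
RATINGS = ("default", "acceptable fallback", "avoid")


def render_routing_matrix(ratings):
    """Render {(task_class, provider): rating} as a Markdown table."""
    header = "| Task class | " + " | ".join(PROVIDERS) + " |"
    divider = "|" + " --- |" * (len(PROVIDERS) + 1)
    rows = [header, divider]
    for task in TASK_CLASSES:
        cells = []
        for provider in PROVIDERS:
            # Cells without a rating are shown as "not run", never guessed.
            rating = ratings.get((task, provider), "not run")
            if rating != "not run" and rating not in RATINGS:
                raise ValueError(f"unknown rating {rating!r} for {task}/{provider}")
            cells.append(rating)
        rows.append(f"| {task} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Placeholder ratings for layout only; these are NOT findings from the eval.
example = {("triage", "GLM"): "default", ("code review", "Codex"): "acceptable fallback"}
matrix = render_routing_matrix(example)
print(matrix)
```

Rejecting ratings outside the three allowed values keeps the report aligned with the default / acceptable fallback / avoid vocabulary, and the output is only a report: nothing here touches runtime_profile settings.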
Activity
Details: Done · High · Luci
State: Done, Closed
People: api (human), Mission Control
Timing / Details: 1d ago, 18h ago
Advanced / Operator evidence
Routing owner
Operator console
MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
[visibility-only] Queued message recorded for Hermes Luci. MC did not claim the ticket or spawn a runtime.
luci-board-manager · 1d ago
Visibility-only routing update: routed this Luci-owned technical ticket to internal Kanban card t_8bca8b6d on board mc-internal, assigned to codexbuilder, workspace /home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval. Verified card status=running, run=298, pid=2574826, cwd=/home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval. MC remains visibility-only; no MC runtime/pickup/send/harvest endpoints were used.
luci-board-manager · 1d ago
Visibility-only controller routed blocked internal handoff t_8bca8b6d to follow-up Kanban card t_f0a64871 on board mc-internal (assignee=default, workspace=/home/lucienne/workspace/_mc_internal_worktrees/MC-5006-provider-eval). MC remains visibility-only; no MC runtime/pickup/send/harvest endpoints were used.
luci-board-manager · 1d ago
Visibility-only controller gate complete: internal Kanban t_f0a64871 reviewed and landed the non-Claude provider evaluation artifacts on branch kb/MC-5006-provider-eval at a0a3a8277d2baf2b1ac2287018cd6fc7411c3c48. Verified branch exists on origin, required report/script/test files are present, focused pytest passed (4 passed), and simulated non-mutating eval produced 25/25 ok rows for glm/minimax/kimi/codex/agy. No production routing/profile changes and no MC runtime/pickup/send/harvest endpoints were used.
luci · 18h ago
Reran Kimi benchmark after K2.7 Code update.
Run: reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.md
JSON: reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.json
Model identity: requested/resolved `kimi-for-coding (K2.7 Code)`.
Result: 4/5 scenarios passed, parse 5/5, hard score 66/68, average latency 16.1s.
- PASS Life Manager triage: 14/14
- PASS Life Manager digest: 16/16
- PASS Memory extraction: 13/13
- FAIL Ops/watchdog summary: 11/13; two severities were swapped (`run-2`: expected high, got medium; `run-3`: expected medium, got high)
- PASS Code-review-lite: 12/12
Conclusion unchanged for alerting: Kimi K2.7 is fine for triage/digest/memory/code-review-lite style static scheduled tasks, but still not safe as the deterministic ops/watchdog alert severity gate without prompt/rubric tightening.
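The rerun totals above can be cross-checked with a short tally. The per-scenario scores and the two severity mismatches are copied from this comment; the helper names, and the rule that a scenario passes only when it earns every hard point, are assumptions that happen to match the reported outcome.

```python
# (scenario, hard points earned, hard points available) from the rerun comment.
RERUN = [
    ("Life Manager triage", 14, 14),
    ("Life Manager digest", 16, 16),
    ("Memory extraction", 13, 13),
    ("Ops/watchdog summary", 11, 13),
    ("Code-review-lite", 12, 12),
]


def summarize(results):
    """Return (scenarios passed, scenarios run, hard points earned, hard points max)."""
    passed = sum(1 for _, earned, available in results if earned == available)
    earned_total = sum(earned for _, earned, _ in results)
    available_total = sum(available for _, _, available in results)
    return passed, len(results), earned_total, available_total


def severity_mismatches(expected, actual):
    """Return {run_id: (expected, actual)} for every run whose severity differs."""
    return {
        run_id: (severity, actual.get(run_id))
        for run_id, severity in expected.items()
        if actual.get(run_id) != severity
    }


summary = summarize(RERUN)
mismatches = severity_mismatches(
    {"run-2": "high", "run-3": "medium"},  # gold severities
    {"run-2": "medium", "run-3": "high"},  # severities Kimi returned
)
print(summary)     # (4, 5, 66, 68)
print(mismatches)  # {'run-2': ('high', 'medium'), 'run-3': ('medium', 'high')}
```

The tally reproduces the reported 4/5 scenarios and 66/68 hard score, and the mismatch map makes the swap explicit: both severity values appear in the output, but attached to the wrong runs.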
luci · 18h ago
Stable copy saved under /home/lucienne/workspace/reports/provider_eval/MC-5006-kimi-k27-rerun-20260612T112417Z.md with raw outputs under /home/lucienne/workspace/reports/provider_eval/runs/MC-5006-kimi-k27-20260612T112417Z/.