
MC-5006

Evaluate non-Claude providers for scheduled tasks
2026-06-13 07:37:59 SAST

State: Done · Next Action: Closed · Owner: Luci · Runtime: Closed · Age: 1d ago
Ticket is done; runtime is closed. · profile claude_opus_1m_high

Description

Build and run a small, repeatable model-routing eval for Luci scheduled-task workloads.

Goal: Compare GLM, MiniMax, Kimi, Codex, and agy on the types of scheduled tasks Luci actually runs, so Elmar can decide long-term routing without relying on gut feel. Keep Anthropic as a baseline only where quota allows; do not consume Anthropic unnecessarily before the F1 window finishes.

Context:
- Existing smoke test only checks provider liveness: scripts/provider_smoke_test.py.
- Need behavior/quality eval, not just PONG.
- User specifically wants scheduled-task types, not generic benchmarks.
- Memory extractor already uses agy via PKA_EXTRACTION_PROVIDER=agy; include that task class but do not mislabel it as Claude.

Candidate task classes / scenarios:
1. Life Manager triage scan: classify incoming email/WhatsApp snippets into urgent, FYI, action, ignore; score precision/recall against a hand-labeled fixture.
2. Life Manager digest: produce a concise Telegram-ready digest from fixture events; score completeness, brevity, actionability, no hallucinated items.
3. Memory extraction: extract durable memories from short session transcripts; score keep/drop correctness and no stale/task-progress memories.
4. Support intake / ticket routing: turn inbound support text into ticket/ignore/escalate decisions; score routing and required fields.
5. Scheduled ops/watchdog summarization: read synthetic logs/task history and decide alert vs silent; score false positives/negatives.
6. Code-review council lite: review a small fixture diff; score whether known seeded bugs are caught.
7. Research/digest synthesis: summarize a source pack into a short brief; score citation grounding and decision usefulness.

Provider runners to support:
- GLM via claude-provider-env.sh glm / claude CLI profile where valid.
- MiniMax via claude-provider-env.sh minimax MiniMax-M3.
- Kimi via Kimi direct helper where appropriate, and optionally Claude Code harness if known-good.
- Codex via CLI/API runner for coding/review scenarios.
- agy via agy -p for extraction/digest-style scenarios.
- Optional Anthropic baseline after F1 safety window or on a tiny sample only.

Eval design:
- Use fixtures committed under a repo-appropriate path, e.g. tests/provider_eval/fixtures/.
- For each scenario, store: prompt, input fixture, gold labels or rubric, max latency, allowed output schema.
- Capture: output, latency, exit status, approximate cost/quota label, parse success.
- Score hard assertions first: JSON/schema validity, required fields, no invented IDs, exact labels where possible.
- Score soft rubrics second, preferably with a non-tested judge or human review summary.
- Produce a Markdown report with per-provider recommendation by task class: default / acceptable fallback / avoid.

Acceptance:
- Harness can run a small eval suite locally on Luci without altering live scheduled tasks.
- Run at least one representative scenario from each of: triage, digest, memory extraction, ops/watchdog, code review.
- Produce a report with a routing matrix for GLM, MiniMax, Kimi, Codex, agy, and any Anthropic baseline used.
- Do not run expensive Anthropic baselines before F1 prediction safety unless explicitly approved.
- Link report in ticket comments and recommend any runtime_profile changes separately; do not auto-change production tasks from eval results without review.
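
Illustrative sketch only, not the harness that was built: one possible shape for the per-scenario record, the hard-assertion pass, and the triage precision/recall scoring described under Eval design. Every name here (Scenario, hard_assertions, precision_recall, the field names, the item IDs m1..m3) is hypothetical, and it assumes a provider returns a flat JSON object mapping item ID to label, which the ticket does not specify.

```python
import json
from dataclasses import dataclass

# Labels taken from the triage scenario in the ticket.
TRIAGE_LABELS = ("urgent", "FYI", "action", "ignore")


@dataclass
class Scenario:
    """One eval scenario; field names are illustrative, not from the repo."""
    name: str
    prompt: str
    input_fixture: str
    gold_labels: dict            # item ID -> expected label
    max_latency_s: float
    allowed_labels: tuple = TRIAGE_LABELS


def hard_assertions(scenario, raw_output, latency_s, exit_status):
    """Return a list of hard-assertion failures; an empty list means pass."""
    failures = []
    if exit_status != 0:
        failures.append("nonzero exit status")
    if latency_s > scenario.max_latency_s:
        failures.append("latency over limit")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return failures + ["output is not a JSON object"]
    for item_id, label in parsed.items():
        if item_id not in scenario.gold_labels:
            failures.append(f"invented ID: {item_id}")
        if label not in scenario.allowed_labels:
            failures.append(f"label outside schema: {label}")
    for item_id in scenario.gold_labels:
        if item_id not in parsed:
            failures.append(f"missing required item: {item_id}")
    return failures


def precision_recall(gold, predicted, label):
    """Precision and recall for one label, computed over the gold item IDs."""
    tp = sum(1 for k, v in gold.items() if v == label and predicted.get(k) == label)
    fp = sum(1 for k, v in gold.items() if v != label and predicted.get(k) == label)
    fn = sum(1 for k, v in gold.items() if v == label and predicted.get(k) != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up fixture: the provider over-flags m2 as urgent.
scenario = Scenario(
    name="triage_scan_demo",
    prompt="Classify each snippet as urgent, FYI, action, or ignore.",
    input_fixture="(fixture text)",
    gold_labels={"m1": "urgent", "m2": "ignore", "m3": "urgent"},
    max_latency_s=30.0,
)
raw = json.dumps({"m1": "urgent", "m2": "urgent", "m3": "urgent"})
print(hard_assertions(scenario, raw, latency_s=4.2, exit_status=0))
print(precision_recall(scenario.gold_labels, json.loads(raw), "urgent"))
```

Running the hard assertions before any soft rubric means a provider that fails to parse or invents IDs is flagged without spending judge or human-review time on it; in the made-up fixture the output passes the hard checks, but precision on "urgent" drops because of the over-flagged item.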

Activity

done
No activity yet