a7e7921dbc0a593027f40b571861f50a71221aec (2026-05-08, "fix(tui): trim markdown wrap spaces")

/tmp/hermes-scan/

All file:line refs below are paths under /tmp/hermes-scan/.
hermes_cli/kanban_db.py, 4482 LOC) is the closest analogue to MC and is much more rigorous about claim safety, run history, circuit breaking, and crash detection. There are 5–6 surgical patterns we should absorb without rewriting MC.

- task_runs history table + atomic claim_task CAS + claim_lock/claim_expires columns. MC's tickets table has worker_pid but no claim TTL, no per-attempt run rows, and no CAS. That's the gap that lets stale worker rows accumulate.
- _record_task_failure() circuit breaker with consecutive_failures, per-task max_retries override, and protocol_violation detection (worker exits clean rc=0 without calling complete/block). MC currently has failure_reason text but no auto-block. Adding a counter + threshold is ~30 LOC.
- enforce_max_runtime() does SIGTERM → 5 s grace → SIGKILL on any per-task max_runtime_seconds. MC has nothing here; tasks can wedge indefinitely. Good fit for mc_pickup.py's reaper loop.
- _handle_polling_conflict() in the Telegram adapter is exactly the 409 handling ccgram needs: 3 retries × 10 s with _drain_polling_connections() between attempts before going fatal, and it explicitly names the colliding process (OpenClaw or another Hermes). Cleaner than ccgram's current "die on first 409" behavior.
- auto-skill-evolver runs hot in-session; the Hermes design is gentler and produces audit artifacts. Worth aligning.

task_runs history table + current_run_id pointer

Refs: hermes_cli/kanban_db.py:677–725 (TaskRun dataclass), :822–857 (CREATE TABLE task_runs), :1838–1865 (insert run on claim), :1051–1100 (CAS-guarded sync of current_run_id).
Hermes splits "the task" (one logical unit) from "an attempt to run it" (a row in task_runs). Each claim_task inserts a fresh task_runs row with its own claim_lock, claim_expires, worker_pid, started_at, last_heartbeat_at, and on completion writes outcome ∈ {done, gave_up, reclaimed, crashed, timed_out, spawn_failed} plus a structured summary for the next worker in the chain. The tasks.current_run_id pointer is updated under a CAS so a racing claim can't corrupt it.
How it'd work in MC: add a ticket_runs table mirroring the existing task_runs (which today is a scheduler concept, not a ticket concept). Each MC worker spawn becomes a row. Ticket state flips to in_progress only when a ticket_runs row exists; on retry we get a new row instead of overwriting worker_pid/worker_started. The dashboard's "session history" section becomes a trivial SELECT * FROM ticket_runs WHERE ticket_id=? ORDER BY started_at DESC — today MC has to grep logs.
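The per-attempt split described above can be sketched as follows. This is a minimal illustration, assuming SQLite via Python's sqlite3; the exact column set, the `current_run_id` column on tickets, and the helpers `start_run` and `run_history` are hypothetical names for this sketch, not MC's or Hermes' actual schema.

```python
import sqlite3
import time

# Hypothetical schema: one row per attempt, plus a pointer on the ticket.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tickets (
    id             INTEGER PRIMARY KEY,
    status         TEXT NOT NULL DEFAULT 'pending',
    current_run_id INTEGER
);
CREATE TABLE IF NOT EXISTS ticket_runs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    ticket_id         INTEGER NOT NULL REFERENCES tickets(id),
    worker_pid        INTEGER,
    started_at        INTEGER NOT NULL,  -- epoch seconds (assumption)
    last_heartbeat_at INTEGER,
    ended_at          INTEGER,
    outcome           TEXT,  -- done, gave_up, reclaimed, crashed, timed_out, spawn_failed
    summary           TEXT
);
"""


def start_run(conn, ticket_id, worker_pid, now=None):
    """Insert a fresh run row and point the ticket at it, in one transaction."""
    now = int(time.time()) if now is None else now
    with conn:
        cur = conn.execute(
            "INSERT INTO ticket_runs (ticket_id, worker_pid, started_at) "
            "VALUES (?, ?, ?)",
            (ticket_id, worker_pid, now),
        )
        run_id = cur.lastrowid
        conn.execute(
            "UPDATE tickets SET status='in_progress', current_run_id=? WHERE id=?",
            (run_id, ticket_id),
        )
    return run_id


def run_history(conn, ticket_id):
    """Newest attempt first; this is the dashboard's session history query."""
    return conn.execute(
        "SELECT id, worker_pid, started_at, outcome FROM ticket_runs "
        "WHERE ticket_id=? ORDER BY started_at DESC",
        (ticket_id,),
    ).fetchall()
```

A retry calls `start_run` again and gets a new row, so earlier attempts stay queryable instead of being overwritten.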
Refs: hermes_cli/kanban_db.py:1780–1866 (claim_task), :1118–1133 (write_txn IMMEDIATE wrapper).
The whole claim is one statement:
    UPDATE tasks
       SET status='running', claim_lock=?, claim_expires=?,
           started_at=COALESCE(started_at, ?)
     WHERE id=? AND status='ready' AND claim_lock IS NULL
If cur.rowcount != 1, the claimer lost the race and bails. WAL + BEGIN IMMEDIATE guarantees one writer at a time; the claim_lock IS NULL guard makes the SQL itself the lock. There's no advisory file lock, no flock, no Redis.
How it'd work in MC: today mc_pickup.py:1956 (pending -> claimed) doesn't appear to use a single-statement CAS — if multiple mc_pickup invocations ever overlap (e.g. cron tick + manual restart) two workers can claim the same ticket. Replace the pickup query with the same CAS pattern, plus add claim_lock TEXT and claim_expires INTEGER columns to tickets. ~10 lines including the migration.
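A minimal sketch of what the CAS pickup could look like on the MC side, assuming a `tickets` table with the proposed `claim_lock` and `claim_expires` columns and the `pending`/`claimed` status names mentioned above. The helper name `claim_ticket` and the use of a random token as the lock value are assumptions for illustration.

```python
import sqlite3
import time
import uuid


def claim_ticket(conn, ticket_id, ttl_seconds=15 * 60, now=None):
    """Try to claim one ticket. Returns the claim token, or None if we lost the race."""
    now = int(time.time()) if now is None else now
    token = uuid.uuid4().hex
    with conn:
        cur = conn.execute(
            """
            UPDATE tickets
               SET status='claimed', claim_lock=?, claim_expires=?
             WHERE id=? AND status='pending' AND claim_lock IS NULL
            """,
            (token, now + ttl_seconds, ticket_id),
        )
    # rowcount is 1 only for the single caller whose UPDATE matched the guard.
    return token if cur.rowcount == 1 else None
```

Two overlapping pickups can both run this statement; only one sees `rowcount == 1`, and the other simply moves on to the next ticket.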
Refs: hermes_cli/kanban_db.py:1869–1897 (heartbeat_claim), :1900–1944 (release_stale_claims), :98 (DEFAULT_CLAIM_TTL_SECONDS = 15*60).
15-minute claim TTL by default. Any worker that lives longer must call heartbeat_claim() to extend its lock; otherwise the next dispatcher tick reclaims the task back to ready with a reclaimed event and (importantly) sends SIGTERM/SIGKILL to the orphaned PID first (_terminate_reclaimed_worker). This is the missing piece in MC — we have heartbeats elsewhere but no contract that a stalled worker gets reaped after N minutes.
How it'd work in MC: add claim_expires + last_heartbeat_at to tickets. Run release_stale_claims() in mc_pickup.py's tick (every 60 s already). Workers call a heartbeat() helper every couple of minutes. Long-running tickets that legitimately need >15 min just declare a longer TTL on claim (ttl_seconds= arg). Today, when a worker dies silently, the ticket stays in_progress forever and we discover it manually.
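The heartbeat and reclaim pair might look roughly like this. It is a sketch under assumptions: timestamps are stored as epoch seconds, released tickets go back to `pending`, and terminating the orphaned PIDs is left to the caller (Hermes does that inside the reclaim itself). The function names mirror the Hermes ones but the bodies are not taken from Hermes.

```python
import sqlite3
import time


def heartbeat(conn, ticket_id, claim_lock, ttl_seconds=15 * 60, now=None):
    """Extend the claim if we still hold it. Returns False if the claim is gone."""
    now = int(time.time()) if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE tickets SET claim_expires=?, last_heartbeat_at=? "
            "WHERE id=? AND claim_lock=? AND status='in_progress'",
            (now + ttl_seconds, now, ticket_id, claim_lock),
        )
    return cur.rowcount == 1


def release_stale_claims(conn, now=None):
    """Return expired tickets to 'pending'. Returns [(ticket_id, worker_pid), ...]
    so the caller can terminate the orphaned workers."""
    now = int(time.time()) if now is None else now
    with conn:
        stale = conn.execute(
            "SELECT id, worker_pid FROM tickets "
            "WHERE status='in_progress' AND claim_expires IS NOT NULL "
            "AND claim_expires < ?",
            (now,),
        ).fetchall()
        for ticket_id, _pid in stale:
            conn.execute(
                "UPDATE tickets SET status='pending', claim_lock=NULL, "
                "claim_expires=NULL, worker_pid=NULL "
                "WHERE id=? AND claim_expires < ?",
                (ticket_id, now),
            )
    return stale
```

A worker whose `heartbeat` call returns False has lost its claim and should stop working on the ticket.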
_classify_worker_exit

Refs: hermes_cli/kanban_db.py:3104–3225 (detect_crashed_workers), :3146 (_classify_worker_exit).
Each tick: for every running ticket on this host, check if worker_pid is alive. If not, classify why (clean rc=0 vs nonzero vs signal). Crucial detail at :3147–3162: a worker that exited cleanly (rc=0) while the task is still running is treated as a protocol violation, i.e. the LLM "answered conversationally" without calling the terminal tool. This trips the breaker on the FIRST occurrence (failure_limit=1), not on the Nth. Without this, a worker whose CLI keeps exiting 0 without calling kanban_complete would be retried forever.
How it'd work in MC: MC's worker subprocess design is similar — Claude CLI exits, ticket stays in_progress. The same detection applies. This is independent of the TTL reclaim above (faster detection); both should run in the same tick.
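The classification step can be illustrated with a small pure function. This is a sketch assuming the worker is launched through Python's subprocess module, where a negative returncode means the child was killed by that signal; the function name and the returned labels are hypothetical, not Hermes' actual values.

```python
import signal


def classify_worker_exit(returncode, ticket_still_running):
    """Map a finished worker's returncode to a failure kind, or None if all is well."""
    if not ticket_still_running:
        # The worker reported complete/block before exiting: nothing to flag.
        return None
    if returncode == 0:
        # Clean exit, but the ticket was never closed out by the worker.
        return "protocol_violation"
    if returncode < 0:
        # subprocess convention: -N means "killed by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal={-returncode}"
        return f"crashed:{name}"
    return f"crashed:rc={returncode}"
```

The `protocol_violation` result is the one that should go to the breaker with a limit of 1, per the behavior described above.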
max_retries per-ticket override

Refs: hermes_cli/kanban_db.py:3231–3401 (_record_task_failure), :3424–3439 (_clear_failure_counter).
One function handles all non-success outcomes (spawn_failed / crashed / timed_out / protocol_violation). Increments consecutive_failures; if >= effective_limit, flips ticket → blocked with last_failure_error populated and emits a gave_up event. Resolution order for the threshold:
1. max_retries if set (the task says "try me 5 times")
2. failure_limit (config default)
3. DEFAULT_FAILURE_LIMIT (3)

Counter is cleared only on successful completion, not on successful spawn. Comment at :3623: "A successful spawn proves the worker can start but doesn't prove the run will succeed."
How it'd work in MC: MC has no breaker today — a busted ticket can be re-picked indefinitely. Add consecutive_failures INTEGER DEFAULT 0 and max_retries INTEGER to tickets, route every failure path through one helper, auto-flip to a new blocked status (or use existing waiting) at threshold. Telegram gave_up notification fires immediately so Elmar sees "ticket X blocked after 3 failures".
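A minimal sketch of the single failure helper, assuming the proposed `consecutive_failures`, `max_retries`, and `last_failure_error` columns on tickets and a `blocked` status. The function names and the choice to return the ticket to `pending` below the threshold are assumptions; the notification hook is left out.

```python
import sqlite3

DEFAULT_FAILURE_LIMIT = 3


def record_failure(conn, ticket_id, error, failure_limit=None):
    """Count one failure and block the ticket at the threshold. Returns True if blocked."""
    with conn:
        conn.execute(
            "UPDATE tickets SET consecutive_failures = consecutive_failures + 1, "
            "last_failure_error=? WHERE id=?",
            (error, ticket_id),
        )
        failures, max_retries = conn.execute(
            "SELECT consecutive_failures, max_retries FROM tickets WHERE id=?",
            (ticket_id,),
        ).fetchone()
        # Resolution order: per-ticket override, then config value, then default.
        if max_retries is not None:
            limit = max_retries
        elif failure_limit is not None:
            limit = failure_limit
        else:
            limit = DEFAULT_FAILURE_LIMIT
        blocked = failures >= limit
        conn.execute(
            "UPDATE tickets SET status=? WHERE id=?",
            ("blocked" if blocked else "pending", ticket_id),
        )
    return blocked


def clear_failures(conn, ticket_id):
    """Call only on successful completion, never on successful spawn."""
    with conn:
        conn.execute(
            "UPDATE tickets SET consecutive_failures=0, last_failure_error=NULL "
            "WHERE id=?",
            (ticket_id,),
        )
```

When `record_failure` returns True, the caller is the natural place to emit the gave_up notification.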
max_runtime_seconds with SIGTERM → SIGKILL

Refs: hermes_cli/kanban_db.py:2976–3086 (enforce_max_runtime), :3089–3101 (set_max_runtime).
For each running ticket with max_runtime_seconds set, if now - started_at > limit: SIGTERM the worker pid, poll 10× 0.5 s, SIGKILL if still alive, write a timed_out outcome with {pid, elapsed, limit, sigkill} payload, drop ticket back to ready. Importantly, runtime is measured from task_runs.started_at not tasks.started_at — so retries get a fresh budget per attempt.
How it'd work in MC: scheduled tasks already have a timeout concept in ~/workspace/tasks/, but ad-hoc MC tickets don't. Adding max_runtime_seconds per ticket lets us cap the long-tail tickets that hang on a remote SSH or a stuck Claude CLI session. ~40 LOC including the kill ladder.
waitpid(-1, WNOHANG))

Refs: hermes_cli/kanban_db.py:3527–3538.
Without this, every _default_spawn'd worker that finishes becomes a <defunct> zombie because the dispatcher (gateway-embedded) is the parent and never waitpid()s. They linger until gateway exit. Hermes runs a WNOHANG reap loop on every tick.
How it'd work in MC: mc_pickup.py is the parent of every ticket worker. If we ever hit ulimit -u on Hetzner it'll be from accumulated zombies. Worth adding the 3-line reap loop now, before it bites.
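For reference, a non-blocking reap loop of the kind described can be sketched like this, assuming a POSIX host and Python 3.9+ (for `os.waitstatus_to_exitcode`). The function name `reap_children` is hypothetical.

```python
import os


def reap_children():
    """Collect every already-exited child without blocking.
    Returns [(pid, exit_code), ...]; exit_code is negative for signal deaths."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

One caveat for this design: a blanket `waitpid(-1, ...)` also collects children started through subprocess.Popen, so code that later reads those objects' return codes should rely on the values returned here instead.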
recompute_ready parent-link scheduling

Refs: hermes_cli/kanban_db.py:1747–1773, task_links table at :799.
task_links(parent_id, child_id) is a DAG. todo → ready only when all parents are done. Every dispatcher tick re-evaluates promotions.
How it'd work in MC: we don't have ticket dependencies today. The "workflow children" auto-spawn that was disabled (per ~/workspace/CLAUDE.md) was a different (worse) shape — it spawned phantom child tickets with no completion semantics. A ticket_links table with parent/child + this 25-line recompute_ready would give us real dependencies without bringing back the WORKFLOW_TEMPLATES anti-pattern. Optional — only adopt if we actually want DAG tickets.
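The promotion rule can be expressed as one UPDATE with a correlated subquery. This is a sketch, assuming a hypothetical `ticket_links(parent_id, child_id)` table and the Hermes-style status names `todo`, `ready`, and `done`; it is not the Hermes implementation.

```python
import sqlite3


def recompute_ready(conn):
    """Promote todo tickets whose parents are all done. Returns how many were promoted.
    A ticket with no parents qualifies trivially."""
    with conn:
        cur = conn.execute(
            """
            UPDATE tickets
               SET status='ready'
             WHERE status='todo'
               AND NOT EXISTS (
                     SELECT 1
                       FROM ticket_links l
                       JOIN tickets p ON p.id = l.parent_id
                      WHERE l.child_id = tickets.id
                        AND p.status != 'done'
                   )
            """
        )
    return cur.rowcount
```

Running it on every tick keeps the logic stateless: no completion hook has to remember which children to wake.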
auto-skill-evolver

Refs: agent/curator.py (1674 LOC), agent/skill_commands.py, tools/skill_provenance.py, tools/skill_manager_tool.py:713–790.
Architectural differences from our auto-skill-evolver:
| Aspect | Hermes curator | Our auto-skill-evolver |
|---|---|---|
| Trigger | Inactivity + interval (default 7 days, min_idle_hours=2); agent/curator.py:1656 maybe_run_curator | Every ~25 tool calls via PreToolUse hook |
| Scope | Only touches agent-created skills (tools/skill_provenance.py:75 is_background_review); user-written skills are immune | No origin tracking; can edit anything |
| Mode | LLM runs in a forked AIAgent with the auxiliary client (cheap model), not on main session prompt cache (agent/curator.py:19) | Runs inline using a cheap LLM call but no fork isolation |
| Output | Writes a per-run REPORT.md with before/after diff (:1414); auditable | Edits a skill file; commit message is the only audit |
| Lifecycle | States: active → stale (30d unused) → archived (90d unused). Archive is recoverable, never auto-deletes (:17) | Binary edit; no lifecycle |
| Consolidation prompt | Explicitly tells the model to build umbrella skills and absorb related ones, with absorbed_into pointer for traceability (:723, :594) | One-shot create-or-update |
| Snapshot | Pre-run snapshot of all skills (:1320 snapshot_skills(reason="pre-curator-run")) so a botched curator run is reversible | None; relies on git |

Concrete diffs we should consider for auto-skill-evolver:
- min_idle_hours gate). The current cadence runs when the agent is busy, which is exactly when we don't want a side-fork chewing tokens. ~agent/curator.py:1666–1670 is the gate.
- tools/skill_provenance.py is 79 lines and gives auto-skill-evolver a hard guarantee it only modifies skills it (or its ancestors) created. We've had the bug where evolver patches user-curated skills.
- absorbed_into pointer when deleting. When we consolidate skill A into B, record B's name on A's tombstone so we can find where things went later. Hermes does this at tools/skill_manager_tool.py:723.
- hermes curator status points at.

Refs: gateway/platforms/telegram.py:604–661 (the 409 handler), :438–477 (_drain_polling_connections), :481–602 (_handle_polling_network_error), gateway/channel_directory.py.
ccgram currently dies on 409 Conflict. Hermes' design is meaningfully better and is a near-drop-in for ~/workspace/ccgram/:
_handle_polling_conflict (telegram.py:604):
- Increments _polling_conflict_count. While ≤ 3: stop the updater, sleep 10 s, drain the httpx connection pool used for getUpdates, restart start_polling with drop_pending_updates=False. Reset counter on success.
- On exhaustion: sets a typed fatal error telegram_polling_conflict with a human-readable message that explicitly names the most likely culprit ("possibly OpenClaw or another Hermes instance"). Notifies via _notify_fatal_error().
The drained-pool detail at :525–528 is key. PTB's underlying httpx connections will silently keep the long-poll session open server-side even after updater.stop() returns; without the drain, the very next start_polling immediately hits 409 again. ccgram likely has the same problem, hidden inside our 409 reproductions.
ccgram diff: wrap the polling loop in this same retry shape. Move the "die immediately" path to "die after 3 × 10 s with explicit message". The RTM/CLAUDE.md note about mc-telegram-bridge collisions becomes a single-line log warning instead of an outage.
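The retry shape can be sketched independently of the Telegram library. In the sketch below the `stop`, `drain`, and `start` callables stand in for whatever ccgram uses to stop the updater, close pooled connections, and restart polling; they, the class name `ConflictRetrier`, and the exception `PollingConflictFatal` are hypothetical and do not come from python-telegram-bot or Hermes.

```python
import time


class PollingConflictFatal(RuntimeError):
    """Raised when 409 Conflict persists after all retries."""


class ConflictRetrier:
    def __init__(self, stop, drain, start, max_retries=3,
                 delay_seconds=10.0, sleep=time.sleep):
        self.stop = stop
        self.drain = drain
        self.start = start
        self.max_retries = max_retries
        self.delay_seconds = delay_seconds
        self.sleep = sleep
        self.conflict_count = 0

    def on_conflict(self):
        """Call when getUpdates returns 409. Retries, or raises once exhausted."""
        self.conflict_count += 1
        if self.conflict_count > self.max_retries:
            raise PollingConflictFatal(
                "Telegram polling conflict (409): another process appears to be "
                "polling with the same bot token"
            )
        self.stop()                      # stop the updater
        self.sleep(self.delay_seconds)   # give the other poller's request time to end
        self.drain()                     # close lingering long-poll connections
        self.start()                     # restart polling

    def on_success(self):
        """Call after a successful poll so old conflicts do not accumulate."""
        self.conflict_count = 0
```

Injecting `sleep` keeps the handler testable without waiting 30 seconds per run.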
_handle_polling_network_error (:481–602) handles the transient case (DNS hiccup, TLS timeout) with exponential-ish backoff and a follow-up _verify_polling_after_reconnect probe that confirms getMe returns within HEARTBEAT_PROBE_DELAY. Distinguishing transient network failure from auth/conflict failure is the kind of thing we keep half-implementing.
gateway/channel_directory.py builds and refreshes (every 5 min) a JSON map of every reachable channel/contact across all platforms, written atomically to ~/.hermes/channel_directory.json. Our notify.py uses hardcoded chat IDs; this pattern would let ccgram answer "what topics/channels can I post to?" without an API round-trip every send.
ccgram diff: small — add a refresh task that snapshots known chat_id/topic_id pairs to JSON. Resolves "how do I post to a thread by name?" naturally.
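The atomic-write part of that pattern is worth getting right. A minimal sketch, assuming the snapshot is a JSON-serializable dict and the target directory is writable; `write_json_atomic` is a hypothetical helper name, and the directory layout is up to ccgram.

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The sender side then only needs `json.load` on the snapshot file to resolve a topic name to its IDs.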
BEGIN IMMEDIATE everywhere writes happen

hermes_cli/kanban_db.py:1118–1133 write_txn(). Every write goes through this context manager. With a plain (deferred) BEGIN, SQLite takes the write lock only at the first write statement, so under WAL a transaction that reads first and then writes can fail with SQLITE_BUSY when another writer commits in between. MC's app.py uses raw sqlite3 connections; not all writes are wrapped. Worth a 30-min sweep to make every multi-statement write go through one helper.
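A helper of this kind is only a few lines. The sketch below assumes Python's sqlite3 with `isolation_level=None`, so the module does not open transactions implicitly and the helper controls BEGIN/COMMIT itself. The name `write_txn` mirrors the Hermes function, but the body is an illustration, not a copy; `connect` is a hypothetical convenience.

```python
import sqlite3
from contextlib import contextmanager


def connect(path):
    """Open a connection in autocommit mode; transactions are started explicitly."""
    conn = sqlite3.connect(path, isolation_level=None, timeout=30.0)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


@contextmanager
def write_txn(conn):
    """Take the write lock up front, commit on success, roll back on any error."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

Usage is `with write_txn(conn): ...` around every multi-statement write, which makes unwrapped writes easy to spot in review.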
hermes_cli/kanban_db.py:1147 "Callers that care about idempotency should pass idempotency_key to create_task rather than rely on id uniqueness." Hermes uses a 4-byte random task id (2^32 ≈ 4.3B space; by the birthday bound that is about 1.2 expected colliding pairs across 100k ids, so collisions are not negligible at that scale) and explicitly tells callers to pass an idempotency key for dedup. MC has the recurring "duplicate task" problem (per CLAUDE.md key rule #6); a tickets.idempotency_key UNIQUE column with INSERT … ON CONFLICT DO NOTHING would mechanize the rule. ~5 LOC + migration.
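A sketch of the dedup insert, assuming a `tickets` table with a UNIQUE `idempotency_key` column and SQLite 3.24 or newer (needed for the ON CONFLICT upsert clause). The helper name `create_ticket` and the `title` column are hypothetical.

```python
import sqlite3


def create_ticket(conn, title, idempotency_key):
    """Insert at most one ticket per key. Returns (ticket_id, created)."""
    with conn:
        cur = conn.execute(
            "INSERT INTO tickets (title, idempotency_key) VALUES (?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (title, idempotency_key),
        )
        created = cur.rowcount == 1
        row = conn.execute(
            "SELECT id FROM tickets WHERE idempotency_key=?",
            (idempotency_key,),
        ).fetchone()
    return row[0], created
```

A caller that files the same ticket twice gets the original id back with `created` set to False, instead of a duplicate row.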
has_spawnable_ready() to distinguish "stuck" from "correctly idle"

hermes_cli/kanban_db.py:3446–3475. Health telemetry needs to distinguish "0 spawned because nothing's ready" from "0 spawned because something's ready but no profile can spawn it". Exactly the question MC's heartbeat dashboard sometimes can't answer. Borrow the pattern: a ticket assigned to a non-existent worker_role gets bucketed as skipped_nonspawnable, not idle.
agent/curator.py:1320–1329 snapshot_skills(reason="pre-curator-run"). Best-effort; never blocks the run; logged at debug if it fails. Pattern: any automated mutator that touches >1 file takes a tarball snapshot first. Cheap insurance for auto-skill-evolver and any future ticket-bulk-edit feature.
hermes_cli/goals.py:18 "Judge failures are fail-OPEN: continue. A broken judge must not wedge progress; the turn budget is the backstop."
vs.
hermes_cli/kanban_db.py:3301 (circuit breaker is fail-CLOSED — trip on threshold).
The asymmetry is deliberate and worth codifying: anything advisory (LLM-as-judge, classifier, advisory webhook) should fail open with a hard backstop; anything authoritative (failure counter, claim CAS) should fail closed. We sometimes get this backwards — e.g. a transient classifier error blocks a ticket.
- run_agent.py (~12k LOC) and cli.py (~11k LOC). Massive monolith conversation loops with provider adapters, prompt cache, compression, etc. We don't run our own LLM loop; we shell out to claude/Codex/etc. Out of scope.
- plugins/kanban/systemd/hermes-kanban-dispatcher.service). MC already has its own dispatch model (scheduler.py tick from cron + mc_pickup.py); merging into a single long-running process would be a rewrite, not a borrow.
- agent/credential_pool.py (66.9 K) and agent/anthropic_adapter.py (86 K) and friends. Provider-key rotation, slot-aware credential routing. We use provider-switch and the Claude wrapper for this; their model is heavier and tuned for paying-customer multi-account setups.
- tinker-atropos/, tui_gateway/server.py (231 K), ui-tui/. Their TUI/training stack. Out of scope.
- hermes_cli/kanban_specify.py). Calls a cheap LLM to flesh out a one-line ticket into goal/approach/acceptance criteria. Tempting, but Lucienne already does this on the Mac side when filing tickets to MC. Adding a second AI inflater on the Hetzner side would cause drift. Skip.

If we do exactly one thing from this scan:
1. Add claim_lock TEXT, claim_expires INTEGER, consecutive_failures INTEGER DEFAULT 0, max_retries INTEGER, last_heartbeat_at TEXT, last_failure_error TEXT columns to tickets.
2. Add a ticket_runs table mirroring kanban's task_runs shape.
3. Replace mc_pickup.py's pickup query with the single-statement CAS.
4. Add release_stale_claims(), detect_crashed_workers(), enforce_max_runtime(), _record_task_failure() ports of the kanban_db functions, calling them from the existing tick.
5. Route gave_up events to a Telegram notification.

That's roughly 200–250 LOC plus a migration, and gets us atomic claims, TTL reclaim, crash detection, runtime caps, circuit breaking, and a usable run history: basically everything MC is currently missing on the worker-lifecycle side. Everything else in this report is optional after that.