Hermes Agent code scan — patterns to borrow into Mission Control

Repo: https://github.com/NousResearch/hermes-agent
Commit: a7e7921dbc0a593027f40b571861f50a71221aec (2026-05-08, "fix(tui): trim markdown wrap spaces")
Tag context: RELEASE_v0.13.0.md ("Tenacity") — tip is one commit past v0.13.0
Scan date: 2026-05-09
Local clone: /tmp/hermes-scan/

All file:line refs below are paths under /tmp/hermes-scan/.

1. Executive summary

Hermes Kanban (hermes_cli/kanban_db.py, 4482 LOC) is the closest analogue to MC and is much more rigorous about claim safety, run history, circuit breaking, and crash detection. There are 5–6 surgical patterns we should absorb without rewriting MC.
The single most valuable borrow is the task_runs history table + atomic claim_task CAS + claim_lock/claim_expires columns. MC's tickets table has worker_pid but no claim TTL, no per-attempt run rows, and no CAS. That's the gap that lets stale worker rows accumulate.
Hermes has a unified _record_task_failure() circuit breaker with consecutive_failures, per-task max_retries override, and protocol_violation detection (worker exits clean rc=0 without calling complete/block). MC currently has failure_reason text but no auto-block. Adding a counter + threshold is ~30 LOC.
enforce_max_runtime() does SIGTERM → 5 s grace → SIGKILL on any per-task max_runtime_seconds. MC has nothing here; tasks can wedge indefinitely. Good fit for mc_pickup.py's reaper loop.
Hermes' _handle_polling_conflict() in the Telegram adapter is exactly the 409 handling ccgram needs: 3 retries × 10 s with _drain_polling_connections() between attempts before going fatal — and it explicitly names the colliding process (OpenClaw or another Hermes). Cleaner than ccgram's current "die on first 409" behavior.
Hermes' "auto-skill" curator is inactivity-triggered, weekly default, runs once and produces a REPORT.md diff — not every-25-tool-calls. Our auto-skill-evolver runs hot in-session; the Hermes design is gentler and produces audit artifacts. Worth aligning.

2. Kanban patterns worth borrowing into MC

2.1 `task_runs` history table + `current_run_id` pointer

Refs: hermes_cli/kanban_db.py:677–725 (TaskRun dataclass), :822–857 (CREATE TABLE task_runs), :1838–1865 (insert run on claim), :1051–1100 (CAS-guarded sync of current_run_id).

Hermes splits "the task" (one logical unit) from "an attempt to run it" (a row in task_runs). Each claim_task inserts a fresh task_runs row with its own claim_lock, claim_expires, worker_pid, started_at, last_heartbeat_at, and on completion writes outcome ∈ {done, gave_up, reclaimed, crashed, timed_out, spawn_failed} plus a structured summary for the next worker in the chain. The tasks.current_run_id pointer is updated under a CAS so a racing claim can't corrupt it.

How it'd work in MC: add a ticket_runs table mirroring the existing task_runs (which today is a scheduler concept, not a ticket concept). Each MC worker spawn becomes a row. Ticket state flips to in_progress only when a ticket_runs row exists; on retry we get a new row instead of overwriting worker_pid/worker_started. The dashboard's "session history" section becomes a trivial SELECT * FROM ticket_runs WHERE ticket_id=? ORDER BY started_at DESC — today MC has to grep logs.
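A minimal sketch of what the ticket_runs table and the history query could look like, using the standard-library sqlite3 module. The column set follows the Hermes fields named above; the column types, the ended_at column, and the helper names start_run and session_history are assumptions for illustration, not existing MC code.

```python
import sqlite3
import time

# Hypothetical MC schema: columns follow the Hermes task_runs fields
# described above; types and the ended_at column are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ticket_runs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    ticket_id         TEXT    NOT NULL,
    claim_lock        TEXT,
    claim_expires     INTEGER,
    worker_pid        INTEGER,
    started_at        INTEGER NOT NULL,
    last_heartbeat_at INTEGER,
    ended_at          INTEGER,
    outcome           TEXT,
    summary           TEXT
);
CREATE INDEX IF NOT EXISTS idx_ticket_runs_ticket
    ON ticket_runs (ticket_id, started_at);
"""


def start_run(conn, ticket_id, worker_pid, now=None):
    """Insert one row per attempt instead of overwriting worker_pid."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "INSERT INTO ticket_runs (ticket_id, worker_pid, started_at) "
        "VALUES (?, ?, ?)",
        (ticket_id, worker_pid, now),
    )
    conn.commit()
    return cur.lastrowid


def session_history(conn, ticket_id):
    """Return (run id, pid, started_at, outcome) tuples, newest first."""
    return conn.execute(
        "SELECT id, worker_pid, started_at, outcome FROM ticket_runs "
        "WHERE ticket_id = ? ORDER BY started_at DESC",
        (ticket_id,),
    ).fetchall()
```

A retry then shows up as a second row with outcome still NULL, while the first row keeps its own pid and outcome.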

2.2 Atomic claim with CAS in a single UPDATE

Refs: hermes_cli/kanban_db.py:1780–1866 (claim_task), :1118–1133 (write_txn IMMEDIATE wrapper).

The whole claim is one statement:

UPDATE tasks
   SET status='running', claim_lock=?, claim_expires=?,
       started_at=COALESCE(started_at, ?)
 WHERE id=? AND status='ready' AND claim_lock IS NULL

If cur.rowcount != 1, the claimer lost the race and bails. BEGIN IMMEDIATE takes SQLite's single write lock at the start of the transaction, so concurrent claimers serialize, and the claim_lock IS NULL guard makes the SQL itself the lock. There's no advisory file lock, no flock, no Redis.

How it'd work in MC: today mc_pickup.py:1956 (pending -> claimed) doesn't appear to use a single-statement CAS — if multiple mc_pickup invocations ever overlap (e.g. cron tick + manual restart) two workers can claim the same ticket. Replace the pickup query with the same CAS pattern, plus add claim_lock TEXT and claim_expires INTEGER columns to tickets. ~10 lines including the migration.
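A sketch of the same CAS shape against a tickets table, assuming the pending and claimed status names from the pickup path and the two new columns. The function name claim_ticket and the uuid-based lock token are illustrative choices, not something taken from Hermes or MC.

```python
import sqlite3
import time
import uuid


def claim_ticket(conn, ticket_id, ttl_seconds=15 * 60, now=None):
    """Try to claim one pending ticket; return the lock token or None.

    Assumes conn was opened with isolation_level=None so that the
    explicit BEGIN IMMEDIATE below is the only transaction control.
    """
    now = int(time.time()) if now is None else now
    token = uuid.uuid4().hex
    conn.execute("BEGIN IMMEDIATE")
    try:
        cur = conn.execute(
            "UPDATE tickets "
            "   SET status='claimed', claim_lock=?, claim_expires=? "
            " WHERE id=? AND status='pending' AND claim_lock IS NULL",
            (token, now + ttl_seconds, ticket_id),
        )
        won = cur.rowcount == 1  # zero rows means another claimer got there first
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return token if won else None
```

Two overlapping pickup invocations calling this for the same ticket get one token and one None; the loser simply moves on to the next ticket.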

2.3 Stale-claim reclaim + heartbeat

Refs: hermes_cli/kanban_db.py:1869–1897 (heartbeat_claim), :1900–1944 (release_stale_claims), :98 (DEFAULT_CLAIM_TTL_SECONDS = 15*60).

15-minute claim TTL by default. Any worker that lives longer must call heartbeat_claim() to extend its lock; otherwise the next dispatcher tick reclaims the task back to ready with a reclaimed event and (importantly) sends SIGTERM/SIGKILL to the orphaned PID first (_terminate_reclaimed_worker). This is the missing piece in MC — we have heartbeats elsewhere but no contract that a stalled worker gets reaped after N minutes.

How it'd work in MC: add claim_expires + last_heartbeat_at to tickets. Run release_stale_claims() in mc_pickup.py's tick (every 60 s already). Workers call a heartbeat() helper every couple of minutes. Long-running tickets that legitimately need >15 min just declare a longer TTL on claim (ttl_seconds= arg). Today, when a worker dies silently, the ticket stays in_progress forever and we discover it manually.
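The heartbeat and reclaim halves could be sketched as below. This is not the Hermes implementation: the status names, the integer timestamps, and the helper names are assumptions, and terminating the orphaned PID before the release is deliberately left out.

```python
import sqlite3
import time


def heartbeat(conn, ticket_id, token, ttl_seconds=15 * 60, now=None):
    """Extend the claim; returns False if the claim is no longer ours."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "UPDATE tickets SET claim_expires=?, last_heartbeat_at=? "
        "WHERE id=? AND claim_lock=?",
        (now + ttl_seconds, now, ticket_id, token),
    )
    conn.commit()
    return cur.rowcount == 1


def release_stale_claims(conn, now=None):
    """Return expired in_progress tickets to pending; list their ids."""
    now = int(time.time()) if now is None else now
    stale = conn.execute(
        "SELECT id FROM tickets "
        "WHERE status='in_progress' AND claim_expires IS NOT NULL "
        "  AND claim_expires < ?",
        (now,),
    ).fetchall()
    released = []
    for (ticket_id,) in stale:
        # Re-check expiry in the UPDATE so a heartbeat that landed
        # after the SELECT wins over the reclaim.
        cur = conn.execute(
            "UPDATE tickets SET status='pending', claim_lock=NULL, "
            "       claim_expires=NULL, worker_pid=NULL "
            "WHERE id=? AND status='in_progress' AND claim_expires < ?",
            (ticket_id, now),
        )
        if cur.rowcount == 1:
            released.append(ticket_id)
    conn.commit()
    return released
```

A worker that gets False back from heartbeat() has lost its claim and should stop rather than keep writing to the ticket.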

2.4 Crashed-worker detection with `_classify_worker_exit`

Refs: hermes_cli/kanban_db.py:3104–3225 (detect_crashed_workers), :3146 (_classify_worker_exit).

Each tick, for every running ticket on this host, check whether worker_pid is alive. If not, classify why (clean rc=0 vs nonzero vs signal). Crucial detail at :3147–3162: a worker that exited cleanly (rc=0) while the task is still running is treated as a protocol violation, i.e. the LLM "answered conversationally" without calling the terminal tool. This trips the breaker on the FIRST occurrence (failure_limit=1), not on the Nth. Without this, a worker whose CLI keeps exiting 0 without calling kanban_complete would be retried forever.

How it'd work in MC: MC's worker subprocess design is similar — Claude CLI exits, ticket stays in_progress. The same detection applies. This is independent of the TTL reclaim above (faster detection); both should run in the same tick.
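The classification step can be sketched as a pure function. This is an illustration, not the Hermes _classify_worker_exit: it assumes the exit status arrives in the subprocess.Popen.returncode convention, and the returned labels are hypothetical.

```python
import signal


def classify_worker_exit(returncode, ticket_status):
    """Map a finished worker's exit status to a failure class.

    returncode follows subprocess.Popen.returncode on POSIX:
    None while running, 0 for a clean exit, negative for a signal.
    """
    if returncode is None:
        return "alive"
    if returncode == 0:
        # Clean exit while the ticket is still open: the worker never
        # called the completion tool, so retrying is unlikely to help.
        if ticket_status == "in_progress":
            return "protocol_violation"
        return "clean"
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = str(-returncode)  # signal number unknown on this platform
        return "signal:" + name
    return "nonzero"
```

The protocol_violation result is the one that would be routed to the breaker with a limit of 1.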

2.5 Unified failure counter + `max_retries` per-ticket override

Refs: hermes_cli/kanban_db.py:3231–3401 (_record_task_failure), :3424–3439 (_clear_failure_counter).

One function handles all non-success outcomes (spawn_failed / crashed / timed_out / protocol_violation). Increments consecutive_failures; if >= effective_limit, flips ticket → blocked with last_failure_error populated and emits a gave_up event. Resolution order for the threshold:

per-task max_retries if set (the task says "try me 5 times")
caller-supplied failure_limit (config default)
DEFAULT_FAILURE_LIMIT (3)

Counter is cleared only on successful completion — not on successful spawn. Comment at :3623: "A successful spawn proves the worker can start but doesn't prove the run will succeed."

How it'd work in MC: MC has no breaker today — a busted ticket can be re-picked indefinitely. Add consecutive_failures INTEGER DEFAULT 0 and max_retries INTEGER to tickets, route every failure path through one helper, auto-flip to a new blocked status (or use existing waiting) at threshold. Telegram gave_up notification fires immediately so Elmar sees "ticket X blocked after 3 failures".
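A sketch of the single failure helper with the resolution order described above. Function names and the choice of pending as the retry status are assumptions; in real use the read and the update should run inside one write transaction.

```python
import sqlite3

DEFAULT_FAILURE_LIMIT = 3


def effective_limit(max_retries, failure_limit):
    """Per-ticket override, then caller default, then global default."""
    if max_retries is not None:
        return max_retries
    if failure_limit is not None:
        return failure_limit
    return DEFAULT_FAILURE_LIMIT


def record_failure(conn, ticket_id, error, failure_limit=None):
    """Bump the counter; block the ticket at the threshold.

    Returns True when the ticket was flipped to blocked, which is the
    point where a gave_up notification would be sent.
    """
    row = conn.execute(
        "SELECT consecutive_failures, max_retries FROM tickets WHERE id=?",
        (ticket_id,),
    ).fetchone()
    if row is None:
        raise KeyError(ticket_id)
    failures = row[0] + 1
    gave_up = failures >= effective_limit(row[1], failure_limit)
    conn.execute(
        "UPDATE tickets SET consecutive_failures=?, last_failure_error=?, "
        "       status=? WHERE id=?",
        (failures, error, "blocked" if gave_up else "pending", ticket_id),
    )
    conn.commit()
    return gave_up


def clear_failures(conn, ticket_id):
    """Call only on successful completion, never on a successful spawn."""
    conn.execute(
        "UPDATE tickets SET consecutive_failures=0 WHERE id=?", (ticket_id,)
    )
    conn.commit()
```

Routing spawn failures, crashes, timeouts, and protocol violations through record_failure() keeps the threshold logic in one place.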

2.6 Per-task `max_runtime_seconds` with SIGTERM → SIGKILL

Refs: hermes_cli/kanban_db.py:2976–3086 (enforce_max_runtime), :3089–3101 (set_max_runtime).

For each running ticket with max_runtime_seconds set, if now - started_at > limit: SIGTERM the worker pid, poll 10× 0.5 s, SIGKILL if still alive, write a timed_out outcome with {pid, elapsed, limit, sigkill} payload, drop ticket back to ready. Importantly, runtime is measured from task_runs.started_at not tasks.started_at — so retries get a fresh budget per attempt.

How it'd work in MC: scheduled tasks already have a timeout concept in ~/workspace/tasks/, but ad-hoc MC tickets don't. Adding max_runtime_seconds per ticket lets us cap the long-tail tickets that hang on a remote SSH or a stuck Claude CLI session. ~40 LOC including the kill ladder.
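A sketch of the kill ladder, assuming the dispatcher still holds the subprocess.Popen handle for the worker (Hermes works from a bare PID; with only a PID, os.kill would replace the Popen calls). The names over_budget and kill_ladder are hypothetical.

```python
import time


def over_budget(run_started_at, max_runtime_seconds, now=None):
    """True when the current attempt has outlived its per-run budget."""
    if max_runtime_seconds is None:
        return False
    now = time.time() if now is None else now
    return now - run_started_at > max_runtime_seconds


def kill_ladder(proc, polls=10, interval=0.5):
    """SIGTERM, poll up to polls * interval seconds, then SIGKILL.

    proc is a subprocess.Popen owned by the dispatcher. Returns True
    if SIGKILL was needed.
    """
    proc.terminate()  # SIGTERM on POSIX
    for _ in range(polls):
        if proc.poll() is not None:
            return False  # worker exited within the grace period
        time.sleep(interval)
    if proc.poll() is not None:
        return False
    proc.kill()  # SIGKILL on POSIX
    proc.wait()  # reap so the worker does not linger as a zombie
    return True
```

Passing the run's started_at (not the ticket's) to over_budget() is what gives each retry a fresh budget.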

2.7 Dispatcher zombie-reaper (`waitpid(-1, WNOHANG)`)

Refs: hermes_cli/kanban_db.py:3527–3538.

Without this, every _default_spawn'd worker that finishes becomes a <defunct> zombie because the dispatcher (gateway-embedded) is the parent and never waitpid()s. They linger until gateway exit. Hermes runs a WNOHANG reap loop on every tick.

How it'd work in MC: mc_pickup.py is the parent of every ticket worker. If we ever hit ulimit -u on Hetzner it'll be from accumulated zombies. Worth adding the 3-line reap loop now, before it bites.
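The reap loop is small enough to sketch in full. This is a generic POSIX pattern, not a copy of the Hermes code; the function name and return shape are assumptions.

```python
import os


def reap_zombies():
    """Collect every already-exited child without blocking.

    Returns a list of (pid, raw wait status) pairs.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```

One caveat: waitpid(-1) also collects children started through subprocess.Popen, after which their Popen.returncode is no longer reliable. If the dispatcher needs exit codes for crash classification, it should take them from this loop.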

2.8 `recompute_ready` parent-link scheduling

Refs: hermes_cli/kanban_db.py:1747–1773, task_links table at :799.

task_links(parent_id, child_id) is a DAG. todo → ready only when all parents are done. Every dispatcher tick re-evaluates promotions.

How it'd work in MC: we don't have ticket dependencies today. The "workflow children" auto-spawn that was disabled (per ~/workspace/CLAUDE.md) was a different (worse) shape — it spawned phantom child tickets with no completion semantics. A ticket_links table with parent/child + this 25-line recompute_ready would give us real dependencies without bringing back the WORKFLOW_TEMPLATES anti-pattern. Optional — only adopt if we actually want DAG tickets.
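If we did adopt it, the promotion step could be one UPDATE. The sketch below assumes a ticket_links(parent_id, child_id) table with non-null columns and the todo, ready, and done status names from the Hermes description; it is not the Hermes recompute_ready itself.

```python
import sqlite3


def recompute_ready(conn):
    """Promote todo tickets whose parents are all done; return the count."""
    cur = conn.execute(
        """
        UPDATE tickets
           SET status = 'ready'
         WHERE status = 'todo'
           AND id NOT IN (
               SELECT l.child_id
                 FROM ticket_links AS l
                 JOIN tickets AS p ON p.id = l.parent_id
                WHERE p.status != 'done'
           )
        """
    )
    conn.commit()
    return cur.rowcount
```

Tickets with no parents are promoted immediately, and running this on every tick is idempotent.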

3. Reflective phase / skill auto-generation — concrete diffs vs `auto-skill-evolver`

Refs: agent/curator.py (1674 LOC), agent/skill_commands.py, tools/skill_provenance.py, tools/skill_manager_tool.py:713–790.

Architectural differences from our auto-skill-evolver:

| Aspect | Hermes curator | Our auto-skill-evolver |
| --- | --- | --- |
| Trigger | Inactivity + interval (default `7 days`, `min_idle_hours=2`); see `agent/curator.py:1656` `maybe_run_curator` | Every ~25 tool calls via PreToolUse hook |
| Scope | Only touches agent-created skills (`tools/skill_provenance.py:75` `is_background_review`); user-written skills are immune | No origin tracking; can edit anything |
| Mode | LLM runs in a forked AIAgent with the auxiliary client (cheap model), not on main session prompt cache (`agent/curator.py:19`) | Runs inline using a cheap LLM call but no fork isolation |
| Output | Writes a per-run `REPORT.md` with before/after diff (`:1414`); auditable | Edits a skill file; commit message is the only audit |
| Lifecycle | States: `active → stale (30d unused) → archived (90d unused)`. Archive is recoverable, never auto-deletes (`:17`) | Binary edit; no lifecycle |
| Consolidation prompt | Explicitly tells the model to build umbrella skills and absorb related ones, with `absorbed_into` pointer for traceability (`:723`, `:594`) | One-shot create-or-update |
| Snapshot | Pre-run snapshot of all skills (`:1320` `snapshot_skills(reason="pre-curator-run")`) so a botched curator run is reversible | None; relies on git |

Concrete diffs we should consider for auto-skill-evolver:

Stop running on every-25-tool-calls. Switch to inactivity-triggered (Hermes' min_idle_hours gate). The current cadence fires while the agent is busy, which is exactly when we don't want a side-fork chewing tokens. The gate sits at roughly agent/curator.py:1666–1670.
Add origin tracking via ContextVar. tools/skill_provenance.py is 79 lines and gives auto-skill-evolver a hard guarantee it only modifies skills it (or its ancestors) created. We've had the bug where evolver patches user-curated skills.
Add an absorbed_into pointer when deleting. When we consolidate skill A into B, record B's name on A's tombstone so we can find where things went later. Hermes does this at tools/skill_manager_tool.py:723.
Write a REPORT.md per run. Right now we have no record of "evolver ran at 03:14 and changed these 3 skills". Hermes' report includes a before/after diff and is what hermes curator status points at.
Active/Stale/Archived lifecycle is probably overkill for our scale (~70 active skills) — defer.
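For the first item, the inactivity gate could be as small as the sketch below. It uses the 7-day interval and min_idle_hours=2 defaults quoted above; the function name and the way timestamps are passed in are assumptions, not the Hermes maybe_run_curator signature.

```python
import time


def should_run_curator(last_run_at, last_activity_at, now=None,
                       interval_days=7, min_idle_hours=2):
    """Run only when the interval has elapsed AND the agent is idle.

    Timestamps are epoch seconds; last_run_at is None before the
    first run.
    """
    now = time.time() if now is None else now
    if last_run_at is not None and now - last_run_at < interval_days * 86400:
        return False  # ran recently enough
    return now - last_activity_at >= min_idle_hours * 3600
```

Calling this from an existing periodic tick would replace the PreToolUse trigger without adding a new scheduler.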

4. Channel gateway patterns — what ccgram should adopt

Refs: gateway/platforms/telegram.py:604–661 (the 409 handler), :438–477 (_drain_polling_connections), :481–602 (_handle_polling_network_error), gateway/channel_directory.py.

ccgram currently dies on 409 Conflict. Hermes' design is meaningfully better and is a near-drop-in for ~/workspace/ccgram/:

4.1 Conflict retry with drained pool

_handle_polling_conflict (telegram.py:604):

- Increments _polling_conflict_count. While ≤ 3: stop the updater, sleep 10 s, drain the httpx connection pool used for getUpdates, restart start_polling with drop_pending_updates=False. Reset counter on success.
- On exhaustion: sets a typed fatal error telegram_polling_conflict with a human-readable message that explicitly names the most likely culprit ("possibly OpenClaw or another Hermes instance"). Notifies via _notify_fatal_error().

The drained-pool detail at :525–528 is key. PTB's underlying httpx connections will silently keep the long-poll session open server-side even after updater.stop() returns; without the drain, the very next start_polling immediately hits 409 again. This is likely the hidden cause behind the repeat 409s in our ccgram reproductions.

ccgram diff: wrap the polling loop in this same retry shape. Move the "die immediately" path to "die after 3 × 10 s with explicit message". The RTM/CLAUDE.md note about mc-telegram-bridge collisions becomes a single-line log warning instead of an outage.
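The retry shape can be sketched independently of the Telegram library by injecting the start, stop, and drain steps as callables. Everything named here (start_with_conflict_retry, PollingConflictFatal, the is_conflict predicate) is hypothetical; the real ccgram wiring to the polling calls is not shown.

```python
import time


class PollingConflictFatal(RuntimeError):
    """Raised once the conflict retries are exhausted."""


def start_with_conflict_retry(start, stop, drain, is_conflict,
                              max_retries=3, delay=10.0, sleep=time.sleep):
    """Call start(); on a conflict error stop, wait, drain, and retry.

    Returns the number of conflicts seen before polling started.
    """
    conflicts = 0
    while True:
        try:
            start()
            return conflicts
        except Exception as exc:
            if not is_conflict(exc):
                raise  # network and auth errors take a different path
            conflicts += 1
            if conflicts > max_retries:
                raise PollingConflictFatal(
                    f"409 Conflict after {max_retries} retries; another "
                    "process is probably polling with the same bot token"
                ) from exc
            stop()
            sleep(delay)
            drain()
```

Injecting sleep keeps the function testable without waiting 30 s, and the fatal message is the place to name the likely colliding process.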

4.2 Network-error reconnect (separate from 409)

_handle_polling_network_error (:481–602) handles the transient case (DNS hiccup, TLS timeout) with exponential-ish backoff and a follow-up _verify_polling_after_reconnect probe that confirms getMe returns within HEARTBEAT_PROBE_DELAY. Distinguishing transient network failure from auth/conflict failure is the kind of thing we keep half-implementing.

4.3 Channel directory cache

gateway/channel_directory.py builds and refreshes (every 5 min) a JSON map of every reachable channel/contact across all platforms, written atomically to ~/.hermes/channel_directory.json. Our notify.py uses hardcoded chat IDs; this pattern would let ccgram answer "what topics/channels can I post to?" without an API round-trip every send.

ccgram diff: small — add a refresh task that snapshots known chat_id/topic_id pairs to JSON. Resolves "how do I post to a thread by name?" naturally.
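The atomic-write part is the only subtle bit. A minimal sketch using a temp file plus os.replace, with a hypothetical helper name and no claim about how Hermes implements it:

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so that
    # os.replace is a same-filesystem rename.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A refresh task would call this with the current chat_id/topic_id map; notify.py then only ever reads a complete file.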

5. Other load-bearing patterns

5.1 `BEGIN IMMEDIATE` everywhere writes happen

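hermes_cli/kanban_db.py:1118–1133 write_txn(). Every write goes through this context manager. A plain BEGIN is deferred: it takes no lock until the first statement, so a transaction that reads first and then writes has to upgrade mid-flight and can fail with SQLITE_BUSY when another connection has written in the meantime (a busy timeout does not rescue a stale read snapshot under WAL). BEGIN IMMEDIATE takes the write lock up front, so contention shows up as a wait at BEGIN instead. MC's app.py uses raw sqlite3 connections; not all writes are wrapped. Worth a 30-min sweep to make every multi-statement write go through one helper.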
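A sketch of such a helper for MC, written from the description here and not copied from Hermes. It assumes connections are opened with isolation_level=None so Python's sqlite3 module does not open implicit transactions of its own; the connect() helper and its 5 s timeout are illustrative.

```python
import sqlite3
from contextlib import contextmanager


def connect(path):
    """Open a connection in autocommit mode with WAL enabled."""
    conn = sqlite3.connect(path, isolation_level=None, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


@contextmanager
def write_txn(conn):
    """Run the body inside BEGIN IMMEDIATE; commit or roll back."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

Usage is `with write_txn(conn): ...` around every multi-statement write; an exception inside the block leaves the database unchanged.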

5.2 Idempotency keys for ticket creation

hermes_cli/kanban_db.py:1147 "Callers that care about idempotency should pass idempotency_key to create_task rather than rely on id uniqueness." Hermes uses a 4-byte random task id (a 2^32 ≈ 4.3B space; the birthday estimate n²/2N gives about 1.2 expected colliding pairs at 100k ids, so collisions are realistic at that scale) and explicitly tells callers to pass an idempotency key for dedup. MC has the recurring "duplicate task" problem (per CLAUDE.md key rule #6); a tickets.idempotency_key UNIQUE column with INSERT … ON CONFLICT DO NOTHING would mechanize the rule. ~5 LOC + migration.
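A sketch of the dedup insert, assuming a tickets table with an integer id and the proposed UNIQUE idempotency_key column. The create_ticket name and return shape are hypothetical, and ON CONFLICT ... DO NOTHING needs SQLite 3.24 or newer.

```python
import sqlite3


def create_ticket(conn, title, idempotency_key):
    """Insert once per key; return (ticket id, created flag)."""
    cur = conn.execute(
        "INSERT INTO tickets (title, idempotency_key) VALUES (?, ?) "
        "ON CONFLICT(idempotency_key) DO NOTHING",
        (title, idempotency_key),
    )
    created = cur.rowcount == 1  # zero rows means the key already existed
    conn.commit()
    row = conn.execute(
        "SELECT id FROM tickets WHERE idempotency_key=?",
        (idempotency_key,),
    ).fetchone()
    return row[0], created
```

Callers that file the same ticket twice get the original id back with created set to False, so the duplicate never reaches the pickup queue.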

5.3 `has_spawnable_ready()` to distinguish "stuck" from "correctly idle"

hermes_cli/kanban_db.py:3446–3475. Health telemetry needs to distinguish "0 spawned because nothing's ready" from "0 spawned because something's ready but no profile can spawn it". Exactly the question MC's heartbeat dashboard sometimes can't answer. Borrow the pattern: a ticket assigned to a non-existent worker_role gets bucketed as skipped_nonspawnable, not idle.
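The bucketing could be sketched as a pure function over the ready tickets. The bucket names other than skipped_nonspawnable, the dict shape of a ticket, and the function name are assumptions for illustration.

```python
def classify_tick(ready_tickets, spawnable_roles, spawned):
    """Explain a dispatcher tick for health telemetry.

    ready_tickets: list of dicts with a "worker_role" key.
    spawnable_roles: set of roles a worker can be started for.
    spawned: number of workers started this tick.
    """
    if spawned > 0:
        return "active"
    roles = [t["worker_role"] for t in ready_tickets]
    if not roles:
        return "idle"  # nothing ready, so zero spawns is correct
    if any(role in spawnable_roles for role in roles):
        return "stuck"  # spawnable work exists but nothing was started
    return "skipped_nonspawnable"  # ready work, but no role can take it
```

Only the stuck bucket should page anyone; skipped_nonspawnable points at a misassigned ticket, not at a dispatcher fault.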

5.4 Pre-mutation snapshot before destructive operations

agent/curator.py:1320–1329 snapshot_skills(reason="pre-curator-run"). Best-effort; never blocks the run; logged at debug if it fails. Pattern: any automated mutator that touches >1 file takes a tarball snapshot first. Cheap insurance for auto-skill-evolver and any future ticket-bulk-edit feature.
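A best-effort tarball snapshot in that spirit might look like the following. It is not the Hermes snapshot_skills: the function name, archive naming, and logger name are assumptions, and dest_dir is assumed to live outside src_dir.

```python
import logging
import os
import tarfile
import time

log = logging.getLogger("snapshot")


def snapshot_dir(src_dir, dest_dir, reason):
    """Tar up src_dir before a mutating run; never raise.

    Returns the archive path, or None if the snapshot failed.
    """
    try:
        if not os.path.isdir(src_dir):
            raise FileNotFoundError(src_dir)
        os.makedirs(dest_dir, exist_ok=True)
        stamp = time.strftime("%Y%m%dT%H%M%S")
        path = os.path.join(dest_dir, f"{stamp}-{reason}.tar.gz")
        with tarfile.open(path, "w:gz") as tar:
            tar.add(src_dir, arcname=os.path.basename(os.path.abspath(src_dir)))
        return path
    except Exception:
        # Best effort: a failed snapshot must not block the run.
        log.debug("snapshot failed; continuing without one", exc_info=True)
        return None
```

The caller ignores a None result and proceeds, matching the "never blocks the run" behavior described above.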

5.5 Fail-OPEN for judges, fail-CLOSED for breakers

hermes_cli/goals.py:18 "Judge failures are fail-OPEN: continue. A broken judge must not wedge progress; the turn budget is the backstop." vs. hermes_cli/kanban_db.py:3301 (circuit breaker is fail-CLOSED — trip on threshold).

The asymmetry is deliberate and worth codifying: anything advisory (LLM-as-judge, classifier, advisory webhook) should fail open with a hard backstop; anything authoritative (failure counter, claim CAS) should fail closed. We sometimes get this backwards — e.g. a transient classifier error blocks a ticket.
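The asymmetry can be written down as two tiny helpers. Both names are hypothetical and the judge is any callable returning a truthy "keep going" verdict; this is a sketch of the rule, not of the Hermes code.

```python
def should_continue(judge, state, turn, max_turns):
    """Advisory check: fail open, with the turn budget as backstop."""
    if turn >= max_turns:
        return False  # hard backstop, independent of the judge
    try:
        return bool(judge(state))
    except Exception:
        return True  # a broken judge must not wedge progress


def breaker_tripped(consecutive_failures, limit):
    """Authoritative check: fail closed, block at the threshold."""
    return consecutive_failures >= limit
```

A transient classifier error then costs at most the remaining turn budget, while the failure counter still blocks the ticket on schedule.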

6. Stuff to explicitly NOT borrow

run_agent.py (~12k LOC) and cli.py (~11k LOC). Massive monolith conversation loops with provider adapters, prompt cache, compression, etc. We don't run our own LLM loop — we shell out to claude/Codex/etc. Out of scope.
The whole gateway dispatcher process. Hermes' "gateway" is a long-running multi-platform message router that embeds the kanban dispatcher (per the deprecated standalone systemd unit at plugins/kanban/systemd/hermes-kanban-dispatcher.service). MC already has its own dispatch model (scheduler.py tick from cron + mc_pickup.py); merging into a single long-running process would be a rewrite, not a borrow.
agent/credential_pool.py (66.9 K) and agent/anthropic_adapter.py (86 K) and friends. Provider-key rotation, slot-aware credential routing — we use provider-switch and the Claude wrapper for this; their model is heavier and tuned for paying-customer multi-account setups.
Curator's full active/stale/archived lifecycle. Designed for installs with hundreds of agent-generated skills. We have ~70 total. Adopt the trigger and report and origin tracking; skip the lifecycle states until our skill count grows materially.
tinker-atropos/, tui_gateway/server.py (231 K), ui-tui/. Their TUI/training stack. Out of scope.
The deprecated standalone dispatcher unit. The repo itself warns against it; gateway-embedded is now the intended path. We just want the algorithm (dispatch_once + helpers), not the service shape.
Per-task auxiliary LLM "specifier" (hermes_cli/kanban_specify.py). Calls a cheap LLM to flesh out a one-line ticket into goal/approach/acceptance criteria. Tempting, but Lucienne already does this on the Mac side when filing tickets to MC. Adding a second AI inflater on the Hetzner side would cause drift. Skip.

Recommended first surgical PR (if we cherry-pick anything)

If we ship exactly one PR from this scan, it should bundle the following:

Add claim_lock TEXT, claim_expires INTEGER, consecutive_failures INTEGER DEFAULT 0, max_retries INTEGER, last_heartbeat_at TEXT, last_failure_error TEXT columns to tickets.
Add a ticket_runs table mirroring kanban's task_runs shape.
Replace mc_pickup.py's pickup query with the single-statement CAS.
Add release_stale_claims(), detect_crashed_workers(), enforce_max_runtime(), _record_task_failure() ports of the kanban_db functions, calling them from the existing tick.
Wire gave_up events to a Telegram notification.

That's roughly 200–250 LOC plus a migration, and gets us atomic claims, TTL reclaim, crash detection, runtime caps, circuit breaking, and a usable run history — basically everything MC is currently missing on the worker-lifecycle side. Everything else in this report is optional after that.