
Hermes Agent code scan — patterns to borrow into Mission Control

All file:line refs below are paths under /tmp/hermes-scan/.


1. Executive summary


2. Kanban patterns worth borrowing into MC

2.1 task_runs history table + current_run_id pointer

Refs: hermes_cli/kanban_db.py:677–725 (TaskRun dataclass), :822–857 (CREATE TABLE task_runs), :1838–1865 (insert run on claim), :1051–1100 (CAS-guarded sync of current_run_id).

Hermes splits "the task" (one logical unit) from "an attempt to run it" (a row in task_runs). Each claim_task inserts a fresh task_runs row with its own claim_lock, claim_expires, worker_pid, started_at, last_heartbeat_at, and on completion writes outcome ∈ {done, gave_up, reclaimed, crashed, timed_out, spawn_failed} plus a structured summary for the next worker in the chain. The tasks.current_run_id pointer is updated under a CAS so a racing claim can't corrupt it.

How it'd work in MC: add a ticket_runs table mirroring the existing task_runs (which today is a scheduler concept, not a ticket concept). Each MC worker spawn becomes a row. Ticket state flips to in_progress only when a ticket_runs row exists; on retry we get a new row instead of overwriting worker_pid/worker_started. The dashboard's "session history" section becomes a trivial SELECT * FROM ticket_runs WHERE ticket_id=? ORDER BY started_at DESC — today MC has to grep logs.
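A minimal sketch of what the ticket_runs split could look like, assuming SQLite via Python's sqlite3. The column set follows the Hermes fields listed above, but the exact MC schema, the start_run and run_history helpers, and the epoch-seconds timestamps are assumptions for illustration. The CAS guard on current_run_id is left out here for brevity.

```python
import sqlite3
import time

# Hypothetical schema: the real MC tickets table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tickets (
    id             TEXT PRIMARY KEY,
    status         TEXT NOT NULL DEFAULT 'pending',
    current_run_id INTEGER
);
CREATE TABLE IF NOT EXISTS ticket_runs (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    ticket_id         TEXT NOT NULL REFERENCES tickets(id),
    worker_pid        INTEGER,
    started_at        REAL NOT NULL,
    last_heartbeat_at REAL,
    ended_at          REAL,
    outcome           TEXT,  -- done, gave_up, reclaimed, crashed, timed_out, spawn_failed
    summary           TEXT
);
"""


def start_run(conn, ticket_id, worker_pid, now=None):
    """Insert a fresh attempt row and point the ticket at it."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO ticket_runs (ticket_id, worker_pid, started_at) "
            "VALUES (?, ?, ?)",
            (ticket_id, worker_pid, now),
        )
        run_id = cur.lastrowid
        conn.execute(
            "UPDATE tickets SET status='in_progress', current_run_id=? WHERE id=?",
            (run_id, ticket_id),
        )
    return run_id


def run_history(conn, ticket_id):
    """Newest attempt first: the dashboard's session-history query."""
    return conn.execute(
        "SELECT id, worker_pid, outcome FROM ticket_runs "
        "WHERE ticket_id=? ORDER BY started_at DESC",
        (ticket_id,),
    ).fetchall()
```

A retry calls start_run again, so the earlier row keeps its worker_pid and outcome instead of being overwritten.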

2.2 Atomic claim with CAS in a single UPDATE

Refs: hermes_cli/kanban_db.py:1780–1866 (claim_task), :1118–1133 (write_txn IMMEDIATE wrapper).

The whole claim is one statement:

UPDATE tasks
   SET status='running', claim_lock=?, claim_expires=?,
       started_at=COALESCE(started_at, ?)
 WHERE id=? AND status='ready' AND claim_lock IS NULL

If cur.rowcount != 1, the claimer lost the race and bails. SQLite permits only one writer at a time, and BEGIN IMMEDIATE takes that write lock before the UPDATE runs, so competing claimers are serialized; the claim_lock IS NULL guard then makes the SQL itself the lock. There's no advisory file lock, no flock, no Redis.

How it'd work in MC: today mc_pickup.py:1956 (pending -> claimed) doesn't appear to use a single-statement CAS — if multiple mc_pickup invocations ever overlap (e.g. cron tick + manual restart) two workers can claim the same ticket. Replace the pickup query with the same CAS pattern, plus add claim_lock TEXT and claim_expires INTEGER columns to tickets. ~10 lines including the migration.
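A sketch of the CAS claim as it might look on the MC side, assuming a sqlite3 connection opened with isolation_level=None so the code controls BEGIN and COMMIT itself. The claim_ticket name, the uuid token, and the pending/claimed status values are assumptions; only the shape of the UPDATE follows the Hermes statement above.

```python
import sqlite3
import time
import uuid


def claim_ticket(conn, ticket_id, ttl_seconds=900, now=None):
    """Try to claim one ticket. Returns the claim token, or None if the race was lost."""
    now = int(time.time()) if now is None else now
    token = uuid.uuid4().hex
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        cur = conn.execute(
            "UPDATE tickets SET status='claimed', claim_lock=?, claim_expires=? "
            "WHERE id=? AND status='pending' AND claim_lock IS NULL",
            (token, now + ttl_seconds, ticket_id),
        )
        won = cur.rowcount == 1  # 0 rows means someone else already claimed it
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return token if won else None
```

The loser gets None without raising, so two overlapping mc_pickup invocations can both call this safely and only one spawns a worker.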

2.3 Stale-claim reclaim + heartbeat

Refs: hermes_cli/kanban_db.py:1869–1897 (heartbeat_claim), :1900–1944 (release_stale_claims), :98 (DEFAULT_CLAIM_TTL_SECONDS = 15*60).

15-minute claim TTL by default. Any worker that lives longer must call heartbeat_claim() to extend its lock; otherwise the next dispatcher tick reclaims the task back to ready with a reclaimed event and (importantly) sends SIGTERM/SIGKILL to the orphaned PID first (_terminate_reclaimed_worker). This is the missing piece in MC — we have heartbeats elsewhere but no contract that a stalled worker gets reaped after N minutes.

How it'd work in MC: add claim_expires + last_heartbeat_at to tickets. Run release_stale_claims() in mc_pickup.py's tick (every 60 s already). Workers call a heartbeat() helper every couple of minutes. Long-running tickets that legitimately need >15 min just declare a longer TTL on claim (ttl_seconds= arg). Today, when a worker dies silently, the ticket stays in_progress forever and we discover it manually.
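The heartbeat and reclaim pair could be sketched as below. This is an assumption-laden outline, not the Hermes code: timestamps are epoch seconds, the status values are MC-style guesses, and the step that terminates the orphaned PID before reclaiming is only marked with a comment.

```python
import sqlite3
import time

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60


def heartbeat(conn, ticket_id, claim_lock, ttl_seconds=DEFAULT_CLAIM_TTL_SECONDS, now=None):
    """Extend the claim, but only if this worker still owns it."""
    now = int(time.time()) if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE tickets SET claim_expires=?, last_heartbeat_at=? "
            "WHERE id=? AND claim_lock=? AND status='in_progress'",
            (now + ttl_seconds, now, ticket_id, claim_lock),
        )
    return cur.rowcount == 1


def release_stale_claims(conn, now=None):
    """Return expired in_progress tickets to pending. Returns the reclaimed ids."""
    now = int(time.time()) if now is None else now
    with conn:
        stale = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM tickets "
                "WHERE status='in_progress' AND claim_expires < ? ORDER BY id",
                (now,),
            )
        ]
        for ticket_id in stale:
            # A real port would SIGTERM/SIGKILL the orphaned worker PID here first.
            conn.execute(
                "UPDATE tickets SET status='pending', claim_lock=NULL, claim_expires=NULL "
                "WHERE id=? AND status='in_progress' AND claim_expires < ?",
                (ticket_id, now),
            )
    return stale
```

A heartbeat from a worker whose claim was already reclaimed returns False, which gives that worker a signal to stop rather than keep writing.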

2.4 Crashed-worker detection with _classify_worker_exit

Refs: hermes_cli/kanban_db.py:3104–3225 (detect_crashed_workers), :3146 (_classify_worker_exit).

Each tick: for every running ticket on this host, check if worker_pid is alive. If not, classify why (clean rc=0 vs nonzero vs signal). Crucial detail at :3147–3162: a worker that exited cleanly (rc=0) while the task is still running is treated as a protocol violation, i.e. the LLM "answered conversationally" without calling the terminal tool. This trips the breaker on the FIRST occurrence (failure_limit=1), not on the Nth. Without this, a retried worker whose CLI keeps exiting 0 without calling kanban_complete loops forever.

How it'd work in MC: MC's worker subprocess design is similar — Claude CLI exits, ticket stays in_progress. The same detection applies. This is independent of the TTL reclaim above (faster detection); both should run in the same tick.
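The classification step might be sketched as follows, assuming the dispatcher holds the worker's return code in the subprocess.Popen convention (a negative value means the process was killed by that signal). The label strings and the failure_limit_for helper are hypothetical, not Hermes' names.

```python
def classify_worker_exit(returncode, ticket_still_in_progress):
    """Map a finished worker's return code to a failure kind, or None if nothing is wrong."""
    if returncode < 0:
        return f"signal_{-returncode}"  # Popen reports death by signal N as -N
    if returncode > 0:
        return "nonzero_exit"
    # rc == 0: only a problem if the worker never marked the ticket complete
    return "protocol_violation" if ticket_still_in_progress else None


def failure_limit_for(kind, default_limit=3):
    """Protocol violations trip the breaker immediately; other kinds use the default."""
    return 1 if kind == "protocol_violation" else default_limit
```

Keeping the rc=0 case separate is what allows the first-occurrence breaker: a clean exit with the ticket still open is not a transient fault that a retry is likely to fix.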

2.5 Unified failure counter + max_retries per-ticket override

Refs: hermes_cli/kanban_db.py:3231–3401 (_record_task_failure), :3424–3439 (_clear_failure_counter).

One function handles all non-success outcomes (spawn_failed / crashed / timed_out / protocol_violation). Increments consecutive_failures; if >= effective_limit, flips ticket → blocked with last_failure_error populated and emits a gave_up event. Resolution order for the threshold:

  1. per-task max_retries if set (the task says "try me 5 times")
  2. caller-supplied failure_limit (config default)
  3. DEFAULT_FAILURE_LIMIT (3)

Counter is cleared only on successful completion — not on successful spawn. Comment at :3623: "A successful spawn proves the worker can start but doesn't prove the run will succeed."

How it'd work in MC: MC has no breaker today — a busted ticket can be re-picked indefinitely. Add consecutive_failures INTEGER DEFAULT 0 and max_retries INTEGER to tickets, route every failure path through one helper, auto-flip to a new blocked status (or use existing waiting) at threshold. Telegram gave_up notification fires immediately so Elmar sees "ticket X blocked after 3 failures".
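The threshold resolution and the single failure helper could look roughly like this. It is a sketch under assumptions: the function names, the pending/blocked status values, and the table layout are illustrative, and the gave_up notification is left to the caller.

```python
import sqlite3

DEFAULT_FAILURE_LIMIT = 3


def effective_limit(max_retries, failure_limit):
    """Per-ticket max_retries wins, then the caller's limit, then the default."""
    if max_retries is not None:
        return max_retries
    if failure_limit is not None:
        return failure_limit
    return DEFAULT_FAILURE_LIMIT


def record_failure(conn, ticket_id, error, failure_limit=None):
    """Bump the counter and block the ticket at the threshold. Returns True if blocked."""
    with conn:
        conn.execute(
            "UPDATE tickets SET consecutive_failures = consecutive_failures + 1, "
            "last_failure_error=? WHERE id=?",
            (error, ticket_id),
        )
        failures, max_retries = conn.execute(
            "SELECT consecutive_failures, max_retries FROM tickets WHERE id=?",
            (ticket_id,),
        ).fetchone()
        blocked = failures >= effective_limit(max_retries, failure_limit)
        conn.execute(
            "UPDATE tickets SET status=? WHERE id=?",
            ("blocked" if blocked else "pending", ticket_id),
        )
    return blocked


def clear_failures(conn, ticket_id):
    """Call only on successful completion, never on a successful spawn."""
    with conn:
        conn.execute("UPDATE tickets SET consecutive_failures=0 WHERE id=?", (ticket_id,))
```

The caller would fire the Telegram gave_up message whenever record_failure returns True.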

2.6 Per-task max_runtime_seconds with SIGTERM → SIGKILL

Refs: hermes_cli/kanban_db.py:2976–3086 (enforce_max_runtime), :3089–3101 (set_max_runtime).

For each running ticket with max_runtime_seconds set, if now - started_at > limit: SIGTERM the worker pid, poll 10× 0.5 s, SIGKILL if still alive, write a timed_out outcome with {pid, elapsed, limit, sigkill} payload, drop ticket back to ready. Importantly, runtime is measured from task_runs.started_at not tasks.started_at — so retries get a fresh budget per attempt.

How it'd work in MC: scheduled tasks already have a timeout concept in ~/workspace/tasks/, but ad-hoc MC tickets don't. Adding max_runtime_seconds per ticket lets us cap the long-tail tickets that hang on a remote SSH or a stuck Claude CLI session. ~40 LOC including the kill ladder.
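The kill ladder can be sketched on POSIX as below. The helper names are hypothetical, and the defaults mirror the 10 polls of 0.5 s described above. The sketch assumes the caller may be the worker's parent, so it reaps with waitpid first; an unreaped child stays a zombie and a plain signal-0 probe would keep reporting it as alive.

```python
import os
import signal
import time


def _has_exited(pid):
    """True once pid is gone. Reaps it if it is our own child."""
    try:
        reaped, _status = os.waitpid(pid, os.WNOHANG)
        return reaped == pid
    except ChildProcessError:
        pass  # not our child: fall back to a signal-0 probe
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # exists, but owned by another user
    return False


def terminate_with_ladder(pid, polls=10, interval=0.5):
    """SIGTERM, poll, then SIGKILL. Returns True if SIGKILL had to be sent."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # already gone
    for _ in range(polls):
        if _has_exited(pid):
            return False
        time.sleep(interval)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False
    return True
```

The boolean return maps onto the sigkill field of the timed_out payload.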

2.7 Dispatcher zombie-reaper (waitpid(-1, WNOHANG))

Refs: hermes_cli/kanban_db.py:3527–3538.

Without this, every _default_spawn'd worker that finishes becomes a <defunct> zombie because the dispatcher (gateway-embedded) is the parent and never waitpid()s. They linger until gateway exit. Hermes runs a WNOHANG reap loop on every tick.

How it'd work in MC: mc_pickup.py is the parent of every ticket worker. If we ever hit ulimit -u on Hetzner it'll be from accumulated zombies. Worth adding the 3-line reap loop now, before it bites.
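The reap loop itself is small. A sketch, assuming a POSIX host; the reap_zombies name is hypothetical:

```python
import os


def reap_zombies():
    """Reap every child that has already exited, without blocking. Returns the reaped pids."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

One caveat to keep in mind: a blanket waitpid(-1) also consumes exit statuses that a subprocess.Popen object would otherwise report, so if crash classification depends on the return code, the status should be captured here instead of discarded.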

2.8 recompute_ready parent-link scheduling

Refs: hermes_cli/kanban_db.py:1747–1773, task_links table at :799.

task_links(parent_id, child_id) is a DAG. todo → ready only when all parents are done. Every dispatcher tick re-evaluates promotions.

How it'd work in MC: we don't have ticket dependencies today. The "workflow children" auto-spawn that was disabled (per ~/workspace/CLAUDE.md) was a different (worse) shape — it spawned phantom child tickets with no completion semantics. A ticket_links table with parent/child + this 25-line recompute_ready would give us real dependencies without bringing back the WORKFLOW_TEMPLATES anti-pattern. Optional — only adopt if we actually want DAG tickets.
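If MC did adopt ticket dependencies, the promotion step could be a single UPDATE. This is a sketch under assumptions: the ticket_links table and the todo/ready/done status values are borrowed from the Hermes description, not from MC's current schema.

```python
import sqlite3


def recompute_ready(conn):
    """Promote todo tickets whose parents are all done. Returns the number promoted."""
    with conn:
        cur = conn.execute(
            """
            UPDATE tickets SET status='ready'
             WHERE status='todo'
               AND NOT EXISTS (
                   SELECT 1
                     FROM ticket_links l
                     JOIN tickets p ON p.id = l.parent_id
                    WHERE l.child_id = tickets.id
                      AND p.status != 'done')
            """
        )
    return cur.rowcount
```

A ticket with no parents has no blocking rows, so it is promoted on the first tick; a grandchild stays in todo until its own parent reaches done.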


3. Reflective phase / skill auto-generation — concrete diffs vs auto-skill-evolver

Refs: agent/curator.py (1674 LOC), agent/skill_commands.py, tools/skill_provenance.py, tools/skill_manager_tool.py:713–790.

Architectural differences from our auto-skill-evolver:

| Aspect | Hermes curator | Our auto-skill-evolver |
|---|---|---|
| Trigger | Inactivity + interval (default 7 days, min_idle_hours=2); agent/curator.py:1656 maybe_run_curator | Every ~25 tool calls via PreToolUse hook |
| Scope | Only touches agent-created skills (tools/skill_provenance.py:75 is_background_review); user-written skills are immune | No origin tracking; can edit anything |
| Mode | LLM runs in a forked AIAgent with the auxiliary client (cheap model), not on main session prompt cache (agent/curator.py:19) | Runs inline using a cheap LLM call but no fork isolation |
| Output | Writes a per-run REPORT.md with before/after diff (:1414); auditable | Edits a skill file; commit message is the only audit |
| Lifecycle | States: active → stale (30d unused) → archived (90d unused). Archive is recoverable, never auto-deletes (:17) | Binary edit; no lifecycle |
| Consolidation prompt | Explicitly tells the model to build umbrella skills and absorb related ones, with absorbed_into pointer for traceability (:723, :594) | One-shot create-or-update |
| Snapshot | Pre-run snapshot of all skills (:1320 snapshot_skills(reason="pre-curator-run")) so a botched curator run is reversible | None; relies on git |

Concrete diffs we should consider for auto-skill-evolver:

  1. Stop running on every-25-tool-calls. Switch to inactivity-triggered (Hermes' min_idle_hours gate). The current cadence runs when the agent is busy, which is exactly when we don't want a side-fork chewing tokens. The gate is at roughly agent/curator.py:1666–1670.
  2. Add origin tracking via ContextVar. tools/skill_provenance.py is 79 lines and gives auto-skill-evolver a hard guarantee it only modifies skills it (or its ancestors) created. We've had the bug where evolver patches user-curated skills.
  3. Add an absorbed_into pointer when deleting. When we consolidate skill A into B, record B's name on A's tombstone so we can find where things went later. Hermes does this at tools/skill_manager_tool.py:723.
  4. Write a REPORT.md per run. Right now we have no record of "evolver ran at 03:14 and changed these 3 skills". Hermes' report includes a before/after diff and is what hermes curator status points at.
  5. Active/Stale/Archived lifecycle is probably overkill for our scale (~70 active skills) — defer.
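For item 2 in the list above, origin tracking with a ContextVar could be sketched like this. Only the is_background_review name comes from the Hermes reference; the context manager, the created_by metadata field, and the may_modify rule are assumptions about how auto-skill-evolver might use it.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Who is writing skills in the current context: "user" unless a review run says otherwise.
_write_origin = ContextVar("skill_write_origin", default="user")


@contextmanager
def background_review():
    """Mark everything inside the block as written by the evolver."""
    token = _write_origin.set("background_review")
    try:
        yield
    finally:
        _write_origin.reset(token)


def is_background_review():
    return _write_origin.get() == "background_review"


def may_modify(skill_meta):
    """Evolver runs may only touch skills that an earlier evolver run created."""
    if not is_background_review():
        return True  # interactive edits by the user are always allowed
    return skill_meta.get("created_by") == "background_review"
```

Wrapping the evolver's entry point in background_review() and checking may_modify before every write would turn the "don't patch user-curated skills" rule into a hard gate.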

4. Channel gateway patterns — what ccgram should adopt

Refs: gateway/platforms/telegram.py:604–661 (the 409 handler), :438–477 (_drain_polling_connections), :481–602 (_handle_polling_network_error), gateway/channel_directory.py.

ccgram currently dies on 409 Conflict. Hermes' design is meaningfully better and is a near-drop-in for ~/workspace/ccgram/:

4.1 Conflict retry with drained pool

_handle_polling_conflict (telegram.py:604):

  - Increments _polling_conflict_count. While ≤ 3: stop the updater, sleep 10 s, drain the httpx connection pool used for getUpdates, restart start_polling with drop_pending_updates=False. Reset counter on success.
  - On exhaustion: sets a typed fatal error telegram_polling_conflict with a human-readable message that explicitly names the most likely culprit ("possibly OpenClaw or another Hermes instance"). Notifies via _notify_fatal_error().

The drained-pool detail at :525–528 is key. PTB's underlying httpx connections will silently keep the long-poll session open server-side even after updater.stop() returns; without the drain, the very next start_polling immediately hits 409 again. This is likely the hidden cause in our ccgram 409 reproductions.

ccgram diff: wrap the polling loop in this same retry shape. Move the "die immediately" path to "die after 3 × 10 s with explicit message". The RTM/CLAUDE.md note about mc-telegram-bridge collisions becomes a single-line log warning instead of an outage.
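The retry shape, stripped of any Telegram library specifics, might look like the sketch below. PollingConflict and the three callables are stand-ins that ccgram would have to supply (for example by wrapping its PTB updater); the simplification that start_polling raises on conflict is an assumption, since real conflicts usually surface through an error callback.

```python
import asyncio


class PollingConflict(Exception):
    """Stand-in for the Telegram 409 Conflict error."""


async def poll_with_conflict_retry(start_polling, stop_polling, drain_pool,
                                   max_retries=3, delay=10.0):
    """Start polling, retrying on conflict. Returns how many conflicts were absorbed."""
    conflicts = 0
    while True:
        try:
            await start_polling()
            return conflicts
        except PollingConflict:
            conflicts += 1
            if conflicts > max_retries:
                raise RuntimeError(
                    "telegram_polling_conflict: another process appears to be "
                    "polling with this bot token"
                )
            await stop_polling()
            await asyncio.sleep(delay)
            await drain_pool()  # drop lingering long-poll connections before restarting
```

The typed error string at the end is where the "who is the likely culprit" hint would go, so the operator sees a cause instead of a bare stack trace.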

4.2 Network-error reconnect (separate from 409)

_handle_polling_network_error (:481–602) handles the transient case (DNS hiccup, TLS timeout) with exponential-ish backoff and a follow-up _verify_polling_after_reconnect probe that confirms getMe returns within HEARTBEAT_PROBE_DELAY. Distinguishing transient network failure from auth/conflict failure is the kind of thing we keep half-implementing.

4.3 Channel directory cache

gateway/channel_directory.py builds and refreshes (every 5 min) a JSON map of every reachable channel/contact across all platforms, written atomically to ~/.hermes/channel_directory.json. Our notify.py uses hardcoded chat IDs; this pattern would let ccgram answer "what topics/channels can I post to?" without an API round-trip every send.

ccgram diff: small — add a refresh task that snapshots known chat_id/topic_id pairs to JSON. Resolves "how do I post to a thread by name?" naturally.
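The atomic-write part of the directory cache is a standard temp-file-then-rename pattern. A sketch, with a hypothetical write_json_atomic helper and an invented directory layout:

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Creating the temp file in the target directory matters: os.replace is only atomic within one filesystem.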


5. Other load-bearing patterns

5.1 BEGIN IMMEDIATE everywhere writes happen

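hermes_cli/kanban_db.py:1118–1133 write_txn(). Every write goes through this context manager. With a plain BEGIN (deferred), a transaction takes the write lock only at its first write; if it read first and another connection committed in between, the upgrade fails with SQLITE_BUSY even under WAL, and the busy timeout does not retry it. BEGIN IMMEDIATE takes the write lock up front, so contention can only surface at BEGIN, where waiting is safe. MC's app.py uses raw sqlite3 connections; not all writes are wrapped. Worth a 30-min sweep to make every multi-statement write go through one helper.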
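A write_txn-style helper for MC could be as small as the sketch below. It assumes the connection is opened with isolation_level=None so that Python's sqlite3 module does not start transactions on its own; the body is an illustration, not a copy of the Hermes function.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def write_txn(conn):
    """Run a block inside BEGIN IMMEDIATE, committing on success and rolling back on error."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before any statement runs
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

Usage would be `with write_txn(conn) as c: c.execute(...)` around every multi-statement write, which also makes the unwrapped writes easy to find in a sweep.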

5.2 Idempotency keys for ticket creation

hermes_cli/kanban_db.py:1147 "Callers that care about idempotency should pass idempotency_key to create_task rather than rely on id uniqueness." Hermes uses a 4-byte random task id (2^32 ≈ 4.3B space; by the birthday approximation the chance of at least one collision is ~1.2e-3 at about 3,200 tasks and roughly 69% at 100k) and explicitly tells callers to pass an idempotency key for dedup. MC has the recurring "duplicate task" problem (per CLAUDE.md key rule #6); a tickets.idempotency_key UNIQUE column with INSERT … ON CONFLICT DO NOTHING would mechanize the rule. ~5 LOC + migration.
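A sketch of the idempotent insert, assuming SQLite 3.24 or newer for the ON CONFLICT clause. The create_ticket signature and the title column are hypothetical; the 4-byte hex id mirrors the Hermes id size only for illustration.

```python
import secrets
import sqlite3


def create_ticket(conn, title, idempotency_key=None):
    """Insert a ticket unless the key was already used. Returns (ticket_id, created)."""
    ticket_id = secrets.token_hex(4)  # 4 random bytes, hex-encoded
    with conn:
        cur = conn.execute(
            "INSERT INTO tickets (id, title, idempotency_key) VALUES (?, ?, ?) "
            "ON CONFLICT(idempotency_key) DO NOTHING",
            (ticket_id, title, idempotency_key),
        )
        if cur.rowcount == 1:
            return ticket_id, True
        # Key already present: hand back the ticket that owns it.
        row = conn.execute(
            "SELECT id FROM tickets WHERE idempotency_key=?", (idempotency_key,)
        ).fetchone()
    return row[0], False
```

Because SQLite treats NULLs as distinct in a UNIQUE column, callers that pass no key still always get a new ticket. An id collision is deliberately not swallowed: the conflict target is only the key, so a duplicate id raises IntegrityError.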

5.3 has_spawnable_ready() to distinguish "stuck" from "correctly idle"

hermes_cli/kanban_db.py:3446–3475. Health telemetry needs to distinguish "0 spawned because nothing's ready" from "0 spawned because something's ready but no profile can spawn it". Exactly the question MC's heartbeat dashboard sometimes can't answer. Borrow the pattern: a ticket assigned to a non-existent worker_role gets bucketed as skipped_nonspawnable, not idle.

5.4 Pre-mutation snapshot before destructive operations

agent/curator.py:1320–1329 snapshot_skills(reason="pre-curator-run"). Best-effort; never blocks the run; logged at debug if it fails. Pattern: any automated mutator that touches >1 file takes a tarball snapshot first. Cheap insurance for auto-skill-evolver and any future ticket-bulk-edit feature.

5.5 Fail-OPEN for judges, fail-CLOSED for breakers

hermes_cli/goals.py:18 "Judge failures are fail-OPEN: continue. A broken judge must not wedge progress; the turn budget is the backstop." vs. hermes_cli/kanban_db.py:3301 (circuit breaker is fail-CLOSED — trip on threshold).

The asymmetry is deliberate and worth codifying: anything advisory (LLM-as-judge, classifier, advisory webhook) should fail open with a hard backstop; anything authoritative (failure counter, claim CAS) should fail closed. We sometimes get this backwards — e.g. a transient classifier error blocks a ticket.
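The fail-open half of the rule can be sketched as a thin wrapper around any advisory check. The judge_allows name and its parameters are hypothetical; the turn budget stands in for whatever hard backstop the caller already has.

```python
def judge_allows(judge, turn, turns_used, turn_budget):
    """Advisory check: a broken judge never blocks progress, but the budget always does."""
    if turns_used >= turn_budget:
        return False  # hard backstop, independent of the judge
    try:
        return bool(judge(turn))
    except Exception:
        return True  # fail open: a judge error is not a verdict
```

The authoritative side (failure counter, claim CAS) gets no such wrapper: an error there should propagate and stop the operation.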


6. Stuff to explicitly NOT borrow


Recommended first surgical PR (if we cherry-pick anything)

If we do exactly one thing from this scan:

  1. Add claim_lock TEXT, claim_expires INTEGER, consecutive_failures INTEGER DEFAULT 0, max_retries INTEGER, last_heartbeat_at TEXT, last_failure_error TEXT columns to tickets.
  2. Add a ticket_runs table mirroring kanban's task_runs shape.
  3. Replace mc_pickup.py's pickup query with the single-statement CAS.
  4. Add release_stale_claims(), detect_crashed_workers(), enforce_max_runtime(), _record_task_failure() ports of the kanban_db functions, calling them from the existing tick.
  5. Wire gave_up events to a Telegram notification.

That's roughly 200–250 LOC plus a migration, and gets us atomic claims, TTL reclaim, crash detection, runtime caps, circuit breaking, and a usable run history — basically everything MC is currently missing on the worker-lifecycle side. Everything else in this report is optional after that.