ccgram: Telegram RetryAfter/rate-limit delivery hardening for luci-persistent/foreign tmux sessions
## Bug: ccgram does not deliver INBOUND messages from foreign tmux sessions (e.g. luci-persistent) **Reported:** Elmar, 2026-06-10. "Messages TO luci-persistent work, but I get...
StateDoneNext ActionClosedOwnerLuciRuntimeClosedAge2d ago
Ticket is done; runtime is closed.·profile claude_opus_1m_medium
Description
MC-4940
## Bug: ccgram does not deliver INBOUND messages from foreign tmux sessions (e.g. luci-persistent)
**Reported:** Elmar, 2026-06-10. "Messages TO luci-persistent work, but I get nothing back FROM it."
### Symptom
- Outbound (Telegram topic → luci-persistent): works (send_keys to the bound window).
- Inbound (luci-persistent output → Telegram): silently dropped. Same for any non-`ccgram:` tmux session (e.g. `mc-root:`).
### Evidence chain (all verified live on Luci)
1. Topic 1227 (user 246672479) is bound to foreign window `luci-persistent:@2` in `~/.ccgram/state.json` thread_bindings. ✓ (outbound proves the binding.)
2. luci-persistent's Claude hooks fire — `~/.ccgram/events.jsonl` has its Stop/Notification/SessionStart events, tagged `session_id=2952377d-9acb…`, `window_id=None` (events carry sid only, not window). ✓
3. `state.json` WindowState `luci-persistent:@2` is correct: `session_id=2952377d…`, `external=true`. ✓
4. `session_resolver.find_users_for_session(sid)` iterates thread_bindings, matches WindowState.session_id → would resolve topic 1227. ✓
5. **ROOT CAUSE — `session_monitor.py:255-256`:**
```python
prefix = f"{config.tmux_session_name}:" # = "ccgram:"
return parse_session_map(raw, prefix)
```
`_load_current_session_map()` prefix-filters session_map to ONLY `ccgram:` keys. `_cleanup_all_stale_sessions` (264) and `_detect_and_cleanup_changes` (280) build `active_session_ids` from this filtered map, so foreign keys (`luci-persistent:@2`, `mc-root:@1`) are excluded → their session_ids are never in the tracked/active set → their transcripts are never tailed for content → inbound content is never routed. Native `ccgram:@N` windows (e.g. this session @30) match the prefix and deliver fine — which is why only foreign sessions are affected.
### Fix direction
Broaden the monitored set to include **explicitly thread-bound foreign/external sessions**, not just `{tmux_session_name}:`-prefixed ones. Options:
- (a) In `_load_current_session_map`, union the prefix-matched entries with any session_map entry whose window_id is present in `thread_router.iter_thread_bindings()` (i.e. a user is bound to it). emdash foreign windows already work this way — luci-persistent is the manually-bound foreign case the prefix filter misses.
- (b) Or have `parse_session_map` keep entries for windows with a live binding regardless of prefix.
Keep `make check` green (4500+ unit/integration tests); add a regression that a bound non-`ccgram:` session's new assistant message routes to its topic. Be careful NOT to start monitoring truly-dead sessions (see hygiene item).
### Secondary (hygiene, not the root cause)
`~/.ccgram/session_map.json` has **1590 entries**, most dead (`mc-MC-26xx:@N`, April-era, transcripts missing on disk). Prune entries whose transcript_path is gone AND window no longer exists. Reduces monitor scan cost + reverse-lookup noise.
### Constraints
- ccgram is the SOLE Telegram poller (MC-2617). Any change requires a careful `ccgram.service` restart — two pollers on one token = 409 + outage. Restart ccgram only after `make check` passes.
- File: `~/workspace/ccgram/src/ccgram/session_monitor.py` (+ `session_map.py` parse_session_map). Tier 2, dev-loop.
Activity
done
INTERACTIVE
Luci is working...
Details —
Done
· High
· Luci
▼
SState
Done
Closed
PPeople
TTiming / Details▼
api (human)
Mission Control
2d ago
2d ago
Advanced / Operator evidence
RRouting owner
ROperator console
Ticket is done; runtime is closed.MC is visibility-only. Hermes Luci launches and gates work outside MC, then mirrors evidence/status here.
WWorkflow
Start Dev Review + QA ▾
Select phases to include:
Agents
Review Gates
Decision
WAT routing: choose an agent, review gate, or decision. Buttons use the live runtime when one is attached.
CORRECTION — root cause was WRONG, do NOT implement the prefix-filter fix. (Luci, 2026-06-10)
I disproved my own diagnosis with a live wired probe + service logs:
1. Live probe (SessionManager wired from state.json, ran parse_session_map against the real session_map):
- `_bound_foreign_window_ids()` = ['luci-persistent:@2', 'mc-root:@1']
- parse_session_map("ccgram:") INCLUDES 'luci-persistent:@2' (session_id 2952377d) — total 6 live entries.
→ The foreign-session monitoring already works (parse_session_map has explicit bound-foreign handling; docstring even names luci-persistent). The prefix filter is NOT the bug.
2. All wiring verified correct: thread_binding topic1227→luci-persistent:@2, WindowState session_id=2952377d external=true, group_chat_ids['246672479:1227'] present, hooks firing into events.jsonl.
3. journalctl -u ccgram.service shows ccgram DOES act on luci-persistent:
`Updated topic emoji ... thread=1227 ... window_id=luci-persistent:@2` (status pushed to the topic).
ACTUAL evidence of the failure (same logs):
- `Rate limit hit after maximum of 5 retries [telegram.ext.AIORateLimiter]`
- `telegram.error.RetryAfter: Flood control exceeded. Retry in 3 seconds`
- repeated `Reset Telegram polling HTTP client after TimedOut/NetworkError ... window_id=luci-persistent:@2`
→ Real cause is Telegram FLOOD-CONTROL / rate-limiting + network timeouts: outbound sends are dropped after 5 retries. The high-volume orchestrator topic (luci-persistent) is the verbose one, so it loses the most messages while quieter topics deliver. Aggravated by very chatty topics burning the shared per-chat rate budget (status-emoji probes + every assistant message + tool-call traffic all compete).
REVISED fix direction (replaces the prefix-filter plan):
- Delivery resilience: on RetryAfter, honor the retry-after and re-queue rather than dropping after 5 fixed retries; per-chat send queue with backoff.
- Cut rate-limit pressure: reduce status-emoji/topic-probe frequency (they throw "Flood control exceeded" in the logs and burn quota that real messages need); coalesce rapid status updates.
- Investigate the polling-client ReadError/Timeout resets (network layer) separately.
Needs live observation to confirm which lever matters most; this is a delivery/rate-limit hardening task, NOT a routing change. Keep ticket open. Title should be retitled to reflect rate-limit/flood-control delivery loss.
luci-board-manager2d ago
[visibility-only] Controller routed the corrected Telegram RetryAfter/rate-limit root cause to internal Kanban card t_946a6a4f in isolated ccgram worktree /home/lucienne/workspace/_mc_internal_worktrees/MC-4940-telegram-rate-limit. This replaces the disproven prefix-filter lane; no human input is needed unless credentials/product policy become relevant. MC remains ledger-only; no MC pickup/runtime endpoints used.
luci-board-manager2d ago
[visibility-only] Controller consumed the blocked internal Kanban handoff t_946a6a4f and routed the ccgram RetryAfter fix review/landing/deploy gate to child card t_4aed9bc9. Verified worker is running as codexbuilder in isolated cwd /home/lucienne/workspace/_mc_internal_worktrees/MC-4940-telegram-rate-limit (pid 1104543). No MC runtime/pickup endpoints used; no human input needed unless credentials/product/spend become relevant.
luci-board-manager2d ago
[controller-route] Visibility-only MC: routed Luci-owned technical ticket to internal Kanban `mc-internal` card t_8af6080a (codexbuilder) in isolated worktree `/home/lucienne/workspace/_mc_internal_worktrees/MC-4940-ccgram-inbound`. Worker PID 877251 cwd verified before mirroring. MC will remain the ledger; no MC runtime/pickup was started. Controller will gate commit/tests/restart evidence before closing.
luci-board-manager2d ago
[controller-route] Visibility-only MC: parent Kanban card t_8af6080a produced review-required commit b8c2cdf. Routed controller review/landing gate to internal Kanban `t_d0ca4cc6` (claudebuilder) in the same isolated worktree `/home/lucienne/workspace/_mc_internal_worktrees/MC-4940-ccgram-inbound`. Dispatch status=running pid=913661 cwd=ERR:[Errno 2] No such file or directory: '/proc/913661/cwd'. No MC runtime/pickup was started. MC remains waiting on `kanban_active:t_d0ca4cc6` until review/landing evidence is gated.
luci-board-manager2d ago
[controller-route-repair] Internal Kanban review card `t_d0ca4cc6` did not keep a live claudebuilder PID, so the controller reclaimed/reassigned the same card to codexbuilder fallback and dispatched once. Verified task status=running pid=915756 cwd=/home/lucienne/workspace/_mc_internal_worktrees/MC-4940-ccgram-inbound. MC remains visibility-only and waiting on `kanban_active:t_d0ca4cc6`.
luci-board-manager2d ago
[controller-gate-pass] MC-4940 landed and deployed. Verified commit b8c2cdf from internal Kanban was cherry-picked to live ccgram branch feat/mc-4139-addtask as 930f2a22e467dbc2a93c8e32633e593515c3fd53; fork/feat pushed to the same SHA. Live validation: focused session-map pytest 11 passed, monitor pytest 37 passed, git diff --check passed, ruff passed. uv tool reinstall from /home/lucienne/workspace/ccgram succeeded; sudo systemctl restart ccgram.service succeeded; service active with MainPID 961669 and startup logs show ccgram 3.1.3.dev20, session monitor/status polling started. Rollback: reset live branch/tool to fe134b6dde94733a535a90a9bc15fb85e222bae5 and restart ccgram.service.
luci-board-manager2d ago
[controller-freeze] Wrong-root correction detected after the earlier MC-4940 landing: the latest system evidence says the prefix-filter diagnosis was wrong and points instead to Telegram flood/rate-limit handling. I am freezing this ticket rather than re-dispatching the same root cause loop. Next controller action is a fresh Luci-owned triage/repair lane for the rate-limit root and/or rollback decision; no human input is needed unless credentials or product policy become relevant. Human Inbox/backlog untouched; no MC runtime/pickup endpoints used.
luci-board-manager2d ago
[controller-gate-pass] Visibility-only gate closed MC-4940 after the corrected RetryAfter/rate-limit lane completed. Evidence verified this tick: internal Kanban card t_4aed9bc9 is done; live ccgram.service is active/running with MainPID 1111608; `/home/lucienne/.local/bin/ccgram --version` reports 3.1.3.dev21; the live ccgram checkout is at 4fc8b999c66af5a88f70018739268817a1a9109f (fix(ccgram): preserve queued sends on Telegram flood control). No MC pickup/runtime endpoints were used. Rollback note: revert ccgram to the previous known-good ref and restart ccgram.service if Telegram delivery regresses.