
Tailscale on Luci — Degraded UDP / DERP Relay Path

Date: 2026-05-07 SAST
Author: Luci (auto-diagnosed during session 2b2b56cb)
Status: Degraded but functional (DATA PLANE WORKS via DERP relay; UDP NAT traversal broken)
Affected: Phone PWA (MC, audio streaming, anything chatty over Tailscale Funnel)
Not affected: SSH from laptop ↔ Luci ↔ Larry (still works, lower-bandwidth use case tolerates DERP)


TL;DR

Server-side tailscaled cannot establish direct WireGuard UDP tunnels because its outbound STUN/IPv4 binding requests get no response. All phone traffic is therefore relayed through Tailscale's DERP-jnb relay. Phone↔server RTT ranges from 241ms to 2.1s with massive variance, which makes Mission Control on mobile feel broken (slow, stuttering audio, page loads >30s).

Workaround in place: hitting http://204.168.188.33:3001 (public IP, raw HTTP) bypasses Tailscale entirely. Confirmed fast on Elmar's phone 2026-05-07.

The permanent fix is a TLS-terminated direct path that doesn't depend on Tailscale (Caddy with Let's Encrypt, or Cloudflare Tunnel). It is on hold pending the monitoring task below: if degradation persists for 7+ days, we ship one of these options.

Symptoms (2026-05-07 SAST)

What we ruled out

| Hypothesis | Result |
| --- | --- |
| Server load / RAM / swap pressure | RAM 854Mi free, load 0.94, swap 7.1Gi free. Fine. |
| MC backend slow | localhost timings: /board 9ms, /api/board 1ms, / 50ms. Fine. |
| Stale service worker / PWA cache | Bumped CACHE_NAME → v33 earlier session. Hard reload did not fix. |
| ext4 / disk error | No kernel ext4 messages, write probes succeed. |
| ufw / iptables blocking | UFW inactive; outbound UDP socket-open succeeds. |
| Tailscale software bug | tailscaled restarted 2026-05-05 + 2026-05-07, no change. |
| Phone client misconfig | Phone IS on tailnet (elmars-s26-ultra active). Issue is at server's NAT-traversal layer. |

Likely root cause

Either the Hetzner cloud network silently filters or rate-limits the replies to the server's high-frequency STUN traffic routed via the gateway IP 172.31.1.1, or a recent Hetzner network change broke return-path symmetry for Tailscale's discovery. In both cases the observable behaviour is the same: outbound packets leave the server and replies don't return. The WireGuard data plane survives because it can fall back to DERP (TCP relay), but performance suffers.

We have no kernel-level evidence for what's blocking — no dmesg/iptables/route artefacts. Best guess is upstream Hetzner. A support ticket to Hetzner asking "is outbound UDP egress healthy from cloud server 204.168.188.33?" would help, but is a separate workstream.
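
To gather evidence for that ticket without relying on tailscaled, a single STUN binding request can be sent by hand to see whether any reply comes back. The following is a minimal sketch, assuming a reachable STUN server on the standard UDP port 3478; the hostname used in the usage note is a placeholder, and the function names are illustrative rather than part of any existing tooling on Luci.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442    # fixed value defined by the STUN spec (RFC 5389)
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid: bytes) -> bytes:
    """20-byte STUN header: type, body length 0, magic cookie, 12-byte transaction ID."""
    if len(txid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def parse_mapped_address(reply: bytes, txid: bytes):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(reply) < 20:
        return None
    msg_type, body_len, cookie = struct.unpack("!HHI", reply[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE or reply[8:20] != txid:
        return None
    pos, end = 20, min(len(reply), 20 + body_len)
    while pos + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", reply[pos:pos + 4])
        value = reply[pos + 4:pos + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            # Port and address are XOR-ed with the magic cookie on the wire.
            port = struct.unpack("!H", value[2:4])[0] ^ (MAGIC_COOKIE >> 16)
            addr = struct.unpack("!I", value[4:8])[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def stun_probe(host: str, port: int = 3478, timeout: float = 3.0):
    """Send one binding request; None means no usable reply arrived in time."""
    txid = os.urandom(12)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_binding_request(txid), (host, port))
            reply, _ = sock.recvfrom(2048)
        except OSError:  # covers timeouts and DNS failures
            return None
    return parse_mapped_address(reply, txid)
```

Calling `stun_probe("stun.example.net")` (placeholder host) from Luci and from a machine on a different network would show whether the silence is specific to this server. A `None` result from Luci alongside an `(ip, port)` tuple elsewhere would support the upstream-filtering guess, though it would not prove where the packets are dropped.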

Why phone gets the worst of it

The phone on cellular is behind CGNAT. Tailscale needs UDP hole-punching from BOTH ends to establish a direct path. With the server's UDP discovery broken, both ends fall back to DERP. DERP is TCP-based and routed through Tailscale's relay infrastructure (closest is Johannesburg). Each round trip adds ~30-200ms baseline plus jitter. PWAs make many small HTTP requests per page → latency stacks → unusable.

Laptop (when on home Wi-Fi, fixed IP, friendly NAT) tolerates the same DERP relay because SSH and similar low-frequency traffic doesn't notice the latency. Phone PWA does.
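
The stacking effect can be put into rough numbers. This is a back-of-envelope sketch only: the RTT values are the ones measured in this report, but the count of sequential round trips per page is a hypothetical figure chosen for illustration, not something measured from MC.

```python
def page_load_seconds(round_trips: int, rtt_ms: float) -> float:
    """Network wait time if the round trips happen strictly one after another."""
    return round_trips * rtt_ms / 1000.0


SEQUENTIAL_ROUND_TRIPS = 20  # hypothetical count, NOT measured from MC

for label, rtt_ms in [("best observed RTT", 241), ("worst observed RTT", 2100)]:
    seconds = page_load_seconds(SEQUENTIAL_ROUND_TRIPS, rtt_ms)
    print(f"{label}: {seconds:.1f}s")
```

Under that assumption the same page costs about 4.8s of pure waiting at 241ms and 42.0s at 2.1s, which is consistent with the >30s page loads seen on the phone. A single SSH keystroke, by contrast, pays the RTT only once.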

Workaround (in place 2026-05-07)

Use http://204.168.188.33:3001 directly from phone. MC binds 0.0.0.0:3001, public IP is reachable, no Tailscale dependency. Confirmed fast.

Caveats:

- No TLS: credentials/tokens travel in clear (token-auth via cookie still works once set).
- No PWA install (PWA/Service Worker requires HTTPS). Add to home screen still works in degraded form.
- Exposes MC publicly: anyone with the IP can probe. Currently MC has token-auth on all mutating endpoints; read endpoints are open.

Permanent fix candidates

| Option | Effort | TLS | DNS | Cost | Notes |
| --- | --- | --- | --- | --- | --- |
| A) Caddy + Let's Encrypt + own domain | 15 min | | needs A-record → 204.168.188.33 | free | Clean, standard. Picks up auto-renewing cert. Need a domain. |
| B) Caddy + duckdns/no-ip free subdomain | 10 min | | none (free dynamic DNS) | free | No domain needed. Subdomain looks ugly. |
| C) Cloudflare Tunnel | 10 min | ✅ at edge | none (cf hostname) | free | No inbound port opened. CF anycast = fast worldwide. Hides server IP. Recommended. |

All three coexist with Tailscale (they are additive: Tailscale stays for SSH/admin). C is recommended as the permanent fix because it requires no inbound port and survives if Hetzner's IP routing flakes.

Monitoring plan (filed alongside this report)

A scheduled task tailscale-watch will run every 30 min and:

  1. tailscale netcheck → parse UDP, IPv4, DERP fields
  2. tailscale ping -c 1 elmars-s26-ultra → check direct vs relayed
  3. Write JSON to ~/workspace/state/tailscale_health.json with rolling history
  4. If degraded state persists for >24h, fire one Telegram alert
  5. If degraded state persists for >7d (168h), update the parent MC ticket and recommend immediate switch to Cloudflare Tunnel
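
The steps above could be sketched roughly as follows. This is an illustration, not the deployed tailscale-watch task. It assumes that `tailscale netcheck --format=json` emits a report with a boolean `UDP` field and that a relayed `tailscale ping` reply contains the text `via DERP`; both should be verified against the installed version. The function names and the `fired` bookkeeping are hypothetical, and the `print` call stands in for the Telegram alert and ticket update.

```python
import json
import subprocess
import time
from pathlib import Path

STATE_FILE = Path.home() / "workspace" / "state" / "tailscale_health.json"
MAX_SAMPLES = 8 * 48         # about 8 days of 30-minute samples
ALERT_AFTER_H = 24
ESCALATE_AFTER_H = 168


def run(cmd):
    """Run a CLI command and return its stdout ('' if it could not run)."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=60).stdout
    except (OSError, subprocess.SubprocessError):
        return ""


def classify_ping(output: str) -> str:
    # Assumption: replies look like "pong from ... via DERP(jnb) in 241ms".
    if "pong" not in output:
        return "no-reply"
    return "relayed" if "via DERP" in output else "direct"


def take_sample(peer: str, now: float) -> dict:
    try:
        report = json.loads(run(["tailscale", "netcheck", "--format=json"]) or "{}")
    except json.JSONDecodeError:
        report = {}
    udp = bool(report.get("UDP"))  # assumed field name
    path = classify_ping(run(["tailscale", "ping", "-c", "1", peer]))
    return {"ts": now, "udp": udp, "path": path,
            "degraded": (not udp) or path == "relayed"}


def degraded_hours(history: list, now: float) -> float:
    """Hours since the current unbroken run of degraded samples began."""
    start = None
    for sample in reversed(history):
        if not sample["degraded"]:
            break
        start = sample["ts"]
    return 0.0 if start is None else (now - start) / 3600.0


def next_action(hours: float, fired: list):
    """Return 'alert', 'escalate' or None; each fires once per degraded run."""
    if hours > ALERT_AFTER_H and "alert" not in fired:
        return "alert"
    if hours > ESCALATE_AFTER_H and "escalate" not in fired:
        return "escalate"
    return None


def main(peer: str = "elmars-s26-ultra") -> None:
    now = time.time()
    if STATE_FILE.exists():
        state = json.loads(STATE_FILE.read_text())
    else:
        state = {"history": [], "fired": []}
    state["history"] = (state["history"] + [take_sample(peer, now)])[-MAX_SAMPLES:]
    hours = degraded_hours(state["history"], now)
    if hours == 0.0:
        state["fired"] = []  # healthy, or a new run just started: re-arm
    action = next_action(hours, state["fired"])
    if action:
        state["fired"].append(action)
        print(f"{action}: degraded for {hours:.0f}h")  # placeholder notifier
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))
```

One design choice worth noting: a ping with no reply (for example, the phone asleep) is recorded as `no-reply` and does not by itself count as degraded, so an offline phone cannot trigger the 24h alert while UDP is healthy.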

The problem resolves itself if the Hetzner network heals or tailscaled rediscovers a working path. Otherwise, the recorded history gives us evidence-based grounds to escalate and ship the permanent fix.