Add monitor watchdog supervisor + expanded IXP hardening recipe by lunarthegrey · Pull Request #2 · unredacted/unifi-scripts

lunarthegrey · 2026-04-26T05:40:30Z

Summary

lib/monitor-watchdog.sh: cron-invoked supervisor that detects stuck or dead monitors via heartbeat-file mtime and PID liveness, then sends SIGKILL and restarts them. It lives in cron's process tree (outside ours), so it survives whatever pauses our user-space monitors.
Heartbeats in all three existing monitors (switch-config-monitor, neighbor-poll-wrapper-monitor, rules-monitor). The first two touch /var/run/<name>.heartbeat at the top of their main loops; the event-driven rules-monitor gets a dedicated 10s ticker subprocess.
MOUNT LOST logging in apply-neighbor-poll-wrapper.sh. Re-mounts after loss now emit !!! MOUNT LOST !!! plus an ISO-8601 UTC timestamp, and a daemon.warning syslog with tag neighbor-poll-wrapper. (Reapplied from prior abandoned branch.)
Expanded IXP hardening recipe in rules/conf/example.conf covering mcast_solicit, FORWARD-to-IX-subnet drops, IPv6 autoconf/RS suppression, and global arp_filter/arp_announce per SIX's published Linux guide.

Why

Production log evidence on edge1-mci1-net showed neighbor-poll-wrapper-monitor going silent for a contiguous 2.5-hour window (2026-04-24 02:38Z → 05:08Z) without any error, restart sequence, or "Killing old instance" line. PID was still in the pidfile but the process wasn't making progress. Shape is consistent with SIGSTOP from a UBIOS subsystem during config sync, later resumed by SIGCONT.

While the monitor was paused:

The arping bind-mount drifted off (UBIOS restored /usr/sbin/arping from package — observed 8 times over 9 days; ndisc6 mount never lost).
ubios-udapi-server's nl-neighbors-poll started hammering the real arping at ~44 calls/min on the SIX bridge.
The IX NOC saw a 411 pps broadcast / 7,000 pps multicast spike around the same window and administratively filtered the router off the fabric.

The existing monitors all share the same vulnerability: they are launched once from /data/on_boot.d, nothing supervises them, and their internal kill <old_pid> cleanup uses SIGTERM, which stays pending on a SIGSTOP'd process until that process receives SIGCONT. This PR addresses both gaps.
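The SIGTERM-versus-SIGKILL behavior on a stopped process can be reproduced with a throwaway `sleep`. This is a minimal demonstration, not code from the PR; the variable names are illustrative, and the one-second pauses are there only to let the stop take effect before the next signal is sent.

```shell
#!/bin/sh
# Demonstration only: a stopped process keeps SIGTERM pending,
# while SIGKILL acts immediately.

sleep 300 &
pid=$!

kill -STOP "$pid"        # freeze the process, as suspected in the incident
sleep 1                  # let the stop take effect before sending more signals
kill -TERM "$pid"        # stays pending; not acted on while the process is stopped
sleep 1

if kill -0 "$pid" 2>/dev/null; then
    term_result="still-present"
else
    term_result="gone"
fi

kill -KILL "$pid"        # SIGKILL cannot be blocked, caught, or deferred
wait "$pid" 2>/dev/null  # reap the child so the PID disappears
if kill -0 "$pid" 2>/dev/null; then
    kill_result="still-present"
else
    kill_result="gone"
fi

echo "after SIGTERM: $term_result"
echo "after SIGKILL: $kill_result"
```

This is why the in-monitor "kill old instance" path cannot clear a frozen predecessor, and why the watchdog escalates straight to SIGKILL.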

Failure modes addressed

| Mode | Mechanism that catches it |
| --- | --- |
| Monitor process dies (segfault, OOM kill, etc.) | Watchdog: PID liveness check |
| Monitor `SIGSTOP`'d but PID still valid | Watchdog: heartbeat-mtime check |
| Wrapper bind-mount removed by UBIOS | Existing apply-script, now with loud logging on recovery |
| Kernel NUD broadcasting at 3 probes/cycle for unreachable IX peers | Documented `mcast_solicit=1` in example.conf for the host conf to adopt |
| Forwarded transit traffic leaking into IX subnet | Documented `iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP` |

What this does NOT fix yet

Root cause of the SIGSTOP/SIGCONT — we treat the symptom (monitor not making progress) rather than the cause. Phase-2 work is to capture /proc/<pid>/status transitions over time via the watchdog and correlate with UBIOS events.
Log rotation — all monitors and now the watchdog write unbounded log files. Separate PR.
Sub-10s recovery latency: with this design the worst case is max_silence_sec (60-90s) plus one 60s cron tick; systemd units with WatchdogSec= would do better. That is Phase-2 work once we've validated the systemd-on-UBIOS path.

Test plan

Unit: dry-run the watchdog with a mock registry pointing at a test monitor that intentionally doesn't heartbeat → verify detection and restart logging
Integration on edge1-mci1-net (currently SIX-blocked, low risk):
- Deploy via Ansible
- kill -STOP $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect watchdog to detect and restart within max_silence_sec plus one cron tick (roughly 2 to 2.5 minutes worst case)
- kill -KILL $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect restart within ~60s (next cron tick)
- Verify journalctl -t monitor-watchdog --since "5 minutes ago" shows the warning line
- 48-hour soak: re-pull /var/log/neighbor-poll-wrapper-monitor.log and confirm zero gaps >90s
Verify cron entry: cat /etc/cron.d/unifi-scripts-monitor-watchdog
Verify watchdog self-heartbeat: stat /var/run/monitor-watchdog.heartbeat mtime advances every minute
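For context on the cron-entry check above, here is a hedged sketch of what an idempotent installer could look like. It is not the contents of lib/install-monitor-watchdog-cron.sh: the function name, the `root` user field, and the script path passed in are assumptions. Only the every-minute schedule (the 60s cron tick) and the cron file name come from this PR.

```shell
#!/bin/sh
# Sketch only: function name, user field, and script path are assumptions.

install_watchdog_cron() {
    cron_file=$1       # e.g. /etc/cron.d/unifi-scripts-monitor-watchdog
    watchdog_path=$2   # hypothetical install location of monitor-watchdog.sh
    desired="* * * * * root $watchdog_path"

    # Idempotent: leave the file alone when it already holds the desired entry.
    if [ -f "$cron_file" ] && [ "$(cat "$cron_file")" = "$desired" ]; then
        echo "unchanged"
        return 0
    fi

    # Write to a temp file and rename, so cron never reads a half-written entry.
    tmp="$cron_file.tmp.$$"
    printf '%s\n' "$desired" > "$tmp" && mv "$tmp" "$cron_file"
    echo "installed"
}
```

Running it twice with the same arguments should report `installed` and then `unchanged`, which is the property the test plan's `cat /etc/cron.d/unifi-scripts-monitor-watchdog` check relies on.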

Commits

c874df6 lib: add cron-based monitor watchdog with heartbeat-staleness detection
731488e monitors: emit heartbeat file each cycle for monitor-watchdog
efed0f6 neighbor-poll-wrapper: loudly log and syslog wrapper bind-mount loss
f1992a8 docs: expand IXP hygiene recipe; document lib/monitor-watchdog

Each is reviewable in isolation.

🤖 Generated with Claude Code

Production log evidence on edge1-mci1-net showed the neighbor-poll-wrapper-monitor daemon going silent for a contiguous 2.5-hour window without leaving any error, restart sequence, or "Killing old instance" line in its log. The shape of the gap is consistent with the process being SIGSTOP'd by some UBIOS subsystem during config sync and later SIGCONT'd, rather than killed and restarted.

While the monitor was stopped, its 30s-cadence cycle didn't run, the arping bind-mount drifted off, and the SIX peering fabric saw a storm, which contributed to SIX administratively filtering the router off the fabric. The same vulnerability exists in switch-config-monitor and rules-monitor: all three background themselves once at boot and nothing supervises them.

This commit introduces lib/, a shared infrastructure directory:

* lib/monitor-watchdog.sh: the supervisor itself
* lib/install-monitor-watchdog-cron.sh: idempotent cron installer
* lib/monitors.conf: registry of supervised monitors (5 fields per row: name, pidfile, heartbeat file, max_silence_sec, launcher)

Why cron and not systemd: cron lives in the system service set, runs outside our process tree, and survives the same UBIOS reprovisioning cycles that silently SIGSTOP'd our user-space monitors. /etc/cron.d entries also persist across UBIOS firmware upgrades more reliably than /etc/systemd/system additions. A future Phase-2 migration to systemd units with WatchdogSec= would improve recovery latency from ~2min to ~10s, but cron is the lowest-friction path that works today on every UBIOS version we've tested.

How it detects stuck (not just dead): Each supervised monitor touches a heartbeat file every cycle. The watchdog reads its mtime and treats anything older than max_silence_sec as stuck, which catches SIGSTOP'd processes whose PID is still valid but which aren't making progress. PID liveness alone wouldn't catch this case.

How it kills (vs the existing in-monitor "kill old instance"): Our existing monitors send SIGTERM to old instances. SIGTERM cannot be delivered to a SIGSTOP'd process; it sits queued until a SIGCONT. The watchdog uses SIGKILL, which is always delivered, even to stopped processes.

Worst-case recovery latency: max_silence_sec (60-90s) + 60s cron tick ≈ 2min vs the unbounded gap with no supervisor.

Heartbeat plumbing in the existing monitors lands in a follow-up commit so the diff for each monitor stays narrowly reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
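The dead-versus-stuck decision described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the contents of lib/monitor-watchdog.sh: the function names and the `ok`/`dead`/`stuck` strings are hypothetical, `stat -c %Y` assumes GNU or BusyBox `stat`, and parsing of lib/monitors.conf is left out because its delimiter is not described here.

```shell
#!/bin/sh
# Sketch only: function names and return strings are hypothetical.

# Print a file's mtime as epoch seconds, or 0 if it does not exist,
# so that a missing heartbeat counts as infinitely stale.
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || echo 0
}

# Classify one supervised monitor as "ok", "dead", or "stuck".
monitor_state() {
    pidfile=$1
    heartbeat=$2
    max_silence_sec=$3

    pid=$(cat "$pidfile" 2>/dev/null)
    # kill -0 sends no signal; it only tests whether the PID exists.
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        echo "dead"
        return
    fi

    now=$(date +%s)
    age=$(( now - $(file_mtime "$heartbeat") ))
    if [ "$age" -gt "$max_silence_sec" ]; then
        echo "stuck"      # PID alive but no progress, e.g. SIGSTOP'd
    else
        echo "ok"
    fi
}

# Kill any old instance and relaunch; SIGKILL works even on a stopped process.
restart_monitor() {
    pidfile=$1
    launcher=$2
    pid=$(cat "$pidfile" 2>/dev/null)
    [ -n "$pid" ] && kill -KILL "$pid" 2>/dev/null
    "$launcher"
}
```

A cron-driven caller would loop over the registry rows, call `monitor_state` for each, and invoke `restart_monitor` for anything not reported as `ok`.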

Adds a per-monitor heartbeat file at /var/run/<name>.heartbeat that gets touched on every loop iteration. lib/monitor-watchdog.sh reads its mtime to detect a stuck monitor (PID alive but not making progress, e.g. SIGSTOP'd by UBIOS during config sync) and restart it.

Per monitor:

* switch-config-monitor: touch at top of main while-loop
* neighbor-poll-wrapper-monitor: same pattern
* rules-monitor: needs a dedicated 10s ticker subprocess because it's event-driven (event-only heartbeats wouldn't fire on hosts where the kernel/netlink/sysctl events are quiet, leaving the watchdog to mistake a healthy idle monitor for a stuck one). The ticker runs in its own subshell inside _monitor() so it dies if the parent dies.

The heartbeat is touched BEFORE running the apply script so a slow or hung apply doesn't itself trip the watchdog; we only care about the monitor loop being alive, not about apply being fast. If apply is genuinely deadlocked the heartbeat will eventually go stale on the *next* iteration that never starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
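The two heartbeat patterns described above might look roughly like this. It is a sketch, not the monitors' actual code: `HEARTBEAT`, `poll_loop`, `start_ticker`, and the bounded iteration count are all hypothetical, and the explicit `kill -0` parent check is one assumed way to make the ticker exit once the parent is gone.

```shell
#!/bin/sh
# Sketch only: names, paths, and the apply step are placeholders.

HEARTBEAT=${HEARTBEAT:-/tmp/example-monitor.heartbeat}

# Polling monitors: touch first, then do the (possibly slow) work.
poll_loop() {
    interval=$1
    iterations=$2   # bounded so the sketch terminates; a real loop runs forever
    i=0
    while [ "$i" -lt "$iterations" ]; do
        touch "$HEARTBEAT"       # heartbeat before apply, as described above
        : run-apply-script-here  # placeholder for the real apply step
        sleep "$interval"
        i=$((i + 1))
    done
}

# Event-driven monitor: background ticker that exits once the parent is gone.
start_ticker() {
    interval=$1
    parent=$$
    (
        while kill -0 "$parent" 2>/dev/null; do
            touch "$HEARTBEAT"
            sleep "$interval"
        done
    ) &
    TICKER_PID=$!
}
```

An event-driven monitor would call `start_ticker 10` once before blocking on its event source, matching the 10s cadence mentioned in the commit message.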

Re-applies the change from the prior abandoned branch (commit 9bbb7de on fix/multiline-interface-block-truncation, never merged).

When apply-neighbor-poll-wrapper.sh re-establishes a bind-mount that had previously been in place (evidenced by the .real backup file existing and being populated), the recovery is logged with a loud "!!! MOUNT LOST !!!" prefix and an ISO-8601 UTC timestamp, AND sent to syslog at daemon.warning priority with tag neighbor-poll-wrapper.

This makes mount-loss events findable via journalctl and lets us correlate them against UBIOS provisioning events, instead of being indistinguishable from the routine "Bind-mounted ... -> /usr/sbin/arping" line that fires on first install.

In production logs we previously saw 8 arping mount-loss events over 9 days on edge1-mci1-net while ndisc6 was never lost, strongly suggesting UBIOS specifically restores /usr/sbin/arping (which it calls directly from nl-neighbors-poll) but not /usr/bin/ndisc6. The new alert lines will let us pin down the responsible UBIOS action the next time it happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
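A minimal sketch of the logging path described above, assuming hypothetical function names and message wording; only the "!!! MOUNT LOST !!!" prefix, the UTC ISO-8601 timestamp, the daemon.warning priority, and the neighbor-poll-wrapper tag come from this commit message. The location of the .real backup is not stated here, so it is passed in as a parameter.

```shell
#!/bin/sh
# Sketch only: function names and the exact message layout are assumptions.

# True when the .real backup exists and is non-empty, i.e. the wrapper
# had been installed before and this is a re-mount rather than a first install.
was_previously_mounted() {
    [ -s "$1" ]    # $1: path to the .real backup file (location assumed)
}

log_mount_lost() {
    target=$1                             # e.g. /usr/sbin/arping
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)     # ISO-8601, UTC
    msg="!!! MOUNT LOST !!! $ts bind-mount on $target was missing; re-established"

    echo "$msg"
    # Send to syslog when logger is available; ignore failures (e.g. no /dev/log).
    if command -v logger >/dev/null 2>&1; then
        logger -p daemon.warning -t neighbor-poll-wrapper "$msg" 2>/dev/null || true
    fi
}
```

With the tag in place, the events can be pulled with `journalctl -t neighbor-poll-wrapper`, the same style of query the test plan uses for monitor-watchdog.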

rules/conf/example.conf: Replaces the brief "IXP / peering fabric hygiene" section with a comprehensive recipe drawn from SIX's Linux configuration guide and our own incident analysis. All directives are commented out by default; each is documented with the specific failure mode it prevents.

New directives covered:

• net.ipvX.neigh.<bridge>.mcast_solicit=1
  Limits broadcast ARP/ND probes per resolution attempt to 1 instead of the kernel default of 3. Critical for IX bridges when peers go unreachable: the kernel keeps re-probing in a STALE→DELAY→PROBE→broadcast cycle, and the default count of 3 produces ~3× the broadcast volume per failed peer per cycle. Setting to 0 silences the kernel completely; useful as an emergency knob during IX-side filtering, but it breaks initial peer discovery, so populate static neigh entries before relying on it long-term.

• iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP
  Per IX policy, only the router's IX-assigned IP may originate packets toward the IX subnet. Prevents accidental transit leaks (e.g. a customer getting a default route via this router) from reaching the fabric.

• net.ipv6.conf.<bridge>.autoconf=0
• net.ipv6.conf.<bridge>.router_solicitations=0
  Stop the bridge from sending IPv6 RS multicasts on link-up and from autoconfiguring addresses out of RAs sent by random peers. IX routers know their own addressing.

• net.ipv4.conf.all.arp_filter=1
• net.ipv4.conf.all.arp_announce=2
  Multi-homed ARP hygiene. Without these, an ARP request arriving on the IX bridge can be answered with a MAC for an IP on a different interface, a common source of IX-fabric ACL violations. Per SIX's published Linux guide.

README.md: New "lib/ — Shared infrastructure (monitor watchdog)" section explaining the supervisor architecture, the heartbeat protocol the monitors implement, and how to install via install-monitor-watchdog-cron.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
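To make the recipe concrete without touching a live host, the directives can be expanded for a given bridge and printed rather than applied. This is a dry-run sketch with a hypothetical function name; it does not reflect the syntax of example.conf, and it reads net.ipvX as covering both the ipv4 and ipv6 neigh trees, which is an assumption.

```shell
#!/bin/sh
# Sketch only: prints commands instead of running them; function name is hypothetical.

print_ix_hardening() {
    bridge=$1      # IX-facing bridge (the brXXXX placeholder in the recipe)
    ix_subnet=$2   # IX peering LAN prefix (the <IX-subnet> placeholder)

    # One broadcast/multicast probe per resolution attempt (kernel default is 3).
    echo "sysctl -w net.ipv4.neigh.$bridge.mcast_solicit=1"
    echo "sysctl -w net.ipv6.neigh.$bridge.mcast_solicit=1"
    # Never forward transit traffic into the IX subnet.
    echo "iptables -A FORWARD -o $bridge -d $ix_subnet -j DROP"
    # No address autoconfiguration and no router solicitations on the IX bridge.
    echo "sysctl -w net.ipv6.conf.$bridge.autoconf=0"
    echo "sysctl -w net.ipv6.conf.$bridge.router_solicitations=0"
    # Multi-homed ARP hygiene.
    echo "sysctl -w net.ipv4.conf.all.arp_filter=1"
    echo "sysctl -w net.ipv4.conf.all.arp_announce=2"
}
```

For example, `print_ix_hardening br0 192.0.2.0/24` (a documentation prefix, not a real IX LAN) emits seven commands that can be reviewed before anything is applied. An IPv6 IX prefix would need the equivalent ip6tables rule, which this sketch does not print.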

A peer review of the IX hardening recipe correctly identified that the example.conf directives, while useful, are not the canonical trust boundary for IX-policy compliance. The hardware-enforced floor is a MAC ACL on the upstream UniFi aggregation switch: three rules per IX VLAN that catch every broadcast/multicast egress from the router's MAC at the switch's forwarding ASIC, bypassing every software failure mode the EFG could exhibit.

This commit:

* Documents the canonical ACL shape inline at the top of the IXP hygiene section, including the broadened IPv6 multicast pattern (33:33:00:00:00:00 / 00:00:ff:ff:ff:ff) which catches RS, RA, MLD, and NS, not just NS as previously documented.
* Adds a third recommended rule for L2 control multicast (01:80:c2:00:00:00/0f) covering STP, LACP, LLDP, and EAPOL.
* Reframes the existing inject-rules.conf directives as "EFG-internal hygiene" / "belt-and-suspenders" rather than the primary defense, and notes they are NOT load-bearing for IX compliance once the ACL is in place.

This does not change any directive's behavior; it changes how operators should think about the layering. The agg-switch ACL is the trust boundary; everything in inject-rules.conf is optimization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A second peer review correctly observed that the three-rule canonical recipe leaves IPv4 multicast (01:00:5E:*) and Cisco/UBNT proprietary multicast (01:00:0C:*) uncaught. Neither should occur in steady state on an IX-facing port from a UBIOS gateway, but two production-plausible scenarios make them worth blocking explicitly:

1. Future UBIOS firmware enabling mDNS/SSDP/UBNT-DISC on every bridge (the same auto-enable behavior we've observed for arping mount restoration); operators learn about it from an IX shutdown notice rather than a release note.
2. A misconfigured VRRP, OSPF, or PIM group on the IX bridge leaking hello packets.

Both rules cost essentially nothing in switch TCAM space and turn the ACL from "covers what we know breaks today" into "covers anything that's not unicast IP from our MAC", the same property that made the existing three rules correct.

Updated documentation includes the rationale so future operators extending this to new IX deployments understand why the rules look broader than strictly necessary for any single observed bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
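As a way to reason about what the five-rule ACL is meant to catch, the destination-MAC patterns named in these two commit messages can be expressed as a small classifier. This is purely illustrative and is not UniFi ACL syntax; the function name and labels are hypothetical, and treating the first canonical rule as the all-ones broadcast address is an assumption, since the commit text does not spell that rule out.

```shell
#!/bin/sh
# Sketch only: illustrates the MAC patterns, not how the switch ACL is configured.

classify_dst_mac() {
    # Normalize to lowercase so 01:00:5E and 01:00:5e compare equal.
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case $mac in
        ff:ff:ff:ff:ff:ff)        echo "broadcast" ;;        # assumed first rule
        33:33:*)                  echo "ipv6-multicast" ;;   # RS, RA, MLD, NS
        01:80:c2:00:00:0[0-9a-f]) echo "l2-control" ;;       # STP, LACP, LLDP, EAPOL
        01:00:5e:*)               echo "ipv4-multicast" ;;
        01:00:0c:*)               echo "cisco-multicast" ;;
        *)                        echo "not-matched" ;;      # e.g. ordinary unicast
    esac
}
```

Any frame from the router's MAC that lands in one of the first five branches is what the ACL is intended to drop; `not-matched` corresponds to the traffic that should pass.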

lunarthegrey and others added 6 commits April 26, 2026 00:39

lunarthegrey wants to merge 6 commits into main from feat/monitor-watchdog-and-ix-hardening
