Add monitor watchdog supervisor + expanded IXP hardening recipe #2

Open
lunarthegrey wants to merge 6 commits into main from feat/monitor-watchdog-and-ix-hardening

@lunarthegrey

Summary

  • lib/monitor-watchdog.sh: cron-invoked supervisor that detects stuck or dead monitors via heartbeat-file mtime + PID liveness, then sends SIGKILL and restarts them. Lives in cron's process tree (outside ours) so it survives whatever pauses our user-space monitors.
  • Heartbeats in all three existing monitors (rules-monitor, switch-config-monitor, neighbor-poll-wrapper-monitor). The latter two touch /var/run/<name>.heartbeat at the top of their main loops; the event-driven rules-monitor gets a dedicated 10s ticker subprocess.
  • MOUNT LOST logging in apply-neighbor-poll-wrapper.sh. Re-mounts after loss now emit !!! MOUNT LOST !!! plus an ISO-8601 UTC timestamp, and a daemon.warning syslog with tag neighbor-poll-wrapper. (Reapplied from prior abandoned branch.)
  • Expanded IXP hardening recipe in rules/conf/example.conf covering mcast_solicit, FORWARD-to-IX-subnet drops, IPv6 autoconf/RS suppression, and global arp_filter/arp_announce per SIX's published Linux guide.

Why

Production log evidence on edge1-mci1-net showed neighbor-poll-wrapper-monitor going silent for a contiguous 2.5-hour window (2026-04-24 02:38Z → 05:08Z) without any error, restart sequence, or "Killing old instance" line. PID was still in the pidfile but the process wasn't making progress. Shape is consistent with SIGSTOP from a UBIOS subsystem during config sync, later resumed by SIGCONT.

While the monitor was paused:

  1. The arping bind-mount drifted off (UBIOS restored /usr/sbin/arping from package — observed 8 times over 9 days; ndisc6 mount never lost).
  2. ubios-udapi-server's nl-neighbors-poll started hammering the real arping at ~44 calls/min on the SIX bridge.
  3. The IX NOC saw a 411 pps broadcast / 7,000 pps multicast spike around the same window and administratively filtered the router off the fabric.

The existing monitors all have the same vulnerability: they are launched once from /data/on_boot.d, nothing supervises them, and their internal kill <old_pid> cleanup uses SIGTERM, which stays pending on a SIGSTOP'd process until it receives SIGCONT. This PR addresses both gaps: the missing supervision and the ineffective kill.

Failure modes addressed

| Mode | Mechanism that catches it |
| --- | --- |
| Monitor process dies (segfault, OOM kill, etc.) | Watchdog: PID liveness check |
| Monitor SIGSTOP'd but PID still valid | Watchdog: heartbeat-mtime check |
| Wrapper bind-mount removed by UBIOS | Existing apply-script, now with loud logging on recovery |
| Kernel NUD broadcasting at 3 probes/cycle for unreachable IX peers | Documented mcast_solicit=1 in example.conf for the host conf to adopt |
| Forwarded transit traffic leaking into IX subnet | Documented iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP |
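
The first two rows can be sketched as a single classification step. This is a minimal illustration, not the code in lib/monitor-watchdog.sh: the function names `mtime_of` and `monitor_state` are hypothetical, and it assumes the watchdog runs as root so that `kill -0` fails only when the PID is really gone.

```shell
#!/bin/sh
# Hypothetical helper: classify one monitor as "dead", "stuck", or "ok".
# Arguments: pidfile, heartbeat file, max_silence_sec.

mtime_of() {
    # GNU/busybox stat first, BSD stat as a fallback.
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null
}

monitor_state() {
    pidfile=$1 heartbeat=$2 max_silence=$3

    # Dead: no pidfile, empty pidfile, or the PID no longer exists.
    pid=$(cat "$pidfile" 2>/dev/null) || { echo dead; return; }
    [ -n "$pid" ] || { echo dead; return; }
    kill -0 "$pid" 2>/dev/null || { echo dead; return; }

    # Stuck: the PID exists but the heartbeat is missing or too old.
    mtime=$(mtime_of "$heartbeat") || { echo stuck; return; }
    now=$(date +%s)
    if [ $((now - mtime)) -gt "$max_silence" ]; then
        echo stuck
    else
        echo ok
    fi
}
```

A caller would SIGKILL and relaunch on "stuck", and only relaunch on "dead".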

What this does NOT fix yet

  • Root cause of the SIGSTOP/SIGCONT — we treat the symptom (monitor not making progress) rather than the cause. Phase-2 work is to capture /proc/<pid>/status transitions over time via the watchdog and correlate with UBIOS events.
  • Log rotation — all monitors and now the watchdog write unbounded log files. Separate PR.
  • Sub-10s recovery latency: max_silence_sec (60-90s) + 60s cron tick is the floor here; systemd units with WatchdogSec= would do better. Phase-2 work once we've validated the systemd-on-UBIOS path.

Test plan

  • Unit: dry-run the watchdog with a mock registry pointing at a test monitor that intentionally doesn't heartbeat → verify detection and restart logging
  • Integration on edge1-mci1-net (currently SIX-blocked, low risk):
    • Deploy via Ansible
    • kill -STOP $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect watchdog to detect within 90s and restart
    • kill -KILL $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect restart within ~60s (next cron tick)
    • Verify journalctl -t monitor-watchdog --since "5 minutes ago" shows the warning line
    • 48-hour soak: re-pull /var/log/neighbor-poll-wrapper-monitor.log and confirm zero gaps >90s
  • Verify cron entry: cat /etc/cron.d/unifi-scripts-monitor-watchdog
  • Verify watchdog self-heartbeat: stat /var/run/monitor-watchdog.heartbeat mtime advances every minute

Commits

  1. c874df6 lib: add cron-based monitor watchdog with heartbeat-staleness detection
  2. 731488e monitors: emit heartbeat file each cycle for monitor-watchdog
  3. efed0f6 neighbor-poll-wrapper: loudly log and syslog wrapper bind-mount loss
  4. f1992a8 docs: expand IXP hygiene recipe; document lib/monitor-watchdog

Each is reviewable in isolation.

🤖 Generated with Claude Code

lunarthegrey and others added 6 commits April 26, 2026 00:39
Production log evidence on edge1-mci1-net showed the
neighbor-poll-wrapper-monitor daemon going silent for a contiguous
2.5-hour window without leaving any error, restart sequence, or
"Killing old instance" line in its log.  The shape of the gap is
consistent with the process being SIGSTOP'd by some UBIOS subsystem
during config sync and later SIGCONT'd, rather than killed and
restarted.  While the monitor was stopped, its 30s-cadence cycle
didn't run, the arping bind-mount drifted off, and the SIX peering
fabric saw a storm — which contributed to SIX administratively
filtering the router off the fabric.

The same vulnerability exists in switch-config-monitor and
rules-monitor: all three background themselves once at boot and
nothing supervises them.

This commit introduces lib/, a shared infrastructure directory:

  * lib/monitor-watchdog.sh              — the supervisor itself
  * lib/install-monitor-watchdog-cron.sh — idempotent cron installer
  * lib/monitors.conf                    — registry of supervised
                                            monitors (5 fields per
                                            row: name, pidfile,
                                            heartbeat file,
                                            max_silence_sec, launcher)
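
A registry with these five fields could be iterated roughly as follows. The row format is an assumption made for illustration (pipe-delimited, `#` comments); the real lib/monitors.conf may be laid out differently, and `each_monitor` and `print_row` are hypothetical names.

```shell
#!/bin/sh
# Assumed row format (hypothetical, the real monitors.conf may differ):
#   name|pidfile|heartbeat|max_silence_sec|launcher

each_monitor() {
    # Calls "$1" once per registry row with the five fields as arguments.
    callback=$1 registry=$2
    while IFS='|' read -r name pidfile heartbeat max_silence launcher; do
        case $name in ''|'#'*) continue ;; esac   # skip blanks and comments
        "$callback" "$name" "$pidfile" "$heartbeat" "$max_silence" "$launcher"
    done < "$registry"
}

print_row() {
    printf '%s: silent for more than %ss => run %s\n' "$1" "$4" "$5"
}
```

Putting the launcher last lets it contain spaces without extra quoting, since `read` assigns the remainder of the line to the final variable.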

Why cron and not systemd:
  cron lives in the system service set, runs outside our process
  tree, and survives the same UBIOS reprovisioning cycles that
  silently SIGSTOP'd our user-space monitors.  /etc/cron.d entries
  also persist across UBIOS firmware upgrades more reliably than
  /etc/systemd/system additions.  A future Phase-2 migration to
  systemd units with WatchdogSec= would improve recovery latency from
  ~2min to ~10s, but cron is the lowest-friction path that works
  today on every UBIOS version we've tested.
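
For illustration, an idempotent installer for such an entry might look like the sketch below. It is not the contents of lib/install-monitor-watchdog-cron.sh; `install_cron_entry` is a hypothetical name and the once-a-minute schedule is taken from the 60s cron tick mentioned in this PR.

```shell
#!/bin/sh
# install_cron_entry TARGET COMMAND
# Writes a once-a-minute /etc/cron.d-style entry (which carries a user
# field), but only rewrites TARGET when the content actually changed.
install_cron_entry() {
    target=$1 cmd=$2
    staged=$(mktemp) || return 1
    printf '* * * * * root %s\n' "$cmd" > "$staged"

    if [ -f "$target" ] && cmp -s "$staged" "$target"; then
        rm -f "$staged"
        echo unchanged
        return 0
    fi

    chmod 644 "$staged"
    mv "$staged" "$target"
    echo installed
}
```

Running it twice with the same arguments prints `installed` and then `unchanged`, which keeps repeated boot-time or Ansible runs from touching the file needlessly.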

How it detects stuck (not just dead):
  Each supervised monitor touches a heartbeat file every cycle.  The
  watchdog reads its mtime and treats anything older than
  max_silence_sec as stuck — which catches SIGSTOP'd processes whose
  PID is still valid but which aren't making progress.  PID liveness
  alone wouldn't catch this case.

How it kills (vs the existing in-monitor "kill old instance"):
  Our existing monitors send SIGTERM to old instances.  SIGTERM
  cannot be delivered to a SIGSTOP'd process; it sits queued until a
  SIGCONT.  The watchdog uses SIGKILL, which is always delivered,
  even to stopped processes.
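
The difference is easy to reproduce with a throwaway process. The script below is a standalone demonstration under the assumption of standard POSIX signal semantics; it uses `sleep 300` as a stand-in for a monitor and is not part of the watchdog.

```shell
#!/bin/sh
# Stand-in for a monitor process.
sleep 300 &
pid=$!

kill -STOP "$pid"
sleep 1                    # let the stop take effect before sending SIGTERM
kill -TERM "$pid"          # stays pending while the process is stopped
sleep 1
if kill -0 "$pid" 2>/dev/null; then after_term=alive; else after_term=gone; fi

kill -KILL "$pid"          # delivered even to a stopped process
wait "$pid" 2>/dev/null || true
if kill -0 "$pid" 2>/dev/null; then after_kill=alive; else after_kill=gone; fi

echo "after SIGTERM: $after_term, after SIGKILL: $after_kill"
```

The stopped process should survive the SIGTERM and disappear only after the SIGKILL.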

Worst-case recovery latency:
  max_silence_sec (60-90s) + 60s cron tick ≈ 2-2.5min vs the unbounded
  gap with no supervisor.

Heartbeat plumbing in the existing monitors lands in a follow-up
commit so the diff for each monitor stays narrowly reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a per-monitor heartbeat file at /var/run/<name>.heartbeat that
gets touched on every loop iteration.  lib/monitor-watchdog.sh reads
its mtime to detect a stuck monitor (PID alive but not making
progress, e.g. SIGSTOP'd by UBIOS during config sync) and restart it.

Per monitor:

  * switch-config-monitor — touch at top of main while-loop
  * neighbor-poll-wrapper-monitor — same pattern
  * rules-monitor — needs a dedicated 10s ticker subprocess because
    it's event-driven (event-only heartbeats wouldn't fire on hosts
    where the kernel/netlink/sysctl events are quiet, leaving the
    watchdog to mistake a healthy idle monitor for a stuck one).
    The ticker runs in its own subshell inside _monitor() so it dies
    if the parent dies.
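
A ticker of this kind can be sketched as a background subshell. The function names are hypothetical and this is not the code inside _monitor(); stopping the ticker explicitly (for example from an EXIT trap) is an assumption of this sketch rather than something the commit describes.

```shell
#!/bin/sh
# start_heartbeat_ticker FILE INTERVAL_SEC
# Touches FILE every INTERVAL_SEC seconds from a background subshell and
# records the subshell's PID in $ticker_pid.
start_heartbeat_ticker() {
    hb_file=$1 hb_interval=$2
    (
        while :; do
            touch "$hb_file"
            sleep "$hb_interval"
        done
    ) >/dev/null 2>&1 &
    ticker_pid=$!
}

stop_heartbeat_ticker() {
    kill "$ticker_pid" 2>/dev/null || true
    wait "$ticker_pid" 2>/dev/null || true
}
```

An event-driven monitor would call `start_heartbeat_ticker /var/run/<name>.heartbeat 10` before blocking on events, and `stop_heartbeat_ticker` on the way out.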

The heartbeat is touched BEFORE running the apply script so a slow
or hung apply doesn't itself trip the watchdog — we only care about
the monitor loop being alive, not about apply being fast.  If apply
is genuinely deadlocked the heartbeat will eventually go stale on
the *next* iteration that never starts.
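
The ordering can be sketched as below. `run_cycles` is a hypothetical, bounded version of the loop so the example terminates; the real monitors loop forever and call their own apply scripts.

```shell
#!/bin/sh
# run_cycles HEARTBEAT APPLY_CMD COUNT INTERVAL_SEC
# Runs COUNT monitor cycles; each cycle touches the heartbeat BEFORE apply.
run_cycles() {
    heartbeat=$1 apply_cmd=$2 count=$3 interval=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        touch "$heartbeat"      # heartbeat first: proves the loop is alive
        "$apply_cmd" || echo "apply failed (cycle $i)" >&2
        i=$((i + 1))
        sleep "$interval"
    done
}
```

With this ordering a slow apply delays only the next touch, so max_silence_sec has to exceed the loop interval plus the longest apply run that should be tolerated.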

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-applies the change from the prior abandoned branch (commit 9bbb7de
on fix/multiline-interface-block-truncation, never merged).

When apply-neighbor-poll-wrapper.sh re-establishes a bind-mount that
had previously been in place (evidenced by the .real backup file
existing and being populated), the recovery is logged with a loud
"!!! MOUNT LOST !!!" prefix and an ISO-8601 UTC timestamp, AND sent
to syslog at daemon.warning priority with tag neighbor-poll-wrapper.

This makes mount-loss events findable via journalctl and lets us
correlate them against UBIOS provisioning events, instead of being
indistinguishable from the routine "Bind-mounted ... -> /usr/sbin/arping"
line that fires on first install.
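
A sketch of what such a log call could look like follows. The exact message wording and the function names are assumptions; only the "!!! MOUNT LOST !!!" prefix, the ISO-8601 UTC timestamp, the daemon.warning priority, and the neighbor-poll-wrapper tag come from the description above.

```shell
#!/bin/sh
# format_mount_lost TARGET
# Builds the alert line: loud prefix, then an ISO-8601 UTC timestamp.
format_mount_lost() {
    printf '!!! MOUNT LOST !!! %s re-establishing bind-mount on %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1"
}

log_mount_lost() {
    msg=$(format_mount_lost "$1")
    echo "$msg"                      # the script's own log (stdout here)
    if command -v logger >/dev/null 2>&1; then
        logger -p daemon.warning -t neighbor-poll-wrapper "$msg" || true
    fi
}
```

On hosts that run journald, the tagged entries can then be pulled with `journalctl -t neighbor-poll-wrapper`.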

In production logs we previously saw: 8 arping mount-loss events over
9 days on edge1-mci1-net while ndisc6 was never lost — strongly
suggesting UBIOS specifically restores /usr/sbin/arping (which it
calls directly from nl-neighbors-poll) but not /usr/bin/ndisc6.  The
new alert lines will let us pin down the responsible UBIOS action
the next time it happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rules/conf/example.conf:
  Replaces the brief "IXP / peering fabric hygiene" section with a
  comprehensive recipe drawn from SIX's Linux configuration guide
  and our own incident analysis.  All directives are commented out
  by default; each is documented with the specific failure mode it
  prevents.

  New directives covered:
    • net.ipvX.neigh.<bridge>.mcast_solicit=1
        Limits broadcast ARP/ND probes per resolution attempt to 1
        instead of the kernel default of 3.  Critical for IX bridges
        when peers go unreachable: the kernel keeps re-probing in a
        STALE→DELAY→PROBE→broadcast cycle, and the default count of 3
        produces ~3× the broadcast volume per failed peer per cycle.
        Setting to 0 silences the kernel completely; useful as an
        emergency knob during IX-side filtering but breaks initial
        peer discovery so populate static neigh entries before
        relying on it long-term.

    • iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP
        Per IX policy, only the router's IX-assigned IP may originate
        packets toward the IX subnet.  Prevents accidental transit
        leaks (e.g. a customer getting a default route via this
        router) from reaching the fabric.

    • net.ipv6.conf.<bridge>.autoconf=0
    • net.ipv6.conf.<bridge>.router_solicitations=0
        Stop the bridge from sending IPv6 RS multicasts on link-up
        and from autoconfiguring addresses from RAs sent by random
        peers.  IX routers know their own addressing.

    • net.ipv4.conf.all.arp_filter=1
    • net.ipv4.conf.all.arp_announce=2
        Multi-homed ARP hygiene.  Without these, an ARP request
        arriving on the IX bridge can be answered with a MAC for an
        IP on a different interface — a common source of IX-fabric
        ACL violations.  Per SIX's published Linux guide.
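
Taken together, the directives above correspond roughly to the commands printed by this sketch. It only prints them (a dry run) and does not reflect the actual syntax of rules/conf/example.conf; the function name, the bridge name, and the subnet in the usage note are placeholders, and reading "ipvX" as both ipv4 and ipv6 is an assumption.

```shell
#!/bin/sh
# ix_hardening_cmds BRIDGE IX_SUBNET_V4
# Prints (does not run) the hardening commands for one IX bridge.
ix_hardening_cmds() {
    br=$1 subnet=$2
    cat <<EOF
sysctl -w net.ipv4.neigh.$br.mcast_solicit=1
sysctl -w net.ipv6.neigh.$br.mcast_solicit=1
sysctl -w net.ipv6.conf.$br.autoconf=0
sysctl -w net.ipv6.conf.$br.router_solicitations=0
sysctl -w net.ipv4.conf.all.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_announce=2
iptables -A FORWARD -o $br -d $subnet -j DROP
EOF
}
```

For example, `ix_hardening_cmds br100 192.0.2.0/24` prints seven lines that can be reviewed before any of them is applied as root.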

README.md:
  New "lib/ — Shared infrastructure (monitor watchdog)" section
  explaining the supervisor architecture, the heartbeat protocol the
  monitors implement, and how to install via
  install-monitor-watchdog-cron.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A peer review of the IX hardening recipe correctly identified that
the example.conf directives, while useful, are not the canonical
trust boundary for IX-policy compliance.  The hardware-enforced floor
is a MAC ACL on the upstream UniFi aggregation switch — three rules
per IX VLAN that catch every broadcast/multicast egress from the
router's MAC at the switch's forwarding ASIC, bypassing every
software failure mode the EFG could exhibit.

This commit:

  * Documents the canonical ACL shape inline at the top of the IXP
    hygiene section, including the broadened IPv6 multicast pattern
    (33:33:00:00:00:00 / 00:00:ff:ff:ff:ff) which catches RS, RA,
    MLD, and NS — not just NS as previously documented.

  * Adds a third recommended rule for L2 control multicast
    (01:80:c2:00:00:00/0f) covering STP, LACP, LLDP, and EAPOL.

  * Reframes the existing inject-rules.conf directives as
    "EFG-internal hygiene" / "belt-and-suspenders" rather than the
    primary defense, and notes they are NOT load-bearing for IX
    compliance once the ACL is in place.

This does not change any directive's behavior — it changes how
operators should think about the layering.  The agg-switch ACL is the
trust boundary; everything in inject-rules.conf is optimization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A second peer review correctly observed that the three-rule canonical
recipe leaves IPv4 multicast (01:00:5E:*) and Cisco/UBNT proprietary
multicast (01:00:0C:*) uncaught.  Neither should occur in steady
state on an IX-facing port from a UBIOS gateway, but two production-
plausible scenarios make them worth blocking explicitly:

  1. Future UBIOS firmware enabling mDNS/SSDP/UBNT-DISC on every
     bridge (the same auto-enable behavior we've observed for arping
     mount restoration); operators learn about it from an IX shutdown
     notice rather than a release note.

  2. A misconfigured VRRP, OSPF, or PIM group on the IX bridge
     leaking hello packets.

Both rules cost essentially nothing in switch TCAM space and turn the
ACL from "covers what we know breaks today" into "covers anything
that's not unicast IP from our MAC" — the same property that made the
existing three rules correct.

Updated documentation includes the rationale so future operators
extending this to new IX deployments understand why the rules look
broader than strictly necessary for any single observed bug.
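
As a way to reason about which frames the five patterns cover, the sketch below classifies a destination MAC into the corresponding bucket. It is an illustration only, not switch configuration: `classify_dst_mac` is a hypothetical helper, and treating the first of the original three rules as an all-ones broadcast match is an assumption, since the commit messages do not spell that rule out.

```shell
#!/bin/sh
# classify_dst_mac MAC
# Prints the ACL bucket a destination MAC address falls into.
classify_dst_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case $mac in
        ff:ff:ff:ff:ff:ff)        echo broadcast ;;         # assumed rule 1
        33:33:*)                  echo ipv6-multicast ;;    # RS, RA, MLD, NS
        01:80:c2:00:00:0[0-9a-f]) echo l2-control ;;        # STP, LACP, LLDP, EAPOL
        01:00:5e:*)               echo ipv4-multicast ;;    # mDNS, SSDP, VRRP, OSPF, PIM
        01:00:0c:*)               echo vendor-multicast ;;  # 01:00:0C:* range
        *)                        echo other ;;
    esac
}
```

Anything that lands in "other" is what the ACL is meant to let through, which matches the "unicast from our MAC only" framing above.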

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>