Add monitor watchdog supervisor + expanded IXP hardening recipe #2
Open
lunarthegrey wants to merge 6 commits into main
Production log evidence on edge1-mci1-net showed the
neighbor-poll-wrapper-monitor daemon going silent for a contiguous
2.5-hour window without leaving any error, restart sequence, or
"Killing old instance" line in its log. The shape of the gap is
consistent with the process being SIGSTOP'd by some UBIOS subsystem
during config sync and later SIGCONT'd, rather than killed and
restarted. While the monitor was stopped, its 30s-cadence cycle
didn't run, the arping bind-mount drifted off, and the SIX peering
fabric saw a storm — which contributed to SIX administratively
filtering the router off the fabric.
The same vulnerability exists in switch-config-monitor and
rules-monitor: all three background themselves once at boot and
nothing supervises them.
This commit introduces lib/, a shared infrastructure directory:
* lib/monitor-watchdog.sh: the supervisor itself
* lib/install-monitor-watchdog-cron.sh: idempotent cron installer
* lib/monitors.conf: registry of supervised monitors (5 fields per
  row: name, pidfile, heartbeat file, max_silence_sec, launcher)
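To make the five-field registry concrete, here is a minimal sketch of what two rows and a reader loop might look like. The whitespace delimiter, the launcher path, and the per-row thresholds are assumptions for illustration; the real lib/monitors.conf may differ. The pidfile and heartbeat paths follow the /var/run/<name> pattern used elsewhere in this PR.

```shell
#!/bin/sh
# Hypothetical registry contents: delimiter, launcher path and thresholds
# are illustrative assumptions, not copied from lib/monitors.conf.
registry=$(mktemp)
cat > "$registry" <<'EOF'
# name pidfile heartbeat max_silence_sec launcher
neighbor-poll-wrapper-monitor /var/run/neighbor-poll-wrapper-monitor.pid /var/run/neighbor-poll-wrapper-monitor.heartbeat 90 /data/on_boot.d/example-launcher.sh
rules-monitor /var/run/rules-monitor.pid /var/run/rules-monitor.heartbeat 60 /data/on_boot.d/example-launcher.sh
EOF

# Read one row at a time, skipping blank lines and comment lines.
list_monitors() {
    while read -r name pidfile heartbeat max_silence launcher; do
        case $name in ''|'#'*) continue ;; esac
        echo "$name: restart via $launcher if $heartbeat is older than ${max_silence}s"
    done < "$1"
}

summary=$(list_monitors "$registry")
printf '%s\n' "$summary"
rm -f "$registry"
```

Keeping the registry as data rather than code means a new monitor can be supervised by adding one row, without touching the supervisor.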
Why cron and not systemd:
cron lives in the system service set, runs outside our process
tree, and survives the same UBIOS reprovisioning cycles that
silently SIGSTOP'd our user-space monitors. /etc/cron.d entries
also persist across UBIOS firmware upgrades more reliably than
/etc/systemd/system additions. A future Phase-2 migration to
systemd units with WatchdogSec= would improve recovery latency from
~2min to ~10s, but cron is the lowest-friction path that works
today on every UBIOS version we've tested.
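As a rough illustration of the idempotent-installer idea, the sketch below writes a one-line /etc/cron.d-style entry only when the file content differs. The function name and the watchdog path in the usage note are hypothetical; the actual lib/install-monitor-watchdog-cron.sh may be structured differently.

```shell
#!/bin/sh
# Sketch only: install_watchdog_cron is a hypothetical helper.
# It writes a once-a-minute cron.d entry (cron.d lines carry a user field).
install_watchdog_cron() {
    cron_file=$1
    watchdog=$2
    entry="* * * * * root $watchdog"
    # Rewrite only when the content differs, so repeated runs are no-ops.
    if [ -f "$cron_file" ] && [ "$(cat "$cron_file")" = "$entry" ]; then
        echo "unchanged"
        return 0
    fi
    printf '%s\n' "$entry" > "$cron_file"
    chmod 644 "$cron_file"
    echo "installed"
}
```

On a router this might be called as `install_watchdog_cron /etc/cron.d/unifi-scripts-monitor-watchdog /path/to/lib/monitor-watchdog.sh`, where the second argument is wherever the repo is checked out.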
How it detects stuck (not just dead):
Each supervised monitor touches a heartbeat file every cycle. The
watchdog reads its mtime and treats anything older than
max_silence_sec as stuck — which catches SIGSTOP'd processes whose
PID is still valid but which aren't making progress. PID liveness
alone wouldn't catch this case.
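The staleness check described above can be sketched in a few lines. The function name is_stale is hypothetical, and treating a missing heartbeat file as stale is an assumption of this sketch rather than something the commit states.

```shell
#!/bin/sh
# Sketch only: is_stale is a hypothetical name, not the watchdog's code.
# Returns 0 (stale) when the heartbeat file is missing or its mtime is
# older than max_silence_sec; returns 1 (fresh) otherwise.
is_stale() {
    hb_file=$1
    max_silence_sec=$2
    [ -e "$hb_file" ] || return 0               # never written: treat as stuck
    now=$(date +%s)
    mtime=$(stat -c %Y "$hb_file") || return 0  # unreadable: treat as stuck
    [ $((now - mtime)) -gt "$max_silence_sec" ]
}
```

A stopped process keeps its PID, so `kill -0 "$pid"` would still succeed; comparing the heartbeat mtime against the clock is what exposes the lack of progress.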
How it kills (vs the existing in-monitor "kill old instance"):
Our existing monitors send SIGTERM to old instances. SIGTERM
cannot be delivered to a SIGSTOP'd process; it sits queued until a
SIGCONT. The watchdog uses SIGKILL, which is always delivered,
even to stopped processes.
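The difference between the two signals can be reproduced on a throwaway process. This is a demonstration against `sleep`, not against a real monitor, and the variable names are only for illustration.

```shell
#!/bin/sh
# Demonstration on a disposable background process.
sleep 300 &
pid=$!

kill -STOP "$pid"          # freeze it, as in the suspected failure mode
kill -TERM "$pid"          # stays pending: a stopped process cannot act on it
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    term_result="still present after SIGTERM"
else
    term_result="gone after SIGTERM"
fi

kill -KILL "$pid"                    # SIGKILL takes effect even while stopped
wait "$pid" 2>/dev/null || true      # reap the child so the PID is released
if kill -0 "$pid" 2>/dev/null; then
    kill_result="still present after SIGKILL"
else
    kill_result="gone after SIGKILL"
fi
echo "$term_result; $kill_result"
```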
Worst-case recovery latency:
max_silence_sec (60-90s) + 60s cron tick = 120-150s, i.e. roughly
2-2.5min, vs the unbounded gap with no supervisor.
Heartbeat plumbing in the existing monitors lands in a follow-up
commit so the diff for each monitor stays narrowly reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-monitor heartbeat file at /var/run/<name>.heartbeat that
gets touched on every loop iteration. lib/monitor-watchdog.sh reads
its mtime to detect a stuck monitor (PID alive but not making
progress, e.g. SIGSTOP'd by UBIOS during config sync) and restart it.
Per monitor:
* switch-config-monitor — touch at top of main while-loop
* neighbor-poll-wrapper-monitor — same pattern
* rules-monitor — needs a dedicated 10s ticker subprocess because
it's event-driven (event-only heartbeats wouldn't fire on hosts
where the kernel/netlink/sysctl events are quiet, leaving the
watchdog to mistake a healthy idle monitor for a stuck one).
The ticker runs in its own subshell inside _monitor() so it dies
if the parent dies.
The heartbeat is touched BEFORE running the apply script so a slow
or hung apply doesn't itself trip the watchdog — we only care about
the monitor loop being alive, not about apply being fast. If apply
is genuinely deadlocked the heartbeat will eventually go stale on
the *next* iteration that never starts.
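The two heartbeat patterns might look roughly like the sketch below. Function names, arguments, and the bounded iteration count are placeholders for illustration (the real monitors loop forever), and this sketch stops the ticker explicitly rather than relying on shell job semantics.

```shell
#!/bin/sh
# Sketch only: names and arguments are hypothetical, not the monitors' code.

# Polling monitors: touch first, then run the (possibly slow) apply step.
poll_loop() {
    heartbeat=$1; interval=$2; iterations=$3; shift 3
    i=0
    while [ "$i" -lt "$iterations" ]; do
        touch "$heartbeat"      # liveness signal precedes the apply step
        "$@" || true            # an apply failure does not stop the loop
        i=$((i + 1))
        sleep "$interval"
    done
}

# Event-driven monitor: a background ticker keeps the heartbeat fresh
# even when no events arrive. The caller is expected to stop it on exit.
start_ticker() {
    heartbeat=$1; interval=$2
    ( while :; do touch "$heartbeat"; sleep "$interval"; done ) &
    ticker_pid=$!
}
```

A caller would typically pair `start_ticker /var/run/rules-monitor.heartbeat 10` with `trap 'kill "$ticker_pid"' EXIT` so the ticker does not outlive a monitor that exits normally.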
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-applies the change from the prior abandoned branch (commit 9bbb7de
on fix/multiline-interface-block-truncation, never merged).
When apply-neighbor-poll-wrapper.sh re-establishes a bind-mount that
had previously been in place (evidenced by the .real backup file
existing and being populated), the recovery is logged with a loud
"!!! MOUNT LOST !!!" prefix and an ISO-8601 UTC timestamp, AND sent to
syslog at daemon.warning priority with tag neighbor-poll-wrapper.
This makes mount-loss events findable via journalctl and lets us
correlate them against UBIOS provisioning events, instead of being
indistinguishable from the routine "Bind-mounted ... ->
/usr/sbin/arping" line that fires on first install.
In production logs we previously saw 8 arping mount-loss events over
9 days on edge1-mci1-net while ndisc6 was never lost, which strongly
suggests UBIOS specifically restores /usr/sbin/arping (which it calls
directly from nl-neighbors-poll) but not /usr/bin/ndisc6. The new
alert lines will let us pin down the responsible UBIOS action the
next time it happens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
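A minimal sketch of the alert line described in this commit, assuming a helper named log_mount_lost and a caller-supplied log path (both hypothetical). The exact wording after the prefix and timestamp is also an assumption.

```shell
#!/bin/sh
# Sketch only: log_mount_lost and its arguments are hypothetical.
log_mount_lost() {
    target=$1
    log_file=$2
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)          # ISO-8601, UTC
    msg="!!! MOUNT LOST !!! $ts re-established bind-mount on $target"
    printf '%s\n' "$msg" >> "$log_file"
    # Mirror to syslog when logger(1) is available; never fail the caller.
    if command -v logger >/dev/null 2>&1; then
        logger -p daemon.warning -t neighbor-poll-wrapper "$msg" 2>/dev/null || true
    fi
}
```

With the tag set this way, `journalctl -t neighbor-poll-wrapper` narrows the journal to these events for correlation against UBIOS provisioning activity.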
rules/conf/example.conf:
Replaces the brief "IXP / peering fabric hygiene" section with a
comprehensive recipe drawn from SIX's Linux configuration guide
and our own incident analysis. All directives are commented out
by default; each is documented with the specific failure mode it
prevents.
New directives covered:
• net.ipvX.neigh.<bridge>.mcast_solicit=1
Limits broadcast ARP/ND probes per resolution attempt to 1
instead of the kernel default of 3. Critical for IX bridges
when peers go unreachable: the kernel keeps re-probing in a
STALE→DELAY→PROBE→broadcast cycle, and the default count of 3
produces ~3× the broadcast volume per failed peer per cycle.
Setting to 0 silences the kernel completely; useful as an
emergency knob during IX-side filtering but breaks initial
peer discovery so populate static neigh entries before
relying on it long-term.
• iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP
Per IX policy, only the router's IX-assigned IP may originate
packets toward the IX subnet. Prevents accidental transit
leaks (e.g. a customer getting a default route via this
router) from reaching the fabric.
• net.ipv6.conf.<bridge>.autoconf=0
• net.ipv6.conf.<bridge>.router_solicitations=0
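Stop the bridge from sending IPv6 RS multicasts on link-up
and from autoconfiguring addresses out of RAs sent by random
peers (autoconf=0 disables SLAAC; accepting RAs at all is
governed separately by accept_ra). IX routers know their own
addressing.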
• net.ipv4.conf.all.arp_filter=1
• net.ipv4.conf.all.arp_announce=2
Multi-homed ARP hygiene. Without these, an ARP request
arriving on the IX bridge can be answered with a MAC for an
IP on a different interface — a common source of IX-fabric
ACL violations. Per SIX's published Linux guide.
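For reference, the sysctl keys named above expand as follows for a given bridge. This sketch only prints them for review; the helper name is hypothetical, and the directive syntax that example.conf actually accepts may differ from plain key=value lines.

```shell
#!/bin/sh
# Sketch only: prints the settings for review; it does not apply them.
ix_sysctls() {
    bridge=$1
    # net.ipvX.neigh.<bridge>.mcast_solicit covers both address families.
    for fam in ipv4 ipv6; do
        echo "net.$fam.neigh.$bridge.mcast_solicit=1"
    done
    echo "net.ipv6.conf.$bridge.autoconf=0"
    echo "net.ipv6.conf.$bridge.router_solicitations=0"
    echo "net.ipv4.conf.all.arp_filter=1"
    echo "net.ipv4.conf.all.arp_announce=2"
}

ix_sysctls brXXXX
```

Once brXXXX is replaced with the real bridge name, each printed line could be applied by hand with `sysctl -w` to trial a setting before adopting it in the host conf.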
README.md:
New "lib/ — Shared infrastructure (monitor watchdog)" section
explaining the supervisor architecture, the heartbeat protocol the
monitors implement, and how to install via
install-monitor-watchdog-cron.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A peer review of the IX hardening recipe correctly identified that
the example.conf directives, while useful, are not the canonical
trust boundary for IX-policy compliance. The hardware-enforced floor
is a MAC ACL on the upstream UniFi aggregation switch — three rules
per IX VLAN that catch every broadcast/multicast egress from the
router's MAC at the switch's forwarding ASIC, bypassing every
software failure mode the EFG could exhibit.
This commit:
* Documents the canonical ACL shape inline at the top of the IXP
hygiene section, including the broadened IPv6 multicast pattern
(33:33:00:00:00:00 / 00:00:ff:ff:ff:ff) which catches RS, RA,
MLD, and NS — not just NS as previously documented.
* Adds a third recommended rule for L2 control multicast
(01:80:c2:00:00:00/0f) covering STP, LACP, LLDP, and EAPOL.
* Reframes the existing inject-rules.conf directives as
"EFG-internal hygiene" / "belt-and-suspenders" rather than the
primary defense, and notes they are NOT load-bearing for IX
compliance once the ACL is in place.
This does not change any directive's behavior — it changes how
operators should think about the layering. The agg-switch ACL is the
trust boundary; everything in inject-rules.conf is optimization.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second peer review correctly observed that the three-rule canonical
recipe leaves IPv4 multicast (01:00:5E:*) and Cisco/UBNT proprietary
multicast (01:00:0C:*) uncaught. Neither should occur in steady
state on an IX-facing port from a UBIOS gateway, but two production-
plausible scenarios make them worth blocking explicitly:
1. Future UBIOS firmware enabling mDNS/SSDP/UBNT-DISC on every
bridge (the same auto-enable behavior we've observed for arping
mount restoration); operators learn about it from an IX shutdown
notice rather than a release note.
2. A misconfigured VRRP, OSPF, or PIM group on the IX bridge
leaking hello packets.
Both rules cost essentially nothing in switch TCAM space and turn the
ACL from "covers what we know breaks today" into "covers anything
that's not unicast IP from our MAC" — the same property that made the
existing three rules correct.
Updated documentation includes the rationale so future operators
extending this to new IX deployments understand why the rules look
broader than strictly necessary for any single observed bug.
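To show what the five patterns cover, the sketch below classifies a destination MAC by prefix. It illustrates the matching logic only, not UniFi switch ACL syntax, and it ignores the source-MAC and VLAN conditions. Treating broadcast as the first of the original three rules is an assumption; the commits name only the IPv6 and L2-control patterns explicitly.

```shell
#!/bin/sh
# Sketch only: classify_dst_mac is a hypothetical helper for reasoning
# about which destination MACs the ACL patterns would catch.
classify_dst_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case $mac in
        ff:ff:ff:ff:ff:ff)  echo "drop: broadcast" ;;
        33:33:*)            echo "drop: IPv6 multicast" ;;
        01:80:c2:00:00:0?)  echo "drop: L2 control multicast" ;;
        01:00:5e:*)         echo "drop: IPv4 multicast" ;;
        01:00:0c:*)         echo "drop: proprietary multicast" ;;
        *)                  echo "pass" ;;
    esac
}
```

For example, `classify_dst_mac 01:80:C2:00:00:0E` (the LLDP address) falls under the L2 control rule, while a unicast peer MAC falls through to pass.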
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
* Cron-based monitor watchdog with heartbeat-staleness detection for all three monitors (rules-monitor, switch-config-monitor, neighbor-poll-wrapper-monitor). switch-config-monitor and neighbor-poll-wrapper-monitor touch /var/run/<name>.heartbeat at the top of their main loops; the event-driven rules-monitor gets a dedicated 10s ticker subprocess.
* apply-neighbor-poll-wrapper.sh: re-mounts after loss now emit !!! MOUNT LOST !!! plus an ISO-8601 UTC timestamp, and a daemon.warning syslog message with tag neighbor-poll-wrapper. (Reapplied from prior abandoned branch.)
* rules/conf/example.conf: expanded IXP hardening recipe covering mcast_solicit, FORWARD-to-IX-subnet drops, IPv6 autoconf/RS suppression, and global arp_filter/arp_announce per SIX's published Linux guide.
Why
Production log evidence on edge1-mci1-net showed neighbor-poll-wrapper-monitor going silent for a contiguous 2.5-hour window (2026-04-24 02:38Z → 05:08Z) without any error, restart sequence, or "Killing old instance" line. PID was still in the pidfile but the process wasn't making progress. Shape is consistent with SIGSTOP from a UBIOS subsystem during config sync, later resumed by SIGCONT.
While the monitor was paused:
* The arping bind-mount was lost (/usr/sbin/arping restored from package; observed 8 times over 9 days; ndisc6 mount never lost).
* ubios-udapi-server's nl-neighbors-poll started hammering the real arping at ~44 calls/min on the SIX bridge.
The existing monitors all have the same vulnerability: launched once from /data/on_boot.d, nothing supervises them, and their internal kill <old_pid> cleanup uses SIGTERM, which a SIGSTOP'd process does not act on until it is resumed. This PR addresses both gaps.
Failure modes addressed
* SIGSTOP'd but PID still valid
* mcast_solicit=1 in example.conf for the host conf to adopt
* iptables -A FORWARD -o brXXXX -d <IX-subnet> -j DROP
What this does NOT fix yet
* /proc/<pid>/status transitions over time via the watchdog and correlate with UBIOS events.
* max_silence_sec (60-90s) + 60s cron tick is the floor here; systemd units with WatchdogSec= would do better. Phase-2 once we've validated systemd-on-UBIOS path.
Test plan
* edge1-mci1-net (currently SIX-blocked, low risk):
* kill -STOP $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect watchdog to detect within 90s and restart
* kill -KILL $(cat /var/run/neighbor-poll-wrapper-monitor.pid) → expect restart within ~60s (next cron tick)
* journalctl -t monitor-watchdog --since "5 minutes ago" shows the warning line
* /var/log/neighbor-poll-wrapper-monitor.log and confirm zero gaps >90s
* cat /etc/cron.d/unifi-scripts-monitor-watchdog
* stat /var/run/monitor-watchdog.heartbeat mtime advances every minute
Commits
* c874df6 lib: add cron-based monitor watchdog with heartbeat-staleness detection
* 731488e monitors: emit heartbeat file each cycle for monitor-watchdog
* efed0f6 neighbor-poll-wrapper: loudly log and syslog wrapper bind-mount loss
* f1992a8 docs: expand IXP hygiene recipe; document lib/monitor-watchdog
Each is reviewable in isolation.
🤖 Generated with Claude Code