fix(hostagent): ensure VFs are unmanaged by NetworkManager via persistent udev rule #52

Open
tsorya wants to merge 6 commits into NVIDIA:public-release-v26.4 from tsorya:igal/nm-unmanaged-vfs-pr

tsorya commented May 22, 2026

Summary

  • Add EnsureVFsUnmanaged() to the Backend interface so NM-specific udev rule logic is encapsulated in NetworkManagerBackend while systemd-networkd cleanly no-ops
  • Move udev rule logic from hostagent/util/udev.go into netconfig/nm_udev.go
  • Mount /etc/udev/rules.d from the host into the hostagent container (HostPathDirectoryOrCreate) to persist the NM unmanaged rule across reboots
  • Skip udevadm reload/trigger when the rule file is already up-to-date (idempotency for the 30s reconcile loop)
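
To illustrate the first bullet, here is a minimal sketch of how the interface split might look. Only `EnsureVFsUnmanaged()`, `Backend`, and `NetworkManagerBackend` are names from this PR; `NetworkdBackend`, the `ensureRule` field, and `reconcile` are hypothetical stand-ins, and the real interface almost certainly has more methods.

```go
package main

import "errors"

// Backend is a trimmed-down, hypothetical view of the netconfig Backend
// interface; only the method discussed in this PR is shown.
type Backend interface {
	// EnsureVFsUnmanaged makes sure VFs will not be picked up by the host's
	// network manager. Backends that have nothing to do return nil.
	EnsureVFsUnmanaged() error
}

// NetworkManagerBackend delegates to an injectable function so the udev
// logic can be swapped out in tests (an assumed design, not taken from the PR).
type NetworkManagerBackend struct {
	ensureRule func() error
}

func (b *NetworkManagerBackend) EnsureVFsUnmanaged() error {
	if b.ensureRule == nil {
		return errors.New("no udev rule installer configured")
	}
	return b.ensureRule()
}

// NetworkdBackend stands in for the systemd-networkd backend. NM_UNMANAGED is
// specific to NetworkManager, so this backend is a clean no-op.
type NetworkdBackend struct{}

func (NetworkdBackend) EnsureVFsUnmanaged() error { return nil }

// reconcile shows the caller's side: it does not need to know which backend
// is active.
func reconcile(b Backend) error {
	return b.EnsureVFsUnmanaged()
}
```

With this shape the reconcile loop can call `EnsureVFsUnmanaged()` unconditionally, and backend-specific knowledge stays out of the hostagent utilities.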

Background

NetworkManager only evaluates NM_UNMANAGED when a device first appears. The rule must exist in persistent /etc/udev/rules.d/ before VFs are created so NM never manages them. Previously the file was written to the container's ephemeral filesystem, invisible to the host.

Test plan

  • Unit tests (nm_udev_test.go): rule written + udevadm triggered on first run, idempotent skip, overwrite on mismatch, mkdir parents, error paths
  • Deploy hostdriver + controller and verify /etc/udev/rules.d/10-nm-unmanaged.rules on host
  • Verify VFs show unmanaged in nmcli device status

tsorya and others added 5 commits May 22, 2026 00:58
…tent udev rule

Add EnsureVFsUnmanaged() to the Backend interface so that NM-specific
udev rule logic is encapsulated in NetworkManagerBackend while
systemd-networkd cleanly no-ops.

Key changes:
- Move udev rule logic from hostagent/util into netconfig/nm_udev.go
- Mount /etc/udev/rules.d from host into hostagent container to persist
  the rule across reboots (HostPathDirectoryOrCreate)
- Skip udevadm reload/trigger when rule file is already up-to-date

NetworkManager only evaluates NM_UNMANAGED when a device first appears,
so the rule must be present before VFs are created.

Co-authored-by: Cursor <cursoragent@cursor.com>
…and MTU flapping

Four related issues caused hostagent networking failures on hosts with
SR-IOV VFs as bridge members:

1. VF MTU via netlink: VFs (e.g. ens7f0v0) that are bridge members cannot
   have their MTU configured through NetworkManager — NM activation fails
   because VF connection profiles conflict with the PF-managed config.
   Detect VFs via sysfs physfn symlink and set their MTU directly via
   netlink instead.

2. MAC address stripping: When reading a NM connection profile and writing
   it back (round-trip), the 802-3-ethernet.mac-address property causes
   Update failures for VFs whose MAC is managed by the PF. Add mac-address
   to unsafeRoundtripProps so it is stripped before calling NM Update.

3. Profile MTU check to prevent flapping: When the NM profile already has
   the desired MTU but the kernel link MTU temporarily differs (e.g. after
   driver reload), the hostagent would re-activate the connection on every
   reconcile loop, bouncing the interface. Check the NM profile MTU first
   and skip activation when it already matches.

4. Set interface-name on bridge member connections: Without interface-name,
   NM may match the wrong connection profile when multiple connections exist
   for the same interface type.

Co-authored-by: Cursor <cursoragent@cursor.com>
AddNetworkRequest returned early when a request already existed,
ignoring the provided vfCount. It now updates NumOfVFs and persists
the change so the next processing cycle applies it.
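
The update-on-existing behavior might look like the sketch below. Only `AddNetworkRequest`, `NumOfVFs`, and `vfCount` come from the commit message; the `requestStore` type, its fields, the `persist` hook, and the function signature are hypothetical.

```go
package main

import "sync"

// NetworkRequest is a hypothetical, reduced request record.
type NetworkRequest struct {
	Name     string
	NumOfVFs int
}

// requestStore is a hypothetical in-memory store with a persistence hook.
type requestStore struct {
	mu       sync.Mutex
	requests map[string]*NetworkRequest
	persist  func(NetworkRequest) error
}

// AddNetworkRequest creates the request if it is new. If it already exists,
// it no longer returns early: a changed vfCount is written to NumOfVFs and
// persisted so the next processing cycle picks it up.
func (s *requestStore) AddNetworkRequest(name string, vfCount int) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if req, ok := s.requests[name]; ok {
		if req.NumOfVFs == vfCount {
			return nil // nothing changed, avoid a redundant write
		}
		updated := *req
		updated.NumOfVFs = vfCount
		if err := s.persist(updated); err != nil {
			return err // keep the in-memory state unchanged on failure
		}
		*req = updated
		return nil
	}

	req := NetworkRequest{Name: name, NumOfVFs: vfCount}
	if err := s.persist(req); err != nil {
		return err
	}
	s.requests[name] = &req
	return nil
}
```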
Comment thread internal/provisioning/hostagent/util/netconfig/nm_udev.go
If writeUdevRuleFile() succeeds but reloadAndTriggerUdev() fails, the
next reconcile would see the file already up-to-date and skip the
reload/trigger, leaving the system in a broken state.

Add an in-memory udevRulesApplied flag that only becomes true after a
successful reload/trigger. This ensures retry on the next reconcile
while still skipping redundant udevadm calls in the steady state.

Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

3 participants