ls-apis needs to detect cycles in dependency unit graph #9707

Open
davepacheco wants to merge 17 commits into main from dap/ls-apis-fix

Conversation

@davepacheco (Collaborator)

Fixes #9706.

This is a prototype.

@davepacheco self-assigned this Jan 22, 2026
@davepacheco (Collaborator, Author) left a comment:

@sunshowers what do you think of this?

| ("mgd", "ddm-admin-client")
| ("mgd", "dpd-client")
| ("mgd", "gateway-client")
| ("omicron-sled-agent", "gateway-client")
Contributor:

Ugh. Every sled-agent also talks to both MGS instances. I'm not sure it strictly needs to, but we'd need to talk to the networking folks to understand whether we can take it out. This happens here:

info!(
self.log, "Querying MGS to determine switch location";
"addr" => %addr,
);
let switch_slot = mgs_client
.sp_local_switch_id()
.await
.with_context(|| format!("Failed to query MGS at {addr}"))?
.into_inner()
.slot;

The callers are:

  • RSS, so it can pass a HashMap<SwitchLocation, Ipv6Addr> to Nexus (it seems like, in principle, Nexus could figure this out on its own?) -- see the sketch after this list
  • OPTE port creation, so it can match up "which switch(es) have configured uplinks" with "what's the IP of the dendrite on that switch" (maybe we could always try to look up all dendrite instances regardless of configured uplinks, which would mean we wouldn't need to talk to MGS?)
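
For context, here's a rough sketch of the kind of mapping the MGS query above is used to build. Everything here is hypothetical (it is not the actual RSS code), and SwitchLocation is a stand-in enum rather than necessarily omicron's real type:

use std::collections::HashMap;
use std::net::Ipv6Addr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

// Map each switch zone address to the slot its local MGS reported.
fn switch_map(reported: &[(Ipv6Addr, u16)]) -> HashMap<SwitchLocation, Ipv6Addr> {
    reported
        .iter()
        .map(|&(addr, slot)| {
            let loc =
                if slot == 0 { SwitchLocation::Switch0 } else { SwitchLocation::Switch1 };
            (loc, addr)
        })
        .collect()
}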

@davepacheco (Collaborator, Author) left a comment:

I'm verifying the specific cases here one by one and have found a few mistakes so far. @sunshowers, did you verify Claude's analysis here:
#9708 (comment)

I found these errors by manually grepping each repo for the corresponding client. For example, to look at how sled agent uses MGS, I went into omicron, cd'd into sled-agent, ran git grep gateway_client, and found a use that Claude missed. I also looked at the one it did report, but I think that analysis is incorrect -- that one is always on the local system (not localhost, but localhost isn't really the constraint we care about). Could you check these? (I stopped partway through.)

Comment on lines 607 to 610
[[localhost_only_edges]]
server = "ddmd"
client = "dpd-client"
note = "verified: sled-agent configures ddmd with dpd_host=[::1] (services.rs:3127)"
@davepacheco (Collaborator, Author):

ddmd runs in both the global zone and the switch zone (see the documentation at L261-L268 above). The referenced code at services.rs L3127 appears to cover only the instance in the switch zone. I'm not sure where the GZ one has dpd configured, or whether it does at all. Looking at sled 14 on rack2, the dendrite-related SMF properties are empty strings, and the start method does not seem to have configured the corresponding CLI arguments:

BRM42220051 # svcprop -p config svc:/oxide/mg-ddm:default | grep dpd_
config/dpd_host astring ""
config/dpd_port astring ""

BRM42220051 # svcs -p mg-ddm
STATE          STIME    FMRI
online         1986     svc:/oxide/mg-ddm:default
               1986          671 ctrun
BRM42220051 # ptree 671
671    ctrun -l child -o noorphan,regent /opt/oxide/mg-ddm/pkg/ddm_method_script.sh
  672    /opt/oxide/mg-ddm/ddmd --with-stats --admin-port 8000 --admin-addr ::1 --ki
BRM42220051 # pargs 672
672:    /opt/oxide/mg-ddm/ddmd --with-stats --admin-port 8000 --admin-addr ::1 --kind s
argv[0]: /opt/oxide/mg-ddm/ddmd
argv[1]: --with-stats
argv[2]: --admin-port
argv[3]: 8000
argv[4]: --admin-addr
argv[5]: ::1
argv[6]: --kind
argv[7]: server
argv[8]: -a
argv[9]: cxgbe0/ll
argv[10]: -a
argv[11]: cxgbe1/ll
argv[12]: --dns-servers
argv[13]: unknown

It's worth confirming the expected behavior here. Either way, the note here is incomplete.

Contributor:

Looks like it's configured at https://github.com/oxidecomputer/maghemite/blob/396bb3c/smf/ddm/manifest.xml#L25 -- added a link to that.


[[localhost_only_edges]]
server = "dpd"
client = "gateway-client"
@davepacheco (Collaborator, Author):

Looking at the referenced source: it looks like someone could override this by providing a CLI argument to dpd. What do you think about seeing if we can put that CLI option behind a dev-only feature flag so that shipping code can't do this? (I'd of course talk to Levon about this before proceeding.)

(I'd do this as a follow-up and just mention it in the note for now.)

Edit: maybe a simpler solution would be to remove these from the SMF start script, if that's possible? I noticed that the mgd -> dendrite dependency looks similar (can be overridden by a CLI argument), but the method script doesn't ever set the --mgs-addr option, which makes it pretty unlikely this would get broken in production.
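
For the record, here's a minimal sketch of the feature-flag idea, assuming clap. The dev-overrides feature name (which would need to be declared in Cargo.toml), the mgs_addr flag, and the port are all hypothetical, not dpd's actual CLI:

use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Override the address used to reach MGS (only exists in dev builds).
    #[cfg(feature = "dev-overrides")]
    #[arg(long)]
    mgs_addr: Option<std::net::SocketAddrV6>,
}

fn main() {
    let args = Args::parse();
    let _ = &args; // keep `args` used even when the field is compiled out

    // Default: the MGS instance in the local switch zone. The port shown
    // here is illustrative, not necessarily the real MGS port.
    let default: std::net::SocketAddrV6 = "[::1]:12225".parse().unwrap();

    // In builds with the (hypothetical) dev-only feature, the flag can
    // override the default; in shipping builds the flag does not exist.
    #[cfg(feature = "dev-overrides")]
    let mgs_addr = args.mgs_addr.unwrap_or(default);
    #[cfg(not(feature = "dev-overrides"))]
    let mgs_addr = default;

    println!("using MGS at {mgs_addr}");
}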

Comment on lines 617 to 620
[[localhost_only_edges]]
server = "lldpd"
client = "dpd-client"
note = "verified: lldpd defaults to localhost for dpd (lldp dendrite.rs:194)"
@davepacheco (Collaborator, Author):

Similarly, it looks like you can configure lldpd to point to a specific dpd with a CLI option. Interestingly, that one is behind a feature flag. I wonder which one we ship with?

Comment on lines 637 to 639
# NOTE: The following sled-agent edges are BUGS - they use underlay IPs, not
# localhost. They are kept here to avoid breaking the build, but need to be
# fixed, either through client-side versioning or by some other means.
@davepacheco (Collaborator, Author):

This comment does not seem accurate. At least, the second item down (L646-L649) is not labeled "BUG" like the others and does not seem like a bug.

Comment on lines 641 to 644
[[localhost_only_edges]]
server = "omicron-sled-agent"
client = "gateway-client"
note = "BUG: uses underlay IP, not localhost (early_networking.rs:376)"
@davepacheco (Collaborator, Author):

There appear to be two uses of gateway_client in sled-agent/src/bootstrap/early_networking.rs. The one Claude flagged actually seems fine, but there's a different one that it missed that looks problematic.

The one it flagged here is in init_switch_config(), where the comment and function imply that we are always talking about the local switch zone.

The one it missed is in lookup_switch_zone_underlay_addrs_one_attempt(), where the address comes from DNS, and so that could indeed wind up talking to a different sled's switch zone. That one does seem problematic.

@davepacheco (Collaborator, Author):

Offline, I discussed this with @sunshowers: although I nudged us down this path, the complexity added here (both in what the reader/author of the manifest has to deal with and in the implementation) seems like way more than I would have expected. I think it goes something like this:

  • We're trying to use filter rules to annotate that certain edges are intra-deployment-unit only.
  • Prior to this PR, a particular dependency path is not allowed to match more than one filter rule, and none does. This is not strictly required, but it was a choice I made early on to keep the system understandable and debuggable. (It avoids needing rules to define what happens when multiple rules match, having to understand how those rules fit together to achieve the author's intent, and needing to build tooling to dump out what's going on when a human is inevitably confused by what the tool did.)
  • However, the way rules are defined today, there is necessarily some overlap between this new kind of rule and some existing rules. The case we uncovered today was dpd:
dap@ivanova dendrite $ cargo tree -i gateway-client -p dpd
gateway-client v0.1.0 (https://github.com/oxidecomputer/omicron?branch=main#8e30605f)
├── dpd v0.2.0 (/home/dap/dendrite/dpd)
└── nexus-types v0.1.0 (https://github.com/oxidecomputer/omicron?branch=main#8e30605f)
    └── nexus-client v0.1.0 (https://github.com/oxidecomputer/omicron?branch=main#8e30605f)
        ├── dpd v0.2.0 (/home/dap/dendrite/dpd)
        └── oximeter-producer v0.1.0 (https://github.com/oxidecomputer/omicron?branch=main#8e30605f)
            └── dpd v0.2.0 (/home/dap/dendrite/dpd)

There are three paths here from dpd (the server component) to gateway-client (the client). The direct dependency dpd -> gateway_client is legitimate and needs to be marked intra-deployment-unit-only. The other two paths are bogus, and there's a rule marking them as such (any path through nexus-client to gateway-client gets marked bogus). However, there's no obvious way in the metadata to discriminate between the legitimate path and the other two: all of them have the ancestor (or server) dpd and the client gateway-client. So implementing this behavior with the existing filter rules requires relaxing the constraint that a path never matches more than one rule.

That's okay, but how would we relax it? This PR essentially says: with this new kind of rule, we divide filter rules into two groups, those using ancestor and those using server, and a path may match at most one rule from each group (a rough sketch of this constraint follows the list below). There's some non-obvious behavior in evaluating the filter to make this work, and that's where a lot of the complexity that makes me uncomfortable lies:

  • there are now two categories of rules, separately from (but not orthogonal to) the evaluation
  • paths are now allowed to match multiple rules, with some non-obvious logic about when that's allowed and what happens when that happens.
  • there are now two types that logically represent the same thing, which is how we evaluate a particular API dependency (Evaluation/DependencyEvaluation)
  • what used to be a simple match is now a function call to the new type
  • there are now two types for RawDependencyFilterRule/DependencyFilterRule (this might be fine, but just noting it currently only exists to validate the two categories of rule)
  • should_include now takes both the server and the dep_path
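
Roughly, the relaxed constraint looks something like this -- hypothetical types only, not the actual ls-apis code:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuleGroup {
    ByAncestor, // rules keyed on an ancestor package in the dependency path
    ByServer,   // rules keyed on the server component itself
}

struct FilterRule {
    group: RuleGroup,
    // matching criteria elided
}

// A dependency path may now match several rules, but at most one per group.
fn check_matches(matched: &[&FilterRule]) -> Result<(), String> {
    for group in [RuleGroup::ByAncestor, RuleGroup::ByServer] {
        let count = matched.iter().filter(|rule| rule.group == group).count();
        if count > 1 {
            return Err(format!(
                "path matched {count} rules in group {group:?}; at most one allowed"
            ));
        }
    }
    Ok(())
}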

One alternative I considered would resolve this problem a different way: say that paths can match more than one rule only if all but one of the matches are bogus, and in that case the bogus matches are ignored (a rough sketch of this follows below). That may be plausible, but as I went to prototype it, I realized that even adding an evaluation like intra-deployment-unit is a little goofy because of the way it overlaps with the existing dag evaluation. That is, the docs I was going to write would say "this is like dag, but also implies ...". Maybe instead we could make this a new property of the rule that applies only to dag rules?

Anyway, all of this has made me feel like we're better off for now going back to 5e04994 (not treating these as filter rules, but as separate metadata that happens to look pretty similar). Sorry for the wild goose chase. I'll try to take a swing at this.
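
Here's what I mean by ignoring bogus matches -- a sketch only, with made-up types standing in for the real Evaluation/DependencyEvaluation:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Evaluation {
    Bogus,
    Dag,
    IntraDeploymentUnit,
}

// A path may match several rules only if all but one are bogus; the bogus
// matches are then ignored in favor of the single real one.
fn resolve(matches: &[Evaluation]) -> Result<Option<Evaluation>, String> {
    let non_bogus: Vec<Evaluation> =
        matches.iter().copied().filter(|e| *e != Evaluation::Bogus).collect();
    match non_bogus.as_slice() {
        // Only bogus matches (or none at all): fall back to whatever matched.
        [] => Ok(matches.first().copied()),
        // Exactly one real match: drop the bogus ones.
        [only] => Ok(Some(*only)),
        // Two or more real matches are still an error.
        _ => Err(format!("path matched {} non-bogus rules", non_bogus.len())),
    }
}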

@davepacheco (Collaborator, Author):

Okay, I've implemented the proposed code changes, along with pulling in some improvements @sunshowers made before I wound this back. From my perspective, what's left here is reviewing the specific metadata. I'll take a look at that tomorrow.

@davepacheco marked this pull request as ready for review February 6, 2026 01:12

Development

Successfully merging this pull request may close these issues.

ls-apis missed the invalid dependency between sled agent and dpd
