TQ: Support sled expunge via trust quorum pathway #9765

Draft
andrewjstone wants to merge 11 commits into main from tq-expunge

andrewjstone commented Jan 31, 2026

This commit adds a three-phase mechanism for sled expungement.

The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to reboot the sled after polling for the commit of the configuration with the trust quorum removal. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before.
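
The polling step in the second phase can be sketched as below. This is only an illustration under stated assumptions: `ConfigState`, `PollError`, and `poll_until_committed` are hypothetical names, not the omdb or Nexus API, and a real caller would sleep between attempts and fetch the state from Nexus.

```rust
/// Hypothetical, simplified view of a trust quorum configuration's state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigState {
    Preparing,
    Committed,
    Aborted,
}

/// Reasons the poll can end without observing a commit (assumed for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum PollError {
    Aborted,
    GaveUp { attempts: usize },
}

/// Call `fetch_state` until it reports `Committed`, returning the number of
/// attempts used. Only after this succeeds would the sled be rebooted.
pub fn poll_until_committed<F>(
    mut fetch_state: F,
    max_attempts: usize,
) -> Result<usize, PollError>
where
    F: FnMut() -> ConfigState,
{
    for attempt in 1..=max_attempts {
        match fetch_state() {
            ConfigState::Committed => return Ok(attempt),
            ConfigState::Aborted => return Err(PollError::Aborted),
            // Still preparing: a real caller would sleep here before retrying.
            ConfigState::Preparing => {}
        }
    }
    Err(PollError::GaveUp { attempts: max_attempts })
}
```

Passing the state lookup as a closure keeps the loop independent of how the configuration is actually retrieved.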

The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that gates the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle and it is finally arriving!

The third phase is what informs reconfigurator that the sled is gone and remains the same except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged.
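
A minimal sketch of that sanity check follows. The `BaseboardId` fields mirror the omdb output in this description, but `ExpungeCheckError` and `check_sled_removed` are hypothetical names, and rejecting the expunge when no committed configuration exists is an assumption of this sketch rather than confirmed behavior.

```rust
use std::collections::BTreeSet;

/// Identifies a sled by its baseboard, as in the omdb output.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct BaseboardId {
    pub part_number: String,
    pub serial_number: String,
}

/// Hypothetical error type for the pre-expunge check.
#[derive(Debug, PartialEq, Eq)]
pub enum ExpungeCheckError {
    /// Assumption: expunge is refused if no configuration has committed.
    NoCommittedConfig,
    /// The sled is still a member of the last committed configuration.
    StillMember(BaseboardId),
}

/// Allow the expunge only if the last committed trust quorum configuration
/// exists and does not list `sled` as a member.
pub fn check_sled_removed(
    last_committed_members: Option<&BTreeSet<BaseboardId>>,
    sled: &BaseboardId,
) -> Result<(), ExpungeCheckError> {
    let members =
        last_committed_members.ok_or(ExpungeCheckError::NoCommittedConfig)?;
    if members.contains(sled) {
        Err(ExpungeCheckError::StillMember(sled.clone()))
    } else {
        Ok(())
    }
}
```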

The removed sled may be added back to this rack or another after being clean-slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2, and it worked.

This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it in until the key-rotation work from #9737 is merged in.

This also builds on #9741 and should merge after that PR.

I tested this out by first trying to abort and watching it fail because
there was no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I hadn't restarted the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in
prepare. Lastly, I successfully issued a new LRTQ upgrade after
restarting the sled-agents and watched it commit.
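
The abort behavior exercised in this test can be modeled roughly as follows. This is a sketch only: `TqState`, `AbortError`, and `abort_latest` are hypothetical names, and refusing to abort a committed or already-aborted configuration is an assumption that the transcripts in this description do not confirm.

```rust
/// Simplified configuration states, named after those in the omdb output.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TqState {
    PreparingLrtqUpgrade,
    Committed,
    Aborted { reason: String },
}

/// Hypothetical error type for an abort request.
#[derive(Debug, PartialEq, Eq)]
pub enum AbortError {
    /// No configuration exists for the rack (the 404 case).
    NotFound,
    /// Assumption: only a configuration still in prepare can be aborted.
    NotAbortable,
}

/// Apply an abort request to the latest configuration state, if any.
pub fn abort_latest(latest: Option<TqState>) -> Result<TqState, AbortError> {
    match latest {
        None => Err(AbortError::NotFound),
        Some(TqState::PreparingLrtqUpgrade) => Ok(TqState::Aborted {
            reason: "Aborted via API request".to_string(),
        }),
        Some(TqState::Committed) | Some(TqState::Aborted { .. }) => {
            Err(AbortError::NotAbortable)
        }
    }
}
```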

Here are the external API calls:

```
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
  "error_code": "Not Found",
  "message": "No trust quorum configuration exists for this rack",
  "request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜  oxide.rs git:(main) ✗
➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
  "members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
  "state": "aborted",
  "time_aborted": "2026-01-29T01:54:02.590683Z",
  "time_committed": null,
  "time_created": "2026-01-29T01:37:07.476451Z",
  "unacknowledged_members": [
    {
      "part_number": "913-0000019",
      "serial_number": "20000000"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000001"
    },
    {
      "part_number": "913-0000019",
      "serial_number": "20000003"
    }
  ],
  "version": 2
}
```

Here are the omdb calls:

```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade

Caused by:
    Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        2,
    ),
    last_committed_epoch: None,
    state: Aborted,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T01:37:07.476451Z,
    time_committing: None,
    time_committed: None,
    time_aborted: Some(
        2026-01-29T01:54:02.590683Z,
    ),
    abort_reason: Some(
        "Aborted via API request",
    ),
}

root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: PreparingLrtqUpgrade,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: None,
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Unacked,
            share_digest: None,
            time_prepared: None,
            time_committed: None,
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: None,
    time_committed: None,
    time_aborted: None,
    abort_reason: None,
}

root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
    rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
    epoch: Epoch(
        3,
    ),
    last_committed_epoch: None,
    state: Committed,
    threshold: Threshold(
        2,
    ),
    commit_crash_tolerance: 0,
    coordinator: BaseboardId {
        part_number: "913-0000019",
        serial_number: "20000000",
    },
    encrypted_rack_secrets: Some(
        EncryptedRackSecrets {
            salt: Salt(
                [
                    143,
                    198,
                    3,
                    63,
                    136,
                    48,
                    212,
                    180,
                    101,
                    106,
                    50,
                    2,
                    251,
                    84,
                    234,
                    25,
                    46,
                    39,
                    139,
                    46,
                    29,
                    99,
                    252,
                    166,
                    76,
                    146,
                    78,
                    238,
                    28,
                    146,
                    191,
                    126,
                ],
            ),
            data: [
                167,
                223,
                29,
                18,
                50,
                230,
                103,
                71,
                159,
                77,
                118,
                39,
                173,
                97,
                16,
                92,
                27,
                237,
                125,
                173,
                53,
                51,
                96,
                242,
                203,
                70,
                36,
                188,
                200,
                59,
                251,
                53,
                126,
                48,
                182,
                141,
                216,
                162,
                240,
                5,
                4,
                255,
                145,
                106,
                97,
                62,
                91,
                161,
                51,
                110,
                220,
                16,
                132,
                29,
                147,
                60,
            ],
        },
    ),
    members: {
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000000",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.792674Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000001",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
            ),
            time_prepared: Some(
                2026-01-29T02:20:47.236089Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:49.503179Z,
            ),
        },
        BaseboardId {
            part_number: "913-0000019",
            serial_number: "20000003",
        }: TrustQuorumMemberData {
            state: Committed,
            share_digest: Some(
                sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
            ),
            time_prepared: Some(
                2026-01-29T02:20:46.809779Z,
            ),
            time_committed: Some(
                2026-01-29T02:21:52.248351Z,
            ),
        },
    },
    time_created: 2026-01-29T02:20:03.848507Z,
    time_committing: Some(
        2026-01-29T02:20:47.597276Z,
    ),
    time_committed: Some(
        2026-01-29T02:21:52.263198Z,
    ),
    time_aborted: None,
    abort_reason: None,
}
```
After chatting with @davepacheco, I changed the authz checks in the
datastore to do lookups with Rack scope. This fixed the test bug, but is
only a shortcut. Trust quorum should have its own authz object and I'm
going to open an issue for that.

Additionally, for methods that already took an authorized connection, I
removed the unnecessary authz checks and opctx parameter.
@andrewjstone andrewjstone marked this pull request as draft January 31, 2026 00:20
@andrewjstone andrewjstone mentioned this pull request Jan 31, 2026
48 tasks
Base automatically changed from tq-abort to main February 3, 2026 00:49
@andrewjstone

Damn, looks like the changes to expunge are breaking existing tests. I'll either need to update those tests, update the expunge function, or move the check inside the expunge function into omdb.

@andrewjstone

> Damn, looks like the changes to expunge are breaking existing tests. I'll either need to update those tests, update the expunge function, or move the check inside the expunge function into omdb.

Fixed by properly inserting fake tq during RSS handoff.

andrewjstone commented Feb 5, 2026

We actually can't merge this until R18 is out the door since it relies on having trust quorum configurations in order to perform expunge. We'll have to merge main into it once #9737 merges so we can do more hardware testing on racklettes.
