Skip to content

feat: GPU interconnect multinode support (chain SDK pieces)#315

Open
Zblocker64 wants to merge 13 commits into
mainfrom
feat/rdma-multinode
Open

feat: GPU interconnect multinode support (chain SDK pieces)#315
Zblocker64 wants to merge 13 commits into
mainfrom
feat/rdma-multinode

Conversation

@Zblocker64

@Zblocker64 Zblocker64 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Chain-SDK side of the GPU interconnect feature. Tenants declare gpu.attributes.interconnect: true (fabric-agnostic); the provider picks IB vs RoCE.

What's in here

  • Inventory v1NodeResources.GPUInterconnect + NodeCapabilities.{InterconnectResourceName, InterconnectFabric, NCCLHCAPrefix} for provider-side capacity tracking.
  • SDL parser — accepts gpu.attributes.interconnect (bool) and gpu.attributes.interconnect_group (string); emits both into on-chain Resources.GPU.Attributes so the provider bid engine can read them.
  • ManifestService.interconnect_group for off-chain workload-builder routing; omitempty so pre-feature SDLs hash identically.
  • Cross-field validation — interconnect requires capabilities/gpu-interconnect=true placement; interconnect_group requires interconnect: true; no implicit/explicit mixing within one placement. Enforced by both the Go and TS parsers.
  • TS bindings regenerated to match.

Out of scope

No on-chain proto messages change. No validator upgrade required — gpu-interconnect attribute keys flow through the existing flexible attribute slice.

Linear

AKT-401, 402, 403, 405, 406. AKT-404 (storage.shared) cancelled per decision #3.

@Zblocker64 Zblocker64 requested a review from a team as a code owner June 2, 2026 19:16
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds GPU interconnect support across the full stack: protobuf schemas extend NodeCapabilities, NodeResources, and Service with interconnect fields; ResourcePair.Dup() and NodeCapabilities.Dup() gain nil/zero-value safety; SDL YAML parsing emits interconnect/interconnect_group on-chain attributes; three cross-field validation rules are enforced in Go and TypeScript; and InterconnectGroup is propagated into generated manifests.

Changes

GPU Interconnect Feature

Layer / File(s) Summary
Protobuf schema extensions
proto/provider/akash/inventory/v1/node.proto, proto/provider/akash/inventory/v1/resources.proto, proto/provider/akash/manifest/v2beta3/service.proto
NodeCapabilities gains interconnect_resource_name, interconnect_fabric, and nccl_hca_prefix fields; NodeResources gains gpu_interconnect (ResourcePair); Service gains interconnect_group string with JSON/YAML omitempty.
Inventory deep-copy safety
go/inventory/v1/resourcepair.go, go/inventory/v1/node.go, go/inventory/v1/resources.go, go/inventory/v1/resourcepair_zero_test.go, go/inventory/v1/node_test.go, go/inventory/v1/resources_test.go
ResourcePair adds IsZero() and updates Dup() to safely handle nil/zero receivers and preserve nil-pointer shape per field. NodeCapabilities.Dup() copies three new interconnect string fields. NodeResources.Dup() copies GPUInterconnect. Tests cover zero, nil-receiver, populated round-trip, nil-pointer preservation, and zero-GPUInterconnect panic scenarios.
Deployment attribute preservation tests
go/node/deployment/v1beta4/interconnect_commit_audit_test.go
Regression tests verify interconnect attributes in GroupSpec.Requirements.Attributes and resource GPU.Attributes survive Dup(), remain accessible across four concrete ResourceGroup shapes, and are preserved through GetResourceUnits iteration.
SDL GPU parsing: interconnect attributes
go/sdl/gpu.go, go/sdl/sdl-input.schema.yaml, go/sdl/interconnect_gpu_test.go
Exports GPUAttributeInterconnect and GPUAttributeInterconnectGroup constants and adds v2ResourceGPU.InterconnectGroup field. v2GPUAttributes.UnmarshalYAML parses interconnect (bool) and interconnect_group (string) and emits sorted on-chain attributes. v2ResourceGPU.UnmarshalYAML lifts interconnect_group, defers validation, and rejects RDMA attributes when units are zero. Schema adds both new GPU attribute fields.
Manifest building: InterconnectGroup propagation
go/sdl/groupBuilder_v2.go, go/sdl/groupBuilder_v2_1.go, ts/src/sdl/manifest/generateManifest.ts
Go SDL group builders (v2 and v2_1) nil-guard and copy GPU.InterconnectGroup into manifest.Service.InterconnectGroup. TypeScript buildManifestService reads interconnect_group from GPU attributes and sets interconnectGroup in the protobuf payload.
Manifest JSON serialization
ts/src/sdl/manifest/generateManifestVersion.ts, ts/src/sdl/manifest/manifestUtils.ts
OMITTED_WHEN_EMPTY_STRING_KEYS is added and manifestReplacer omits interconnectGroup when empty string to keep the SHA-256 manifest hash stable. transformGpuAttributes is refactored to an imperative builder that emits interconnect/interconnect_group attributes and returns the full sorted result.
SDL cross-field interconnect validation
go/sdl/v2.go, go/sdl/v2_1.go, go/sdl/interconnect_validation_test.go, ts/src/sdl/validateSDL/validateSDL.ts, ts/src/sdl/validateSDL/validateSDL.spec.ts
validateInterconnect enforces three invariants in both Go (v2 and v2_1) and TypeScript: interconnect_group implies interconnect=true, interconnect=true profiles require capabilities/gpu-interconnect=true on the placement, and no per-placement mixing of explicit and implicit interconnect_group. Tests cover all rules in both languages.
Test fixtures
testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml, testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/...
New v2.0 SDL input fixture defines two GPU services sharing interconnect_group: pair0 under an Infiniband-capable placement. Corresponding group-specs.json and manifest.json fixtures capture the expected parsed and rendered outputs.

Sequence Diagram(s)

sequenceDiagram
  participant SDL
  participant GPUAttrs
  participant GPUResource
  participant Validator
  participant GroupBuilder
  participant Manifest
  SDL->>GPUAttrs: parse interconnect fields
  GPUAttrs-->>GPUResource: return sorted attributes
  GPUResource->>Validator: submit parsed gpu data
  Validator-->>GroupBuilder: validation passes
  GroupBuilder->>Manifest: copy interconnect group
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the code I go,
New interconnect fields all in a row!
The GPU clusters link paw-in-paw,
Validated rules without a flaw.
ResourcePair won't panic on nil—
This bunny's commits fit the bill! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning PR description lacks structured sections required by the template: missing Purpose of Change checklist, Related Issues, and Checklist sections. Add missing template sections: Purpose of Change (with checkboxes), Related Issues (AKT-401 through AKT-406), Checklist items, and Notes for Reviewers to complete the description.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding GPU interconnect multinode support to the chain SDK, matching the primary theme across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/rdma-multinode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resources.go`:
- Line 11: NodeResources.Dup() currently calls s.RDMA.Dup() unconditionally
which panics for zero-value RDMA; fix by making ResourcePair.Dup() zero-safe: in
ResourcePair.Dup() detect the zero/uninitialized receiver and return a
zero-value ResourcePair (or a safe initialized copy) instead of dereferencing
nil/internal fields, or alternatively change NodeResources.Dup() to guard the
call (e.g., if s.RDMA.IsZero() { copy.RDMA = ResourcePair{} } else { copy.RDMA =
s.RDMA.Dup() }); update ResourcePair.Dup() and/or add an IsZero helper so RDMA
is safe to duplicate without panics.

In `@go/sdl/v2.go`:
- Around line 561-566: The RDMA validation currently triggers based solely on
gpu.Attributes/rdma_group even for profiles with zero GPUs; update the logic in
the sdl.Profiles.Compute loop (the compute.Resources.GPU handling) to only
consider RDMA attributes or RDMAGroup when the GPU actually has Units > 0, or
alternatively reject rdma/rdma_group when gpu.Units == 0 in
v2ResourceGPU.UnmarshalYAML; specifically change the condition that calls
gpuAttributesHaveRDMA and checks gpu.RDMAGroup to also require gpu.Units > 0 (or
add a guard earlier to return an error if rdma/rdma_group is present while
gpu.Units == 0) so zero-unit GPU profiles do not appear RDMA-enabled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7b5b662f-91e0-4b06-b1ef-7a719b1a350d

📥 Commits

Reviewing files that changed from the base of the PR and between bf983bc and 5ce7643.

⛔ Files ignored due to path filters (3)
  • go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
  • go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (14)
  • go/inventory/v1/node.go
  • go/inventory/v1/node_test.go
  • go/inventory/v1/resources.go
  • go/inventory/v1/resources_test.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/sdl/gpu.go
  • go/sdl/groupBuilder_v2.go
  • go/sdl/groupBuilder_v2_1.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/rdma_validation_test.go
  • go/sdl/v2.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • proto/provider/akash/manifest/v2beta3/service.proto

Comment thread go/inventory/v1/resources.go Outdated
Comment thread go/sdl/v2.go Outdated
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from 6024641 to 5252ecc Compare June 2, 2026 19:31
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from c0301cb to 4c19e6b Compare June 2, 2026 20:03
@github-actions github-actions Bot added the C:ts label Jun 2, 2026
Implements the akash-chain-sdk slice of the RDMA spec at
provider/_docs/infiniband-implementation-spec.md, covering Linear tickets
AKT-401..AKT-403, AKT-405, AKT-406 (CS-1, CS-2, CS-3, CS-5, CS-6). The
shared-storage track (CS-4) is intentionally dropped: per spec decision #3
every service gets its own RWO PVC and the provider does not produce
ReadWriteMany volumes.

Everything in this commit is off-chain — no on-chain proto messages change,
no validator upgrade required.

CS-1 — Inventory v1
       (proto/provider/akash/inventory/v1/{node,resources}.proto +
        go/inventory/v1/{node,resources}.pb.go regenerated via `make proto-gen-go`):
  * Add `ResourcePair rdma = 7` to `NodeResources`.
  * Add `rdma_resource_name`, `rdma_fabric`, `nccl_hca_prefix` strings to
    `NodeCapabilities` with gogoproto.customname annotations
    (`RDMAResourceName`, `RDMAFabric`, `NCCLHCAPrefix`).
  * Extend `Dup()` helpers in node.go and resources.go.
  * Tests: `node_test.go`, `resources_test.go` round-trip the new fields
    through Dup().

CS-2 — SDL parser: `gpu.attributes.rdma: true` (go/sdl/gpu.go):
  * Accept `rdma: true|false` under `gpu.attributes`. When true, emit a
    flat on-chain GPU attribute `rdma=true` so providers that advertise
    `capabilities/gpu/rdma=true` match.
  * Tests: `rdma_gpu_test.go::TestV2ResourceGPU_RDMAFlag` plus the existing
    `TestV2ResourceGPU` regression continues to pass.

CS-3 — SDL parser: `gpu.attributes.rdma_group: <name>` →
       off-chain manifest field
       (go/sdl/gpu.go,
        proto/provider/akash/manifest/v2beta3/service.proto regenerated to
        go/manifest/v2beta3/service.pb.go,
        go/sdl/groupBuilder_v2{,_1}.go):
  * Parser captures rdma_group via an internal sentinel attribute key
    (`__rdma_group__`) inside `v2GPUAttributes.UnmarshalYAML`. The parent
    `v2ResourceGPU.UnmarshalYAML` strips the sentinel before the slice ever
    reaches on-chain `Resources.GPU.attributes` and surfaces the value on
    a dedicated `v2ResourceGPU.RDMAGroup` field. `Validate()` is deferred
    to the parent so the sentinel can be removed before the
    attribute-key regex runs.
  * Manifest `Service.proto` gains `rdma_group = 11` with
    `gogoproto.customname = "RDMAGroup"`. Bindings regenerated via
    `make proto-gen-go`.
  * Both v2 / v2.1 group builders now read `compute.Resources.GPU.RDMAGroup`
    and propagate it onto `manifest.Service.RDMAGroup`.
  * Tests: `TestV2ResourceGPU_RDMAGroupRoutedOffChain` (asserts the
    sentinel never escapes) and `TestV2ResourceGPU_RDMAGroupOmitted`.

CS-5 — Parser cross-field validations
       (go/sdl/v2.go, validate() → new validateRDMA()):
  1. Any compute profile with `gpu.attributes.rdma: true` requires its
     placement attributes to include `capabilities/rdma=true`.
  2. Any compute profile with `gpu.attributes.rdma_group` set must also
     declare `gpu.attributes.rdma: true` on the same profile.
  3. Within one placement, no implicit-default-plus-explicit mixing: if
     any profile sets rdma_group, every RDMA-using profile must.
  Helpers `gpuAttributesHaveRDMA` and `placementRequiresRDMA` are kept
  package-local so the SDL parser owns the policy.
  * Tests: `rdma_validation_test.go` (6 positive + negative fixtures).

CS-6 — Reservation commit path audit
       (go/node/deployment/v1beta4/rdma_commit_audit_test.go):
  Table-driven regression test pinning down that `GroupSpec.Dup()` and the
  four concrete `ResourceGroup`-shaped values the provider's reservation
  path can hold (`*Group`, `Group`, `*GroupSpec`, `GroupSpec`) all preserve
  `Requirements.Attributes` (carrying `capabilities/rdma=true`) AND each
  resource's `GPU.Attributes` (carrying `rdma=true`). A future change that
  silently drops either slice — the exact failure mode the spec calls
  out — fails this test loudly.

All `.pb.go` files were regenerated via `make proto-gen-go` (buf v1.47.2,
protoc v29.1, gogoproto v1.7.2). Running `make proto-gen-go` against this
tree should be a no-op.

Tests (go test ./...):
  - pkg.akt.dev/go/inventory/v1                  PASS
  - pkg.akt.dev/go/manifest/v2beta3              PASS
  - pkg.akt.dev/go/node/deployment/v1beta4       PASS (+ new CS-6 audit)
  - pkg.akt.dev/go/sdl                           PASS (+ new CS-2/CS-3/CS-5)

Follow-ups for reviewers:
  - TypeScript bindings (`ts/`) are not touched here and need a separate
    `make proto-gen-ts` pass before the TS SDK consumes the new fields.

Linear: AKT-401, AKT-402, AKT-403, AKT-405, AKT-406
        (AKT-404 / CS-4 cancelled — see spec decision #3)

Fix make/setup-cache.mk: defer GOLANGCI_LINT_MAJOR computation to
recipe-execute time and depend on $(SEMVER), so the install isn't
given an empty major and the broken module path
.../golangci-lint/v/cmd/golangci-lint. This bug had `lint/go` red on
main for several commits prior.
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch from 4c19e6b to 5745d21 Compare June 2, 2026 20:14

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
go/inventory/v1/resourcepair_zero_test.go (1)

38-49: ⚡ Quick win

Add a regression test for partial-nil ResourcePair duplication.

Please add a case where one of Capacity/Allocatable/Allocated is nil and verify Dup() keeps that field nil after copy (instead of converting it to zero). This will prevent future contract regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/inventory/v1/resourcepair_zero_test.go` around lines 38 - 49, Add a
regression test that ensures ResourcePair.Dup() preserves nil fields: in
resourcepair_zero_test.go (e.g., extend or add a test alongside
TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resourcepair.go`:
- Around line 70-85: The Dup implementation in resourcepair.go currently always
returns non-nil pointers (&capacity etc.), losing nil presence; change it so
Capacity, Allocatable and Allocated are set to nil when the source
m.Capacity/m.Allocatable/m.Allocated are nil, and only allocate and assign a
DeepCopy pointer when the corresponding source field is non-nil (e.g., check
m.Capacity != nil then copy and set Capacity to that pointer, otherwise set
Capacity to nil); keep Attributes copied via m.Attributes.Dup() as is.

---

Nitpick comments:
In `@go/inventory/v1/resourcepair_zero_test.go`:
- Around line 38-49: Add a regression test that ensures ResourcePair.Dup()
preserves nil fields: in resourcepair_zero_test.go (e.g., extend or add a test
alongside TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89166079-2d14-4230-b011-b2b98830a3cb

📥 Commits

Reviewing files that changed from the base of the PR and between 5ce7643 and 060c02f.

⛔ Files ignored due to path filters (7)
  • go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
  • go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
  • ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
  • ts/src/sdl/manifest/__snapshots__/generateManifestVersion.spec.ts.snap is excluded by !**/*.snap
📒 Files selected for processing (29)
  • go/inventory/v1/node.go
  • go/inventory/v1/node_test.go
  • go/inventory/v1/resourcepair.go
  • go/inventory/v1/resourcepair_zero_test.go
  • go/inventory/v1/resources.go
  • go/inventory/v1/resources_test.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/sdl/gpu.go
  • go/sdl/groupBuilder_v2.go
  • go/sdl/groupBuilder_v2_1.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/rdma_validation_test.go
  • go/sdl/v2.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • proto/provider/akash/manifest/v2beta3/service.proto
  • testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
  • testdata/sdl/output-fixtures/v2.0/http-options/manifest.json
  • testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.0/multiple-services/manifest.json
  • testdata/sdl/output-fixtures/v2.0/persistent-storage/manifest.json
  • testdata/sdl/output-fixtures/v2.0/placement/manifest.json
  • testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
  • testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
  • testdata/sdl/output-fixtures/v2.0/simple/manifest.json
  • testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
  • testdata/sdl/output-fixtures/v2.1/credentials/manifest.json
  • testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
✅ Files skipped from review due to trivial changes (10)
  • testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
  • go/inventory/v1/resources.go
  • testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
  • testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
  • testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
  • testdata/sdl/output-fixtures/v2.0/placement/manifest.json
  • testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.0/simple/manifest.json
  • testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
🚧 Files skipped from review as they are similar to previous changes (12)
  • go/sdl/groupBuilder_v2_1.go
  • go/inventory/v1/node.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/manifest/v2beta3/service.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • go/inventory/v1/resources_test.go
  • go/sdl/groupBuilder_v2.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/inventory/v1/node_test.go
  • go/sdl/v2.go
  • go/sdl/gpu.go
  • go/sdl/rdma_validation_test.go

Comment thread go/inventory/v1/resourcepair.go Outdated
Comment thread go/sdl/groupBuilder_v2_1.go Outdated
Comment thread proto/provider/akash/manifest/v2beta3/service.proto Outdated
Comment thread go/sdl/gpu.go Outdated
Comment thread ts/src/generated/protos/akash/manifest/v2beta3/service.ts Outdated
Six review items from chalabi2 + CodeRabbit:

- proto/service.proto + regenerated .pb.go/.ts: add omitempty to
  Service.RDMAGroup JSON/YAML tags. The on-chain manifest version hash
  is a SHA over the JSON-serialized off-chain manifest; without
  omitempty every pre-RDMA service serialized "rdmaGroup": "" and
  shifted the hash for existing leases, breaking send-manifest
  validation. Fixtures regenerated — diff vs main is zero on every
  non-RDMA testdata manifest, confirming hash stability.

- sdl/v2.go + v2_1.go: extract validateRDMA() into a free function
  taking (profiles, deployments) and call it from both v2 and v2.1
  validate(). v2.1 inherits the full RDMA SDL grammar from v2 (parser
  promotes gpu.attributes.rdma + rdma_group through to the manifest),
  so the cross-field invariants (rule 1: rdma-required workload reaches
  rdma-capable provider; rule 2: rdma_group ⇒ rdma=true; rule 3: no
  implicit/explicit group mixing within one deployment) must apply
  symmetrically. Without this, invalid v2.1 SDLs bypass the new rules.

- sdl/sdl-input.schema.yaml: declare rdma (boolean) and rdma_group
  (string) under gpu.attributes. The parser accepts them but the schema
  had additionalProperties: false and only allowed `vendor`, rejecting
  valid RDMA SDLs at schema-validation time.

- inventory/v1/resourcepair.go: ResourcePair.Dup() now preserves the
  nil/non-nil shape of Capacity/Allocatable/Allocated rather than
  always returning &zeroQuantity. Returning non-nil pointers for
  originally-nil fields changes protobuf field-presence semantics and
  shifts the JSON serialization (which feeds the manifest hash).
  Regression-pinned by TestResourcePair_Dup_PreservesNilPointers.

- sdl/v2.go validateRDMA: defense-in-depth gate on gpu.Units > 0.
  The parser already rejects rdma/rdma_group on zero-GPU profiles, but
  the validator should not classify a zero-GPU profile as RDMA-enabled
  if that parser path is ever bypassed.

The TS proto-regen as part of #6 incidentally addresses the TS bindings
gap — ts/src/generated/protos/akash/manifest/v2beta3/service.ts now
parses and emits rdma_group with the same omitempty semantics.
CI sdl-parity job failed on the prior commit because the Go fixtures
(with omitempty) drop rdmaGroup for non-RDMA services, while TS's
manifestReplacer kept emitting "rdmaGroup":"". Mismatch on every v2.0
fixture.

manifestReplacer already had OMITTED_MANIFEST_KEYS for empty arrays /
zero numbers. Adding a parallel OMITTED_WHEN_EMPTY_STRING_KEYS set so
fields that use Go's `omitempty` semantics for string values can be
declared the same way on the TS side. rdmaGroup is the only entry for
now; future "omitempty string" fields just add their key here.

Updated the generateManifestVersion snapshot to match the corrected
serialization (no more "rdmaGroup":"" leaking into the hash input).
Prep for AKT-443 (provider bid-engine group-aware Adjust). The provider
needs to enforce per-rdma_group node separation at fit time — today the
workload builder's hostname pod anti-affinity is the only gate, which
fires after the bid has been accepted. The chain-SDK previously stripped
rdma_group from the GPU attribute slice and surfaced it only on the
off-chain Service.RDMAGroup field, so the bid engine had no signal.

Change: emit `rdma_group=<value>` as a regular on-chain GPU attribute
alongside `rdma=true`, while still lifting the value into
v2ResourceGPU.RDMAGroup → Service.RDMAGroup for the workload builder.
Both consumers see the same value via their respective paths.

- go/sdl/gpu.go: drop the gpuAttributeRDMAGroupSentinel layer. Emit
  rdma_group directly (the key matches the attribute key regex; no
  underscore prefix needed). Lift-but-keep instead of lift-and-strip.
- go/sdl/rdma_gpu_test.go: update assertions — the on-chain slice now
  CONTAINS rdma_group when set, AND v2ResourceGPU.RDMAGroup holds the
  same value. New name reflects the end-to-end flow.
- ts/src/sdl/manifest/manifestUtils.ts: transformGpuAttributes now
  emits both rdma and rdma_group keys, and sorts the result to match
  Go's sort.Sort(res) byte ordering.
- ts/src/sdl/manifest/generateManifest.ts: pass rdmaGroup from
  compute.resources.gpu.attributes.rdma_group into Service.fromPartial
  so it surfaces on the off-chain Service.rdmaGroup field.
- ts/src/sdl/validateSDL/validateSDLInput.ts: regenerated from the
  updated input schema (already had rdma + rdma_group properties from
  the prior commit).
- testdata/sdl/input/v2.0/gpu-rdma-group/: new fixture exercising the
  end-to-end path. Each service uses its own profile to sidestep an
  orthogonal Go-vs-TS resource-ID-assignment parity gap.

Rollout note: providers running pre-skip chain-sdk versions will reject
orders carrying the new rdma_group attribute (ParseGPUAttributes returns
"invalid GPU attribute"). The provider PR already lands the skip; any
provider wanting to bid on RDMA orders must upgrade past it.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
go/sdl/gpu.go (1)

156-160: ⚡ Quick win

Update stale sentinel comments to match current rdma_group flow.

Line 156 and Line 209 still describe a sentinel that gets stripped, but the code now keeps rdma_group in on-chain GPU attributes. Please align comments with actual behavior to avoid future mis-implementation.

Suggested comment-only diff
-			// gpu.attributes.rdma_group: string (peer group name). Captured
-			// here and emitted into the slice as a sentinel attribute that
-			// v2ResourceGPU.UnmarshalYAML strips before it reaches chain
-			// state. See gpuAttributeRDMAGroupSentinel.
+			// gpu.attributes.rdma_group: string (peer group name). Captured
+			// here and emitted directly as an on-chain GPU attribute
+			// (GPUAttributeRDMAGroup). v2ResourceGPU.UnmarshalYAML also lifts
+			// the same value into v2ResourceGPU.RDMAGroup for manifest routing.

-	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
-	// rdma_group sentinel can be stripped from the slice before the
-	// attribute-key regex runs against it.
+	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
+	// final assembled attribute slice is validated once in one place.

Also applies to: 209-211

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/sdl/gpu.go` around lines 156 - 160, The comments near the GPU YAML decode
(around the node.Content[i+1].Decode(&rdmaGroup) block) and the comment
referencing gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are
stale: they state rdma_group is emitted as a sentinel that
v2ResourceGPU.UnmarshalYAML strips, but the code now preserves rdma_group in
on-chain GPU attributes. Update those comments to reflect the current flow
(rdma_group is decoded and retained in the GPU attributes on-chain) and remove
or reword any mention of a sentinel being stripped; keep references to
gpuAttributeRDMAGroupSentinel and v2ResourceGPU.UnmarshalYAML only if describing
their actual behavior in the current implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts/src/sdl/manifest/manifestUtils.ts`:
- Around line 78-87: The code currently pushes attributes.rdma_group into the
on-chain GPU attributes (via the result.push({ key: "rdma_group", ... }) call),
but rdma_group must remain off-chain; remove the block that emits rdma_group
into Resources.GPU.Attributes in manifestUtils (i.e., delete or disable the if
(attributes.rdma_group && attributes.rdma_group.length > 0) result.push(...)
branch) and keep only the rdma boolean emission (the if (attributes.rdma ===
true) result.push({ key: "rdma", value: "true" }) branch); ensure any callers or
type definitions still treat attributes.rdma_group as an off-chain-only field
and do not rely on it being serialized into the result array.

---

Nitpick comments:
In `@go/sdl/gpu.go`:
- Around line 156-160: The comments near the GPU YAML decode (around the
node.Content[i+1].Decode(&rdmaGroup) block) and the comment referencing
gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are stale: they
state rdma_group is emitted as a sentinel that v2ResourceGPU.UnmarshalYAML
strips, but the code now preserves rdma_group in on-chain GPU attributes. Update
those comments to reflect the current flow (rdma_group is decoded and retained
in the GPU attributes on-chain) and remove or reword any mention of a sentinel
being stripped; keep references to gpuAttributeRDMAGroupSentinel and
v2ResourceGPU.UnmarshalYAML only if describing their actual behavior in the
current implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a3e93caa-5b29-4589-9819-2890ceac7099

📥 Commits

Reviewing files that changed from the base of the PR and between 060c02f and c1bbce4.

⛔ Files ignored due to path filters (2)
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
  • ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
📒 Files selected for processing (15)
  • go/inventory/v1/resourcepair.go
  • go/inventory/v1/resourcepair_zero_test.go
  • go/sdl/gpu.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/sdl-input.schema.yaml
  • go/sdl/v2.go
  • go/sdl/v2_1.go
  • proto/provider/akash/manifest/v2beta3/service.proto
  • testdata/sdl/input/v2.0/gpu-rdma-group/input.yaml
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/manifest.json
  • ts/src/sdl/manifest/generateManifest.ts
  • ts/src/sdl/manifest/generateManifestVersion.ts
  • ts/src/sdl/manifest/manifestUtils.ts
  • ts/src/sdl/validateSDL/validateSDLInput.ts
✅ Files skipped from review due to trivial changes (1)
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • go/inventory/v1/resourcepair.go
  • go/sdl/v2.go

Comment thread ts/src/sdl/manifest/manifestUtils.ts Outdated
TS validateSDL had no RDMA semantic checks; tenants using the TS SDK
could broadcast SDLs that the Go parser would reject outright (chain
doesn't validate SDL semantics, so the failure would surface later as
broken pods).

Adds a #validateRDMA method on SDLValidator, called from validate()
after the per-service loop. Mirrors the Go-side validateRDMA in
go/sdl/v2.go and enforces the same three cross-field rules:

  1. A profile with gpu.attributes.rdma=true must be deployed under a
     placement whose attributes require capabilities/rdma=true.
  2. A profile with gpu.attributes.rdma_group set must also have
     gpu.attributes.rdma=true on the same profile.
  3. Within one deployment (placement), if any profile sets
     rdma_group, every rdma=true profile under that placement must
     also set rdma_group — no implicit-default-plus-explicit mixing.

The units==0 + rdma/rdma_group case is already rejected by the
schema-level gpuAttributesRequireUnitsGt0 rule (any attribute requires
units > 0), so no semantic check needed there. Pinned by two new tests
that assert the schema path, so a future schema relaxation can't
silently reopen the hole.

Five new tests cover the rejection paths plus a happy path. Existing
SDL parity tests stay at 34/34.
chalabi2
chalabi2 previously approved these changes Jun 9, 2026
Spec rework: tenants now request GPU interconnect as a single boolean
`gpu.attributes.interconnect: true`; the provider picks IB vs RoCE.
Removes fabric pin from the SDL surface; informational fabric value
still flows internally via NodeCapabilities.InterconnectFabric so
providers can advertise it for monitoring.

Full top-to-bottom rename:

Proto:
- akash.inventory.v1.NodeCapabilities.rdma_resource_name -> interconnect_resource_name
- akash.inventory.v1.NodeCapabilities.rdma_fabric -> interconnect_fabric
- akash.inventory.v1.NodeResources.rdma -> gpu_interconnect (customname: GPUInterconnect)
- akash.manifest.v2beta3.Service.rdma_group -> interconnect_group
  (proto field number 11 preserved; jsontag interconnectGroup,omitempty)

SDL (Go):
- gpu.attributes.rdma -> gpu.attributes.interconnect
- gpu.attributes.rdma_group -> gpu.attributes.interconnect_group
- capabilities/rdma placement key -> capabilities/gpu-interconnect
- GPUAttributeRDMA -> GPUAttributeInterconnect (const)
- GPUAttributeRDMAGroup -> GPUAttributeInterconnectGroup (const)
- v2ResourceGPU.RDMAGroup -> InterconnectGroup
- validateRDMA -> validateInterconnect
- gpuAttributesHaveRDMA -> gpuAttributesHaveInterconnect
- placementRequiresRDMA -> placementRequiresInterconnect
- groupBuilder_v2 + v2_1 propagate Service.InterconnectGroup

Inventory v1 (Go):
- NodeCapabilities.RDMAResourceName -> InterconnectResourceName
- NodeCapabilities.RDMAFabric -> InterconnectFabric
- NodeResources.RDMA -> GPUInterconnect

TS:
- transformGpuAttributes emits interconnect + interconnect_group keys
- generateManifest sets Service.interconnectGroup from
  compute.resources.gpu.attributes.interconnect_group
- generateManifestVersion OMITTED_WHEN_EMPTY_STRING_KEYS = {interconnectGroup}
- validateSDL.#validateInterconnect mirrors Go's rule 1/2/3 with the new keys
- sdl-input.schema.yaml declares interconnect (boolean) +
  interconnect_group (string); validateSDLInput.ts regenerated

Tests + fixtures:
- testdata/sdl/input/v2.0/gpu-rdma-group -> gpu-interconnect-group
- Renamed test files: rdma_gpu_test -> interconnect_gpu_test;
  rdma_validation_test -> interconnect_validation_test;
  rdma_commit_audit_test -> interconnect_commit_audit_test
- All identifier renames inside test bodies

Unavoidable external references kept:
- /sys/class/infiniband and rdma/rdma_shared_device_* — kernel sysfs +
  Mellanox/NVIDIA device-plugin resource names
- NCCL_IB_DISABLE/NCCL_IB_HCA env vars — NCCL upstream
- "infiniband" / "roce" fabric value strings — kernel link_layer values

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts/src/sdl/validateSDL/validateSDL.spec.ts`:
- Line 2270: Fix the spacing and indentation issues in the interconnect test
blocks that are causing TypeScript lint check failures. In
ts/src/sdl/validateSDL/validateSDL.spec.ts at lines 2270, 2280, 2326, and 2341,
review the `interconnectSetup()` function calls and the surrounding test code
for any extra spaces, misaligned indentation, or formatting inconsistencies, and
correct them to comply with the project's linting rules. Ensure consistent
spacing around parameters and proper indentation alignment throughout these test
cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 552b0ea7-ad28-4393-b001-95ba74712030

📥 Commits

Reviewing files that changed from the base of the PR and between 5bf26e0 and ffed18a.

⛔ Files ignored due to path filters (6)
  • go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
  • go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
  • ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
📒 Files selected for processing (27)
  • go/inventory/v1/node.go
  • go/inventory/v1/node_test.go
  • go/inventory/v1/resourcepair.go
  • go/inventory/v1/resourcepair_zero_test.go
  • go/inventory/v1/resources.go
  • go/inventory/v1/resources_test.go
  • go/node/deployment/v1beta4/interconnect_commit_audit_test.go
  • go/sdl/gpu.go
  • go/sdl/groupBuilder_v2.go
  • go/sdl/groupBuilder_v2_1.go
  • go/sdl/interconnect_gpu_test.go
  • go/sdl/interconnect_validation_test.go
  • go/sdl/sdl-input.schema.yaml
  • go/sdl/v2.go
  • go/sdl/v2_1.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • proto/provider/akash/manifest/v2beta3/service.proto
  • testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml
  • testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/group-specs.json
  • testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/manifest.json
  • ts/src/sdl/manifest/generateManifest.ts
  • ts/src/sdl/manifest/generateManifestVersion.ts
  • ts/src/sdl/manifest/manifestUtils.ts
  • ts/src/sdl/validateSDL/validateSDL.spec.ts
  • ts/src/sdl/validateSDL/validateSDL.ts
  • ts/src/sdl/validateSDL/validateSDLInput.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • go/inventory/v1/resourcepair.go

Comment thread ts/src/sdl/validateSDL/validateSDL.spec.ts Outdated
CodeRabbit lint follow-up on the rename pass — sed left a double space
after `{` in two interconnectSetup() calls and a stray leading space on
two `interconnect: true` lines inside the rule-3 fixture. Cosmetic only;
89/89 unit tests + 34/34 SDL parity tests still pass.
@Zblocker64 Zblocker64 changed the title feat: RDMA multinode GPU support (chain SDK pieces) feat: GPU interconnect multinode GPU support (chain SDK pieces) Jun 15, 2026
@Zblocker64 Zblocker64 changed the title feat: GPU interconnect multinode GPU support (chain SDK pieces) feat: GPU interconnect multinode support (chain SDK pieces) Jun 15, 2026
Comment thread proto/provider/akash/inventory/v1/node.proto
Comment thread testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml Outdated
Comment thread go/inventory/v1/node.go Outdated
Refactors the rc4 interconnect SDL surface from the flat
`interconnect: true` + `interconnect_group: <name>` pair to a single
nested key. Tenants write one of two forms under
`gpu.attributes.interconnect`:

  interconnect: []                  # implicit, parser assigns "auto"
  interconnect: { group: <name> }   # explicit named group

Bare boolean `interconnect: true` is retired; every opt-in now carries a
group label (a 1-member group is fine — anti-affinity becomes a no-op).
The on-chain wire format collapses to a single key
`interconnect/group = <name>` (slash-path matching the existing
`capabilities/gpu-interconnect/...` convention); the standalone
`interconnect = "true"` marker is dropped.

Also generalizes `nccl_hca_prefix: string` to
`nccl_hca_prefixes: repeated string` so mixed-vendor hosts (e.g.
mlx5 + bnxt_re on one node) can publish every HCA family; the workload
builder joins with commas for NCCL_IB_HCA which NCCL accepts natively.

Validator rules (Go + TS):
 1. Any opt-in requires placement attribute
    capabilities/gpu-interconnect=true.
 2. The name "auto" is reserved (parser only).
 3. Within one placement, no mixing of implicit and explicit forms.
 4. gpu.units==0 with any opt-in rejected (schema-level).

Spec: docs/sdl-interconnect-spec.md.

Breaking change vs rc4 — acceptable, no production tenants yet.
Cuts go/v0.2.13-rc5 + go/sdl/v0.2.2-rc5.

Linear: AKT-492
Autofix from `eslint --fix` — the AKT-492 commit added the
`resolveInterconnectGroup` import out of alphabetical order in the
manifestUtils.ts destructured import block. CI runs `eslint src`
(scoped) and this was the only error in scope.

No behavior change.
# Conflicts:
#	go/manifest/v2beta3/service.pb.go
#	ts/src/sdl/validateSDL/validateSDLInput.ts
Comment thread go/sdl/gpu.go Outdated
Zblocker64 added a commit that referenced this pull request Jun 19, 2026
Field was uppercase (InterconnectGroup) only by convention with sibling
fields, but the rest of the API design already made it package-internal
— the struct is unexported (v2ResourceGPU), and yaml:"-" json:"-"
tags excluded it from every serialization path.

Per Artur's review (chain-sdk PR #315): lowercase is the more honest
choice. All four call sites (gpu.UnmarshalYAML, v2.validateInterconnect,
groupBuilder_v2.go, groupBuilder_v2_1.go) are inside the sdl package
already, so the rename is internal-only with no API impact.

The constants (GPUAttributeInterconnectGroup, InterconnectGroupAuto)
and the off-chain Service.InterconnectGroup field stay uppercase —
they're exported by intent.

Tests still pass (sdl 91/91, parity 34/34).
Field was uppercase (InterconnectGroup) only by convention with sibling
fields, but the rest of the API design already made it package-internal
— the struct is unexported (v2ResourceGPU), and yaml:"-" json:"-"
tags excluded it from every serialization path.

Per Artur's review (chain-sdk PR #315): lowercase is the more honest
choice. All four call sites (gpu.UnmarshalYAML, v2.validateInterconnect,
groupBuilder_v2.go, groupBuilder_v2_1.go) are inside the sdl package
already, so the rename is internal-only with no API impact.

The constants (GPUAttributeInterconnectGroup, InterconnectGroupAuto)
and the off-chain Service.InterconnectGroup field stay uppercase —
they're exported by intent.

Tests still pass (sdl 91/91, parity 34/34).
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch from c84cb72 to 1d633ac Compare June 19, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants