feat: GPU interconnect multinode support (chain SDK pieces) by Zblocker64 · Pull Request #315 · akash-network/chain-sdk

Zblocker64 · 2026-06-02T19:16:50Z

Chain-SDK side of the GPU interconnect feature. Tenants declare gpu.attributes.interconnect: true (fabric-agnostic); the provider picks IB vs RoCE.

What's in here

Inventory v1 — NodeResources.GPUInterconnect + NodeCapabilities.{InterconnectResourceName, InterconnectFabric, NCCLHCAPrefix} for provider-side capacity tracking.
SDL parser — accepts gpu.attributes.interconnect (bool) and gpu.attributes.interconnect_group (string); emits both into on-chain Resources.GPU.Attributes so the provider bid engine can read them.
Manifest — Service.interconnect_group for off-chain workload-builder routing; omitempty so pre-feature SDLs hash identically.
Cross-field validation — interconnect requires capabilities/gpu-interconnect=true placement; interconnect_group requires interconnect: true; no implicit/explicit mixing within one placement. Enforced by both the Go and TS parsers.
TS bindings regenerated to match.

Out of scope

No on-chain proto messages change. No validator upgrade required — gpu-interconnect attribute keys flow through the existing flexible attribute slice.

Linear

AKT-401, 402, 403, 405, 406. AKT-404 (storage.shared) cancelled per decision #3.

coderabbitai · 2026-06-02T19:17:06Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds GPU interconnect support across the full stack: protobuf schemas extend NodeCapabilities, NodeResources, and Service with interconnect fields; ResourcePair.Dup() and NodeCapabilities.Dup() gain nil/zero-value safety; SDL YAML parsing emits interconnect/interconnect_group on-chain attributes; three cross-field validation rules are enforced in Go and TypeScript; and InterconnectGroup is propagated into generated manifests.

Changes

GPU Interconnect Feature

Layer / File(s)	Summary
Protobuf schema extensions `proto/provider/akash/inventory/v1/node.proto`, `proto/provider/akash/inventory/v1/resources.proto`, `proto/provider/akash/manifest/v2beta3/service.proto`	`NodeCapabilities` gains `interconnect_resource_name`, `interconnect_fabric`, and `nccl_hca_prefix` fields; `NodeResources` gains `gpu_interconnect` (ResourcePair); `Service` gains `interconnect_group` string with JSON/YAML `omitempty`.
Inventory deep-copy safety `go/inventory/v1/resourcepair.go`, `go/inventory/v1/node.go`, `go/inventory/v1/resources.go`, `go/inventory/v1/resourcepair_zero_test.go`, `go/inventory/v1/node_test.go`, `go/inventory/v1/resources_test.go`	`ResourcePair` adds `IsZero()` and updates `Dup()` to safely handle nil/zero receivers and preserve nil-pointer shape per field. `NodeCapabilities.Dup()` copies three new interconnect string fields. `NodeResources.Dup()` copies `GPUInterconnect`. Tests cover zero, nil-receiver, populated round-trip, nil-pointer preservation, and zero-GPUInterconnect panic scenarios.
Deployment attribute preservation tests `go/node/deployment/v1beta4/interconnect_commit_audit_test.go`	Regression tests verify interconnect attributes in `GroupSpec.Requirements.Attributes` and resource `GPU.Attributes` survive `Dup()`, remain accessible across four concrete `ResourceGroup` shapes, and are preserved through `GetResourceUnits` iteration.
SDL GPU parsing: interconnect attributes `go/sdl/gpu.go`, `go/sdl/sdl-input.schema.yaml`, `go/sdl/interconnect_gpu_test.go`	Exports `GPUAttributeInterconnect` and `GPUAttributeInterconnectGroup` constants and adds `v2ResourceGPU.InterconnectGroup` field. `v2GPUAttributes.UnmarshalYAML` parses `interconnect` (bool) and `interconnect_group` (string) and emits sorted on-chain attributes. `v2ResourceGPU.UnmarshalYAML` lifts `interconnect_group`, defers validation, and rejects RDMA attributes when units are zero. Schema adds both new GPU attribute fields.
Manifest building: InterconnectGroup propagation `go/sdl/groupBuilder_v2.go`, `go/sdl/groupBuilder_v2_1.go`, `ts/src/sdl/manifest/generateManifest.ts`	Go SDL group builders (v2 and v2_1) nil-guard and copy `GPU.InterconnectGroup` into `manifest.Service.InterconnectGroup`. TypeScript `buildManifestService` reads `interconnect_group` from GPU attributes and sets `interconnectGroup` in the protobuf payload.
Manifest JSON serialization `ts/src/sdl/manifest/generateManifestVersion.ts`, `ts/src/sdl/manifest/manifestUtils.ts`	`OMITTED_WHEN_EMPTY_STRING_KEYS` is added and `manifestReplacer` omits `interconnectGroup` when empty string to keep the SHA-256 manifest hash stable. `transformGpuAttributes` is refactored to an imperative builder that emits `interconnect`/`interconnect_group` attributes and returns the full sorted result.
SDL cross-field interconnect validation `go/sdl/v2.go`, `go/sdl/v2_1.go`, `go/sdl/interconnect_validation_test.go`, `ts/src/sdl/validateSDL/validateSDL.ts`, `ts/src/sdl/validateSDL/validateSDL.spec.ts`	`validateInterconnect` enforces three invariants in both Go (v2 and v2_1) and TypeScript: `interconnect_group` implies `interconnect=true`, `interconnect=true` profiles require `capabilities/gpu-interconnect=true` on the placement, and no per-placement mixing of explicit and implicit `interconnect_group`. Tests cover all rules in both languages.
Test fixtures `testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml`, `testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/...`	New v2.0 SDL input fixture defines two GPU services sharing `interconnect_group: pair0` under an Infiniband-capable placement. Corresponding `group-specs.json` and `manifest.json` fixtures capture the expected parsed and rendered outputs.

Sequence Diagram(s)

sequenceDiagram
  participant SDL
  participant GPUAttrs
  participant GPUResource
  participant Validator
  participant GroupBuilder
  participant Manifest
  SDL->>GPUAttrs: parse interconnect fields
  GPUAttrs-->>GPUResource: return sorted attributes
  GPUResource->>Validator: submit parsed gpu data
  Validator-->>GroupBuilder: validation passes
  GroupBuilder->>Manifest: copy interconnect group

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the code I go,
New interconnect fields all in a row!
The GPU clusters link paw-in-paw,
Validated rules without a flaw.
ResourcePair won't panic on nil—
This bunny's commits fit the bill! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.38% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	PR description lacks structured sections required by the template: missing Purpose of Change checklist, Related Issues, and Checklist sections.	Add missing template sections: Purpose of Change (with checkboxes), Related Issues (AKT-401 through AKT-406), Checklist items, and Notes for Reviewers to complete the description.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding GPU interconnect multinode support to the chain SDK, matching the primary theme across all modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/rdma-multinode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resources.go`:
- Line 11: NodeResources.Dup() currently calls s.RDMA.Dup() unconditionally
which panics for zero-value RDMA; fix by making ResourcePair.Dup() zero-safe: in
ResourcePair.Dup() detect the zero/uninitialized receiver and return a
zero-value ResourcePair (or a safe initialized copy) instead of dereferencing
nil/internal fields, or alternatively change NodeResources.Dup() to guard the
call (e.g., if s.RDMA.IsZero() { copy.RDMA = ResourcePair{} } else { copy.RDMA =
s.RDMA.Dup() }); update ResourcePair.Dup() and/or add an IsZero helper so RDMA
is safe to duplicate without panics.

In `@go/sdl/v2.go`:
- Around line 561-566: The RDMA validation currently triggers based solely on
gpu.Attributes/rdma_group even for profiles with zero GPUs; update the logic in
the sdl.Profiles.Compute loop (the compute.Resources.GPU handling) to only
consider RDMA attributes or RDMAGroup when the GPU actually has Units > 0, or
alternatively reject rdma/rdma_group when gpu.Units == 0 in
v2ResourceGPU.UnmarshalYAML; specifically change the condition that calls
gpuAttributesHaveRDMA and checks gpu.RDMAGroup to also require gpu.Units > 0 (or
add a guard earlier to return an error if rdma/rdma_group is present while
gpu.Units == 0) so zero-unit GPU profiles do not appear RDMA-enabled.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7b5b662f-91e0-4b06-b1ef-7a719b1a350d

📥 Commits

Reviewing files that changed from the base of the PR and between bf983bc and 5ce7643.

⛔ Files ignored due to path filters (3)

go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (14)

go/inventory/v1/node.go
go/inventory/v1/node_test.go
go/inventory/v1/resources.go
go/inventory/v1/resources_test.go
go/node/deployment/v1beta4/rdma_commit_audit_test.go
go/sdl/gpu.go
go/sdl/groupBuilder_v2.go
go/sdl/groupBuilder_v2_1.go
go/sdl/rdma_gpu_test.go
go/sdl/rdma_validation_test.go
go/sdl/v2.go
proto/provider/akash/inventory/v1/node.proto
proto/provider/akash/inventory/v1/resources.proto
proto/provider/akash/manifest/v2beta3/service.proto

Implements the akash-chain-sdk slice of the RDMA spec at provider/_docs/infiniband-implementation-spec.md, covering Linear tickets AKT-401..AKT-403, AKT-405, AKT-406 (CS-1, CS-2, CS-3, CS-5, CS-6). The shared-storage track (CS-4) is intentionally dropped: per spec decision #3 every service gets its own RWO PVC and the provider does not produce ReadWriteMany volumes. Everything in this commit is off-chain — no on-chain proto messages change, no validator upgrade required. CS-1 — Inventory v1 (proto/provider/akash/inventory/v1/{node,resources}.proto + go/inventory/v1/{node,resources}.pb.go regenerated via `make proto-gen-go`): * Add `ResourcePair rdma = 7` to `NodeResources`. * Add `rdma_resource_name`, `rdma_fabric`, `nccl_hca_prefix` strings to `NodeCapabilities` with gogoproto.customname annotations (`RDMAResourceName`, `RDMAFabric`, `NCCLHCAPrefix`). * Extend `Dup()` helpers in node.go and resources.go. * Tests: `node_test.go`, `resources_test.go` round-trip the new fields through Dup(). CS-2 — SDL parser: `gpu.attributes.rdma: true` (go/sdl/gpu.go): * Accept `rdma: true|false` under `gpu.attributes`. When true, emit a flat on-chain GPU attribute `rdma=true` so providers that advertise `capabilities/gpu/rdma=true` match. * Tests: `rdma_gpu_test.go::TestV2ResourceGPU_RDMAFlag` plus the existing `TestV2ResourceGPU` regression continues to pass. CS-3 — SDL parser: `gpu.attributes.rdma_group: <name>` → off-chain manifest field (go/sdl/gpu.go, proto/provider/akash/manifest/v2beta3/service.proto regenerated to go/manifest/v2beta3/service.pb.go, go/sdl/groupBuilder_v2{,_1}.go): * Parser captures rdma_group via an internal sentinel attribute key (`__rdma_group__`) inside `v2GPUAttributes.UnmarshalYAML`. The parent `v2ResourceGPU.UnmarshalYAML` strips the sentinel before the slice ever reaches on-chain `Resources.GPU.attributes` and surfaces the value on a dedicated `v2ResourceGPU.RDMAGroup` field. `Validate()` is deferred to the parent so the sentinel can be removed before the attribute-key regex runs. * Manifest `Service.proto` gains `rdma_group = 11` with `gogoproto.customname = "RDMAGroup"`. Bindings regenerated via `make proto-gen-go`. * Both v2 / v2.1 group builders now read `compute.Resources.GPU.RDMAGroup` and propagate it onto `manifest.Service.RDMAGroup`. * Tests: `TestV2ResourceGPU_RDMAGroupRoutedOffChain` (asserts the sentinel never escapes) and `TestV2ResourceGPU_RDMAGroupOmitted`. CS-5 — Parser cross-field validations (go/sdl/v2.go, validate() → new validateRDMA()): 1. Any compute profile with `gpu.attributes.rdma: true` requires its placement attributes to include `capabilities/rdma=true`. 2. Any compute profile with `gpu.attributes.rdma_group` set must also declare `gpu.attributes.rdma: true` on the same profile. 3. Within one placement, no implicit-default-plus-explicit mixing: if any profile sets rdma_group, every RDMA-using profile must. Helpers `gpuAttributesHaveRDMA` and `placementRequiresRDMA` are kept package-local so the SDL parser owns the policy. * Tests: `rdma_validation_test.go` (6 positive + negative fixtures). CS-6 — Reservation commit path audit (go/node/deployment/v1beta4/rdma_commit_audit_test.go): Table-driven regression test pinning down that `GroupSpec.Dup()` and the four concrete `ResourceGroup`-shaped values the provider's reservation path can hold (`*Group`, `Group`, `*GroupSpec`, `GroupSpec`) all preserve `Requirements.Attributes` (carrying `capabilities/rdma=true`) AND each resource's `GPU.Attributes` (carrying `rdma=true`). A future change that silently drops either slice — the exact failure mode the spec calls out — fails this test loudly. All `.pb.go` files were regenerated via `make proto-gen-go` (buf v1.47.2, protoc v29.1, gogoproto v1.7.2). Running `make proto-gen-go` against this tree should be a no-op. Tests (go test ./...): - pkg.akt.dev/go/inventory/v1 PASS - pkg.akt.dev/go/manifest/v2beta3 PASS - pkg.akt.dev/go/node/deployment/v1beta4 PASS (+ new CS-6 audit) - pkg.akt.dev/go/sdl PASS (+ new CS-2/CS-3/CS-5) Follow-ups for reviewers: - TypeScript bindings (`ts/`) are not touched here and need a separate `make proto-gen-ts` pass before the TS SDK consumes the new fields. Linear: AKT-401, AKT-402, AKT-403, AKT-405, AKT-406 (AKT-404 / CS-4 cancelled — see spec decision #3) Fix make/setup-cache.mk: defer GOLANGCI_LINT_MAJOR computation to recipe-execute time and depend on $(SEMVER), so the install isn't given an empty major and the broken module path .../golangci-lint/v/cmd/golangci-lint. This bug had `lint/go` red on main for several commits prior.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

go/inventory/v1/resourcepair_zero_test.go (1)
38-49: ⚡ Quick win

Add a regression test for partial-nil ResourcePair duplication.

Please add a case where one of Capacity/Allocatable/Allocated is nil and verify Dup() keeps that field nil after copy (instead of converting it to zero). This will prevent future contract regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/inventory/v1/resourcepair_zero_test.go` around lines 38 - 49, Add a
regression test that ensures ResourcePair.Dup() preserves nil fields: in
resourcepair_zero_test.go (e.g., extend or add a test alongside
TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resourcepair.go`:
- Around line 70-85: The Dup implementation in resourcepair.go currently always
returns non-nil pointers (&capacity etc.), losing nil presence; change it so
Capacity, Allocatable and Allocated are set to nil when the source
m.Capacity/m.Allocatable/m.Allocated are nil, and only allocate and assign a
DeepCopy pointer when the corresponding source field is non-nil (e.g., check
m.Capacity != nil then copy and set Capacity to that pointer, otherwise set
Capacity to nil); keep Attributes copied via m.Attributes.Dup() as is.

---

Nitpick comments:
In `@go/inventory/v1/resourcepair_zero_test.go`:
- Around line 38-49: Add a regression test that ensures ResourcePair.Dup()
preserves nil fields: in resourcepair_zero_test.go (e.g., extend or add a test
alongside TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89166079-2d14-4230-b011-b2b98830a3cb

📥 Commits

Reviewing files that changed from the base of the PR and between 5ce7643 and 060c02f.

⛔ Files ignored due to path filters (7)

go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
ts/src/sdl/manifest/__snapshots__/generateManifestVersion.spec.ts.snap is excluded by !**/*.snap

📒 Files selected for processing (29)

go/inventory/v1/node.go
go/inventory/v1/node_test.go
go/inventory/v1/resourcepair.go
go/inventory/v1/resourcepair_zero_test.go
go/inventory/v1/resources.go
go/inventory/v1/resources_test.go
go/node/deployment/v1beta4/rdma_commit_audit_test.go
go/sdl/gpu.go
go/sdl/groupBuilder_v2.go
go/sdl/groupBuilder_v2_1.go
go/sdl/rdma_gpu_test.go
go/sdl/rdma_validation_test.go
go/sdl/v2.go
proto/provider/akash/inventory/v1/node.proto
proto/provider/akash/inventory/v1/resources.proto
proto/provider/akash/manifest/v2beta3/service.proto
testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
testdata/sdl/output-fixtures/v2.0/http-options/manifest.json
testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
testdata/sdl/output-fixtures/v2.0/multiple-services/manifest.json
testdata/sdl/output-fixtures/v2.0/persistent-storage/manifest.json
testdata/sdl/output-fixtures/v2.0/placement/manifest.json
testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
testdata/sdl/output-fixtures/v2.0/simple/manifest.json
testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
testdata/sdl/output-fixtures/v2.1/credentials/manifest.json
testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json

✅ Files skipped from review due to trivial changes (10)

testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
go/inventory/v1/resources.go
testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
testdata/sdl/output-fixtures/v2.0/placement/manifest.json
testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
testdata/sdl/output-fixtures/v2.0/simple/manifest.json
testdata/sdl/output-fixtures/v2.0/pricing/manifest.json

🚧 Files skipped from review as they are similar to previous changes (12)

go/sdl/groupBuilder_v2_1.go
go/inventory/v1/node.go
proto/provider/akash/inventory/v1/node.proto
proto/provider/akash/manifest/v2beta3/service.proto
proto/provider/akash/inventory/v1/resources.proto
go/inventory/v1/resources_test.go
go/sdl/groupBuilder_v2.go
go/node/deployment/v1beta4/rdma_commit_audit_test.go
go/inventory/v1/node_test.go
go/sdl/v2.go
go/sdl/gpu.go
go/sdl/rdma_validation_test.go

Six review items from chalabi2 + CodeRabbit: - proto/service.proto + regenerated .pb.go/.ts: add omitempty to Service.RDMAGroup JSON/YAML tags. The on-chain manifest version hash is a SHA over the JSON-serialized off-chain manifest; without omitempty every pre-RDMA service serialized "rdmaGroup": "" and shifted the hash for existing leases, breaking send-manifest validation. Fixtures regenerated — diff vs main is zero on every non-RDMA testdata manifest, confirming hash stability. - sdl/v2.go + v2_1.go: extract validateRDMA() into a free function taking (profiles, deployments) and call it from both v2 and v2.1 validate(). v2.1 inherits the full RDMA SDL grammar from v2 (parser promotes gpu.attributes.rdma + rdma_group through to the manifest), so the cross-field invariants (rule 1: rdma-required workload reaches rdma-capable provider; rule 2: rdma_group ⇒ rdma=true; rule 3: no implicit/explicit group mixing within one deployment) must apply symmetrically. Without this, invalid v2.1 SDLs bypass the new rules. - sdl/sdl-input.schema.yaml: declare rdma (boolean) and rdma_group (string) under gpu.attributes. The parser accepts them but the schema had additionalProperties: false and only allowed `vendor`, rejecting valid RDMA SDLs at schema-validation time. - inventory/v1/resourcepair.go: ResourcePair.Dup() now preserves the nil/non-nil shape of Capacity/Allocatable/Allocated rather than always returning &zeroQuantity. Returning non-nil pointers for originally-nil fields changes protobuf field-presence semantics and shifts the JSON serialization (which feeds the manifest hash). Regression-pinned by TestResourcePair_Dup_PreservesNilPointers. - sdl/v2.go validateRDMA: defense-in-depth gate on gpu.Units > 0. The parser already rejects rdma/rdma_group on zero-GPU profiles, but the validator should not classify a zero-GPU profile as RDMA-enabled if that parser path is ever bypassed. The TS proto-regen as part of #6 incidentally addresses the TS bindings gap — ts/src/generated/protos/akash/manifest/v2beta3/service.ts now parses and emits rdma_group with the same omitempty semantics.

CI sdl-parity job failed on the prior commit because the Go fixtures (with omitempty) drop rdmaGroup for non-RDMA services, while TS's manifestReplacer kept emitting "rdmaGroup":"". Mismatch on every v2.0 fixture. manifestReplacer already had OMITTED_MANIFEST_KEYS for empty arrays / zero numbers. Adding a parallel OMITTED_WHEN_EMPTY_STRING_KEYS set so fields that use Go's `omitempty` semantics for string values can be declared the same way on the TS side. rdmaGroup is the only entry for now; future "omitempty string" fields just add their key here. Updated the generateManifestVersion snapshot to match the corrected serialization (no more "rdmaGroup":"" leaking into the hash input).

Prep for AKT-443 (provider bid-engine group-aware Adjust). The provider needs to enforce per-rdma_group node separation at fit time — today the workload builder's hostname pod anti-affinity is the only gate, which fires after the bid has been accepted. The chain-SDK previously stripped rdma_group from the GPU attribute slice and surfaced it only on the off-chain Service.RDMAGroup field, so the bid engine had no signal. Change: emit `rdma_group=<value>` as a regular on-chain GPU attribute alongside `rdma=true`, while still lifting the value into v2ResourceGPU.RDMAGroup → Service.RDMAGroup for the workload builder. Both consumers see the same value via their respective paths. - go/sdl/gpu.go: drop the gpuAttributeRDMAGroupSentinel layer. Emit rdma_group directly (the key matches the attribute key regex; no underscore prefix needed). Lift-but-keep instead of lift-and-strip. - go/sdl/rdma_gpu_test.go: update assertions — the on-chain slice now CONTAINS rdma_group when set, AND v2ResourceGPU.RDMAGroup holds the same value. New name reflects the end-to-end flow. - ts/src/sdl/manifest/manifestUtils.ts: transformGpuAttributes now emits both rdma and rdma_group keys, and sorts the result to match Go's sort.Sort(res) byte ordering. - ts/src/sdl/manifest/generateManifest.ts: pass rdmaGroup from compute.resources.gpu.attributes.rdma_group into Service.fromPartial so it surfaces on the off-chain Service.rdmaGroup field. - ts/src/sdl/validateSDL/validateSDLInput.ts: regenerated from the updated input schema (already had rdma + rdma_group properties from the prior commit). - testdata/sdl/input/v2.0/gpu-rdma-group/: new fixture exercising the end-to-end path. Each service uses its own profile to sidestep an orthogonal Go-vs-TS resource-ID-assignment parity gap. Rollout note: providers running pre-skip chain-sdk versions will reject orders carrying the new rdma_group attribute (ParseGPUAttributes returns "invalid GPU attribute"). The provider PR already lands the skip; any provider wanting to bid on RDMA orders must upgrade past it.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

go/sdl/gpu.go (1)

156-160: ⚡ Quick win

Update stale sentinel comments to match current rdma_group flow.

Line 156 and Line 209 still describe a sentinel that gets stripped, but the code now keeps rdma_group in on-chain GPU attributes. Please align comments with actual behavior to avoid future mis-implementation.

Suggested comment-only diff

-			// gpu.attributes.rdma_group: string (peer group name). Captured
-			// here and emitted into the slice as a sentinel attribute that
-			// v2ResourceGPU.UnmarshalYAML strips before it reaches chain
-			// state. See gpuAttributeRDMAGroupSentinel.
+			// gpu.attributes.rdma_group: string (peer group name). Captured
+			// here and emitted directly as an on-chain GPU attribute
+			// (GPUAttributeRDMAGroup). v2ResourceGPU.UnmarshalYAML also lifts
+			// the same value into v2ResourceGPU.RDMAGroup for manifest routing.

-	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
-	// rdma_group sentinel can be stripped from the slice before the
-	// attribute-key regex runs against it.
+	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
+	// final assembled attribute slice is validated once in one place.

Also applies to: 209-211

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/sdl/gpu.go` around lines 156 - 160, The comments near the GPU YAML decode
(around the node.Content[i+1].Decode(&rdmaGroup) block) and the comment
referencing gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are
stale: they state rdma_group is emitted as a sentinel that
v2ResourceGPU.UnmarshalYAML strips, but the code now preserves rdma_group in
on-chain GPU attributes. Update those comments to reflect the current flow
(rdma_group is decoded and retained in the GPU attributes on-chain) and remove
or reword any mention of a sentinel being stripped; keep references to
gpuAttributeRDMAGroupSentinel and v2ResourceGPU.UnmarshalYAML only if describing
their actual behavior in the current implementation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts/src/sdl/manifest/manifestUtils.ts`:
- Around line 78-87: The code currently pushes attributes.rdma_group into the
on-chain GPU attributes (via the result.push({ key: "rdma_group", ... }) call),
but rdma_group must remain off-chain; remove the block that emits rdma_group
into Resources.GPU.Attributes in manifestUtils (i.e., delete or disable the if
(attributes.rdma_group && attributes.rdma_group.length > 0) result.push(...)
branch) and keep only the rdma boolean emission (the if (attributes.rdma ===
true) result.push({ key: "rdma", value: "true" }) branch); ensure any callers or
type definitions still treat attributes.rdma_group as an off-chain-only field
and do not rely on it being serialized into the result array.

---

Nitpick comments:
In `@go/sdl/gpu.go`:
- Around line 156-160: The comments near the GPU YAML decode (around the
node.Content[i+1].Decode(&rdmaGroup) block) and the comment referencing
gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are stale: they
state rdma_group is emitted as a sentinel that v2ResourceGPU.UnmarshalYAML
strips, but the code now preserves rdma_group in on-chain GPU attributes. Update
those comments to reflect the current flow (rdma_group is decoded and retained
in the GPU attributes on-chain) and remove or reword any mention of a sentinel
being stripped; keep references to gpuAttributeRDMAGroupSentinel and
v2ResourceGPU.UnmarshalYAML only if describing their actual behavior in the
current implementation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a3e93caa-5b29-4589-9819-2890ceac7099

📥 Commits

Reviewing files that changed from the base of the PR and between 060c02f and c1bbce4.

⛔ Files ignored due to path filters (2)

go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**

📒 Files selected for processing (15)

go/inventory/v1/resourcepair.go
go/inventory/v1/resourcepair_zero_test.go
go/sdl/gpu.go
go/sdl/rdma_gpu_test.go
go/sdl/sdl-input.schema.yaml
go/sdl/v2.go
go/sdl/v2_1.go
proto/provider/akash/manifest/v2beta3/service.proto
testdata/sdl/input/v2.0/gpu-rdma-group/input.yaml
testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/manifest.json
ts/src/sdl/manifest/generateManifest.ts
ts/src/sdl/manifest/generateManifestVersion.ts
ts/src/sdl/manifest/manifestUtils.ts
ts/src/sdl/validateSDL/validateSDLInput.ts

✅ Files skipped from review due to trivial changes (1)

testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json

🚧 Files skipped from review as they are similar to previous changes (2)

go/inventory/v1/resourcepair.go
go/sdl/v2.go

TS validateSDL had no RDMA semantic checks; tenants using the TS SDK could broadcast SDLs that the Go parser would reject outright (chain doesn't validate SDL semantics, so the failure would surface later as broken pods). Adds a #validateRDMA method on SDLValidator, called from validate() after the per-service loop. Mirrors the Go-side validateRDMA in go/sdl/v2.go and enforces the same three cross-field rules: 1. A profile with gpu.attributes.rdma=true must be deployed under a placement whose attributes require capabilities/rdma=true. 2. A profile with gpu.attributes.rdma_group set must also have gpu.attributes.rdma=true on the same profile. 3. Within one deployment (placement), if any profile sets rdma_group, every rdma=true profile under that placement must also set rdma_group — no implicit-default-plus-explicit mixing. The units==0 + rdma/rdma_group case is already rejected by the schema-level gpuAttributesRequireUnitsGt0 rule (any attribute requires units > 0), so no semantic check needed there. Pinned by two new tests that assert the schema path, so a future schema relaxation can't silently reopen the hole. Five new tests cover the rejection paths plus a happy path. Existing SDL parity tests stay at 34/34.

Spec rework: tenants now request GPU interconnect as a single boolean `gpu.attributes.interconnect: true`; the provider picks IB vs RoCE. Removes fabric pin from the SDL surface; informational fabric value still flows internally via NodeCapabilities.InterconnectFabric so providers can advertise it for monitoring. Full top-to-bottom rename: Proto: - akash.inventory.v1.NodeCapabilities.rdma_resource_name -> interconnect_resource_name - akash.inventory.v1.NodeCapabilities.rdma_fabric -> interconnect_fabric - akash.inventory.v1.NodeResources.rdma -> gpu_interconnect (customname: GPUInterconnect) - akash.manifest.v2beta3.Service.rdma_group -> interconnect_group (proto field number 11 preserved; jsontag interconnectGroup,omitempty) SDL (Go): - gpu.attributes.rdma -> gpu.attributes.interconnect - gpu.attributes.rdma_group -> gpu.attributes.interconnect_group - capabilities/rdma placement key -> capabilities/gpu-interconnect - GPUAttributeRDMA -> GPUAttributeInterconnect (const) - GPUAttributeRDMAGroup -> GPUAttributeInterconnectGroup (const) - v2ResourceGPU.RDMAGroup -> InterconnectGroup - validateRDMA -> validateInterconnect - gpuAttributesHaveRDMA -> gpuAttributesHaveInterconnect - placementRequiresRDMA -> placementRequiresInterconnect - groupBuilder_v2 + v2_1 propagate Service.InterconnectGroup Inventory v1 (Go): - NodeCapabilities.RDMAResourceName -> InterconnectResourceName - NodeCapabilities.RDMAFabric -> InterconnectFabric - NodeResources.RDMA -> GPUInterconnect TS: - transformGpuAttributes emits interconnect + interconnect_group keys - generateManifest sets Service.interconnectGroup from compute.resources.gpu.attributes.interconnect_group - generateManifestVersion OMITTED_WHEN_EMPTY_STRING_KEYS = {interconnectGroup} - validateSDL.#validateInterconnect mirrors Go's rule 1/2/3 with the new keys - sdl-input.schema.yaml declares interconnect (boolean) + interconnect_group (string); validateSDLInput.ts regenerated Tests + fixtures: - testdata/sdl/input/v2.0/gpu-rdma-group -> gpu-interconnect-group - Renamed test files: rdma_gpu_test -> interconnect_gpu_test; rdma_validation_test -> interconnect_validation_test; rdma_commit_audit_test -> interconnect_commit_audit_test - All identifier renames inside test bodies Unavoidable external references kept: - /sys/class/infiniband and rdma/rdma_shared_device_* — kernel sysfs + Mellanox/NVIDIA device-plugin resource names - NCCL_IB_DISABLE/NCCL_IB_HCA env vars — NCCL upstream - "infiniband" / "roce" fabric value strings — kernel link_layer values

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts/src/sdl/validateSDL/validateSDL.spec.ts`:
- Line 2270: Fix the spacing and indentation issues in the interconnect test
blocks that are causing TypeScript lint check failures. In
ts/src/sdl/validateSDL/validateSDL.spec.ts at lines 2270, 2280, 2326, and 2341,
review the `interconnectSetup()` function calls and the surrounding test code
for any extra spaces, misaligned indentation, or formatting inconsistencies, and
correct them to comply with the project's linting rules. Ensure consistent
spacing around parameters and proper indentation alignment throughout these test
cases.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 552b0ea7-ad28-4393-b001-95ba74712030

📥 Commits

Reviewing files that changed from the base of the PR and between 5bf26e0 and ffed18a.

⛔ Files ignored due to path filters (6)

go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**

📒 Files selected for processing (27)

go/inventory/v1/node.go
go/inventory/v1/node_test.go
go/inventory/v1/resourcepair.go
go/inventory/v1/resourcepair_zero_test.go
go/inventory/v1/resources.go
go/inventory/v1/resources_test.go
go/node/deployment/v1beta4/interconnect_commit_audit_test.go
go/sdl/gpu.go
go/sdl/groupBuilder_v2.go
go/sdl/groupBuilder_v2_1.go
go/sdl/interconnect_gpu_test.go
go/sdl/interconnect_validation_test.go
go/sdl/sdl-input.schema.yaml
go/sdl/v2.go
go/sdl/v2_1.go
proto/provider/akash/inventory/v1/node.proto
proto/provider/akash/inventory/v1/resources.proto
proto/provider/akash/manifest/v2beta3/service.proto
testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml
testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/group-specs.json
testdata/sdl/output-fixtures/v2.0/gpu-interconnect-group/manifest.json
ts/src/sdl/manifest/generateManifest.ts
ts/src/sdl/manifest/generateManifestVersion.ts
ts/src/sdl/manifest/manifestUtils.ts
ts/src/sdl/validateSDL/validateSDL.spec.ts
ts/src/sdl/validateSDL/validateSDL.ts
ts/src/sdl/validateSDL/validateSDLInput.ts

🚧 Files skipped from review as they are similar to previous changes (1)

go/inventory/v1/resourcepair.go

CodeRabbit lint follow-up on the rename pass — sed left a double space after `{` in two interconnectSetup() calls and a stray leading space on two `interconnect: true` lines inside the rule-3 fixture. Cosmetic only; 89/89 unit tests + 34/34 SDL parity tests still pass.

Refactors the rc4 interconnect SDL surface from the flat `interconnect: true` + `interconnect_group: <name>` pair to a single nested key. Tenants write one of two forms under `gpu.attributes.interconnect`: interconnect: [] # implicit, parser assigns "auto" interconnect: { group: <name> } # explicit named group Bare boolean `interconnect: true` is retired; every opt-in now carries a group label (a 1-member group is fine — anti-affinity becomes a no-op). The on-chain wire format collapses to a single key `interconnect/group = <name>` (slash-path matching the existing `capabilities/gpu-interconnect/...` convention); the standalone `interconnect = "true"` marker is dropped. Also generalizes `nccl_hca_prefix: string` to `nccl_hca_prefixes: repeated string` so mixed-vendor hosts (e.g. mlx5 + bnxt_re on one node) can publish every HCA family; the workload builder joins with commas for NCCL_IB_HCA which NCCL accepts natively. Validator rules (Go + TS): 1. Any opt-in requires placement attribute capabilities/gpu-interconnect=true. 2. The name "auto" is reserved (parser only). 3. Within one placement, no mixing of implicit and explicit forms. 4. gpu.units==0 with any opt-in rejected (schema-level). Spec: docs/sdl-interconnect-spec.md. Breaking change vs rc4 — acceptable, no production tenants yet. Cuts go/v0.2.13-rc5 + go/sdl/v0.2.2-rc5. Linear: AKT-492

Autofix from `eslint --fix` — the AKT-492 commit added the `resolveInterconnectGroup` import out of alphabetical order in the manifestUtils.ts destructured import block. CI runs `eslint src` (scoped) and this was the only error in scope. No behavior change.

# Conflicts: # go/manifest/v2beta3/service.pb.go # ts/src/sdl/validateSDL/validateSDLInput.ts

Field was uppercase (InterconnectGroup) only by convention with sibling fields, but the rest of the API design already made it package-internal — the struct is unexported (v2ResourceGPU), and yaml:"-" json:"-" tags excluded it from every serialization path. Per Artur's review (chain-sdk PR #315): lowercase is the more honest choice. All four call sites (gpu.UnmarshalYAML, v2.validateInterconnect, groupBuilder_v2.go, groupBuilder_v2_1.go) are inside the sdl package already, so the rename is internal-only with no API impact. The constants (GPUAttributeInterconnectGroup, InterconnectGroupAuto) and the off-chain Service.InterconnectGroup field stay uppercase — they're exported by intent. Tests still pass (sdl 91/91, parity 34/34).

Zblocker64 requested a review from a team as a code owner June 2, 2026 19:16

github-actions Bot added C:go/node C:go/manifest C:proto/provider C:go/sdl C:go/inventory labels Jun 2, 2026

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread go/inventory/v1/resources.go Outdated

Comment thread go/sdl/v2.go Outdated

Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from 6024641 to 5252ecc Compare June 2, 2026 19:31

github-actions Bot added the Type: Build label Jun 2, 2026

Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from c0301cb to 4c19e6b Compare June 2, 2026 20:03

github-actions Bot added the C:ts label Jun 2, 2026

Zblocker64 force-pushed the feat/rdma-multinode branch from 4c19e6b to 5745d21 Compare June 2, 2026 20:14

Merge branch 'main' into feat/rdma-multinode

060c02f

github-actions Bot removed the Type: Build label Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread go/inventory/v1/resourcepair.go Outdated

chalabi2 requested changes Jun 8, 2026

View reviewed changes

Comment thread go/sdl/groupBuilder_v2_1.go Outdated

Comment thread proto/provider/akash/manifest/v2beta3/service.proto Outdated

Comment thread go/sdl/gpu.go Outdated

Comment thread ts/src/generated/protos/akash/manifest/v2beta3/service.ts Outdated

Zblocker64 added 3 commits June 8, 2026 13:08

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread ts/src/sdl/manifest/manifestUtils.ts Outdated

chalabi2 previously approved these changes Jun 9, 2026

View reviewed changes

Zblocker64 added 2 commits June 15, 2026 14:26

Merge remote-tracking branch 'origin/main' into feat/rdma-multinode

ffed18a

Zblocker64 dismissed chalabi2’s stale review via ffed18a June 15, 2026 19:28

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread ts/src/sdl/validateSDL/validateSDL.spec.ts Outdated

Zblocker64 changed the title ~~feat: RDMA multinode GPU support (chain SDK pieces)~~ feat: GPU interconnect multinode GPU support (chain SDK pieces) Jun 15, 2026

Zblocker64 changed the title ~~feat: GPU interconnect multinode GPU support (chain SDK pieces)~~ feat: GPU interconnect multinode support (chain SDK pieces) Jun 15, 2026

troian requested changes Jun 16, 2026

View reviewed changes

Comment thread proto/provider/akash/inventory/v1/node.proto

Comment thread testdata/sdl/input/v2.0/gpu-interconnect-group/input.yaml Outdated

Comment thread go/inventory/v1/node.go Outdated

github-actions Bot added the Typ: Docs label Jun 16, 2026

Zblocker64 added 2 commits June 16, 2026 11:39

Merge remote-tracking branch 'origin/main' into feat/rdma-multinode

7baac95

# Conflicts: # go/manifest/v2beta3/service.pb.go # ts/src/sdl/validateSDL/validateSDLInput.ts

troian reviewed Jun 16, 2026

View reviewed changes

Comment thread go/sdl/gpu.go Outdated

Zblocker64 force-pushed the feat/rdma-multinode branch from c84cb72 to 1d633ac Compare June 19, 2026 15:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: GPU interconnect multinode support (chain SDK pieces)#315

feat: GPU interconnect multinode support (chain SDK pieces)#315
Zblocker64 wants to merge 13 commits into
mainfrom
feat/rdma-multinode

Zblocker64 commented Jun 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Zblocker64 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in here

Out of scope

Linear

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Zblocker64 commented Jun 2, 2026 •

edited

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading