Design doc for #1933: Issue #1933 Design: GPU support #1936

Open
johnynek wants to merge 10 commits into main from agent/design/1933-design-for-gpu-support

johnynek commented Mar 2, 2026

Owner

Design doc.

Refs #1933


external struct Session

external list_devices: Prog[GPUError, Array[DeviceInfo]]

Owner Author

I don't think GPU operations are effects in general, are they? They are more like pure functions that can fail. So maybe we want a `GPU[_]` monad that we can run in a pure context to produce an enum such as `GPUResult[a]: GPUFailed(err: GPUError), GPUSuccess(result: a)`, or something like that.

Would that work?

Owner Author

Implemented in-doc: I switched the API model from Prog[GPUError, ...] to a dedicated GPU[_] plus GPUResult[_] with run_GPU, and updated the core/tensor API sketches and example accordingly. This now models GPU programs as pure descriptions that are interpreted explicitly to GPUResult, matching your suggestion.
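For readers who want a concrete picture of "pure descriptions that are interpreted explicitly", here is a minimal sketch in Python rather than Bosatsu. The names `GPUError`, `GPUFailed`, `GPUSuccess`, and `run_GPU` come from this thread; the `Pure`, `Fail`, and `FlatMap` node types and the `checked_div` helper are hypothetical, invented only for illustration, and nothing here touches a real GPU or reflects the actual design doc API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class GPUError:
    message: str


# The two cases of GPUResult, mirroring the enum suggested in the review.
@dataclass(frozen=True)
class GPUFailed:
    err: GPUError


@dataclass(frozen=True)
class GPUSuccess:
    result: Any


# Hypothetical description nodes: building them performs no work.
@dataclass(frozen=True)
class Pure:
    value: Any


@dataclass(frozen=True)
class Fail:
    err: GPUError


@dataclass(frozen=True)
class FlatMap:
    source: Any
    fn: Callable[[Any], Any]


def run_GPU(prog):
    """Interpret a description into GPUFailed or GPUSuccess."""
    if isinstance(prog, Pure):
        return GPUSuccess(prog.value)
    if isinstance(prog, Fail):
        return GPUFailed(prog.err)
    if isinstance(prog, FlatMap):
        inner = run_GPU(prog.source)
        if isinstance(inner, GPUFailed):
            return inner  # the first failure short-circuits the rest
        return run_GPU(prog.fn(inner.result))
    raise TypeError(f"not a GPU description: {prog!r}")


def checked_div(x, y):
    # Hypothetical operation that can fail without being an effect.
    if y == 0:
        return Fail(GPUError("division by zero"))
    return Pure(x / y)


ok = run_GPU(FlatMap(Pure(10.0), lambda x: checked_div(x, 4.0)))
bad = run_GPU(
    FlatMap(FlatMap(Pure(1.0), lambda x: checked_div(x, 0.0)),
            lambda x: Pure(x + 1.0))
)
```

Because the caller gets back an ordinary value and pattern-matches on it, failure handling stays in pure code, which is the property the review comment asks for.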

external struct Tensor1_Float64
external struct Tensor2_Float64
external struct Tensor3_Float64
external struct Tensor4_Float64

Owner Author

Explain why these need to be non-generic. It's a bit ugly not to have `Tensor1[a]`.

Owner Author

Added a new subsection right after the tensor API/notes: "Why not Tensor1[a] in v1?". It explains why v1 uses concrete tensor types (Int/Float64): kernel coverage clarity, current externals/typeclass limitations across runtimes, pack-mode specificity, and FFI ABI stability.

4. Wire build and install in `c_runtime/Makefile`.
5. Add command-level coverage in `core/src/test/scala/dev/bosatsu/ToolAndLibCommandTest.scala` forcing CPU backend.

### Phase 3: Metal and CUDA backends

Owner Author

Explain how this interacts with `./bosatsu c-runtime install`. Do we add flags for the GPU support there? I guess so.

Owner Author

Added a dedicated subsection under Backend architecture: `bosatsu c-runtime install` integration. It now explains the proposed install/build flags (`--gpu-backends`, `--cuda-home`, `--metal-sdk`, `--gpu-strict`) and how requested backends are forwarded to `c_runtime/Makefile` with CPU-safe fallback behavior.
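One way to read "CPU-safe fallback" is sketched below in Python. This is an illustration under stated assumptions, not the behavior defined in the design doc: it assumes that `--gpu-strict` turns an unavailable backend into an error instead of a silent fallback, that the CPU backend is always built, and that the Makefile receives variables named `GPU_BACKENDS`, `CUDA_HOME`, and `METAL_SDK`. Those variable names and both function names are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def resolve_backends(requested: Iterable[str],
                     available: Set[str],
                     strict: bool = False) -> List[str]:
    """Pick the backends to build; 'cpu' is always included (assumption)."""
    wanted = [b for b in requested if b != "cpu"]
    missing = [b for b in wanted if b not in available]
    if missing and strict:
        # Assumed meaning of --gpu-strict: fail rather than fall back.
        raise RuntimeError(
            "requested GPU backends unavailable: " + ", ".join(missing)
        )
    return ["cpu"] + [b for b in wanted if b in available]


def make_variables(backends: List[str],
                   cuda_home: Optional[str] = None,
                   metal_sdk: Optional[str] = None) -> List[str]:
    """Build hypothetical VAR=value arguments to forward to make."""
    args = ["GPU_BACKENDS=" + ",".join(backends)]
    if cuda_home is not None and "cuda" in backends:
        args.append("CUDA_HOME=" + cuda_home)
    if metal_sdk is not None and "metal" in backends:
        args.append("METAL_SDK=" + metal_sdk)
    return args


# Requesting cuda and metal on a machine where only metal is usable.
chosen = resolve_backends(["cuda", "metal"], available={"metal"})
forwarded = make_variables(chosen, cuda_home="/opt/cuda", metal_sdk="macosx")
```

Under these assumptions, a non-strict install on a machine without CUDA still produces a working CPU (and Metal) runtime, while a strict install reports the missing backend up front.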

codecov Bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.05%. Comparing base (f77a22f) to head (f7a40ad).

⚠️ Current head f7a40ad differs from pull request most recent head f54ae30

Please upload reports for the commit f54ae30 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1936      +/-   ##
==========================================
+ Coverage   85.02%   85.05%   +0.03%     
==========================================
  Files         195      190       -5     
  Lines       49216    47950    -1266     
  Branches    12234    12018     -216     
==========================================
- Hits        41847    40785    -1062     
+ Misses       7369     7165     -204     

johnynek commented Mar 2, 2026

Owner Author

Updated docs/design/1933-design-for-gpu-support.md to address all three review threads directly in the design. Also ran scripts/test_basic.sh per MergeXO reminder; it passed.

johnynek commented Mar 2, 2026

Owner Author

No new line-level feedback to apply. I reran the required local pre-push check in this turn: scripts/test_basic.sh passed (Passed: Total 61, Failed 0, Errors 0). The reported GitHub Actions failure for run 22595679770 is an external action-fetch failure (actions/setup-java tarball URI not found) before project steps run, so it is not reproducible as a repository code failure and not fixable by changes in this PR. No file edits were made in this turn.

@johnynek

Owner Author

Merged origin/main into the PR branch via MergeXO and rechecked the refreshed checkout. scripts/test_basic.sh passed locally after the merge (Passed: Total 67, Failed 0, Errors 0). No additional edits to docs/design/1933-design-for-gpu-support.md were needed after the refresh.

@johnynek

Owner Author

MergeXO’s origin/main refresh is applied cleanly in this checkout. I reran scripts/test_basic.sh after the merge and it passed locally (Passed: Total 67, Failed 0, Errors 0). No additional edits to docs/design/1933-design-for-gpu-support.md were needed after the refresh.

@johnynek

Owner Author

MergeXO’s origin/main refresh is applied cleanly in this checkout. I verified the refreshed branch state and reproduced that the isolated coreJVM/testOnly dev.bosatsu.codegen.clang.ClangGenTest -- --log=failure suite passes locally (Passed: Total 31, Failed 0, Errors 0). I also reran the required scripts/test_basic.sh, but in this checkout it remained running for an extended period under heavy GC pressure and never reached a final pass summary within the turn, so I cannot truthfully return a ready-to-push commit message yet.

@johnynek

Owner Author

MergeXO's fetch and merge of origin/main completed successfully; local head is d501602 with only docs/design/1933-design-for-gpu-support.md added against origin/main. I ran the required scripts/test_basic.sh. The sbt test phase reported `Passed: Total 67, Failed 0, Errors 0`, but the wrapper then emitted timeout diagnostics for two tests, `dev.bosatsu.DefRecursionCheckTest` ("non recursive cases are allowed") and `dev.bosatsu.SelfCallKindTest` ("test selfCallKind"), and the command remained running without a clean script exit. Because the required pre-push gate did not complete cleanly, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

MergeXO's fetch and merge of origin/main completed successfully; local head is 13005fc with only docs/design/1933-design-for-gpu-support.md added against origin/main. I ran the required scripts/test_basic.sh. The sbt test command reported `Passed: Total 67, Failed 0, Errors 0, Passed 67` and `[success] Total time: 570 s`, but the process then remained running and emitted timeout diagnostics for three tests: `dev.bosatsu.MatchlessRegressionTest` ("polymorphic recursion lowers to while in Matchless without self-calls"), `dev.bosatsu.Issue1633Test` ("issue 1633: loop+string-match lowering avoids captured non-function apply"), and `dev.bosatsu.tool.CompileCacheTest` ("dependency interface changes invalidate the compile key"). Because the required pre-push gate did not exit cleanly, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

MergeXO's fetch and merge of origin/main completed successfully; local head is f7a40ad with only docs/design/1933-design-for-gpu-support.md added against origin/main. I ran the required scripts/test_basic.sh. The sbt test command reported `Passed: Total 67, Failed 0, Errors 0, Passed 67` and `[success] Total time: 534 s`, but the wrapper then remained running and emitted timeout diagnostics for two tests, `dev.bosatsu.Regressions` ("test complex recursion case from #196") and `dev.bosatsu.OperatorTest` ("test operator precedence in real programs"). Because the required pre-push gate did not exit cleanly, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

MergeXO's fetch and merge of origin/main completed successfully; local head is f87e34c with only docs/design/1933-design-for-gpu-support.md added against origin/main. I ran the required scripts/test_basic.sh. The command started under sbt 1.12.10 and began compiling, but it remained running and silent for an extended period after the last output (Compiling 1 protobuf files... / done compiling) and never produced a clean pass/fail exit during this turn. Because the required pre-push gate did not complete cleanly, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

The GitHub Actions run 25018956348 was cancelled during setup/checkout/cache handling, before any project test failure was shown in the provided log, so there is no code or design-doc repair to apply. Locally, the checkout is clean at f87e34c and the only diff against origin/main is docs/design/1933-design-for-gpu-support.md. I ran the required scripts/test_basic.sh under sbt 1.12.10; it completed sbt setup and early compilation, then remained running and silent for an extended period after the protobuf/base compilation output without producing a final pass/fail exit. Because the required pre-push gate did not complete cleanly during this turn, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

MergeXO's fetch and merge of origin/main completed successfully; local head is f54ae30 with only docs/design/1933-design-for-gpu-support.md added against origin/main. I ran the required scripts/test_basic.sh. The sbt test command reported Passed: Total 74, Failed 0, Errors 0, Passed 74 and [success] Total time: 221 s, but the wrapper process remained running for several minutes after the sbt success summary and did not produce a clean shell exit during this turn. Because the required pre-push gate did not complete cleanly, I am not returning a ready-to-push commit_message.

@johnynek

Owner Author

MergeXO feedback automation is blocked because the agent returned commit_message but no new staged changes or local commits were detected.

Action: request concrete file edits (or explicit git_ops), then unblock this PR.

After resolving this, use /mergexo unblock on this PR.
