Enable sccache wrapping of nvcc for full-arch CUDA builds #254
Merged
Wrap nvcc with sccache (CMAKE_CUDA_COMPILER_LAUNCHER) for CUDA builds so that the per-arch .cu device passes, the dominant cost of the ~70 min CUDA job, are cached over Depot alongside the gcc host TUs, rather than only the C/C++ TUs being cached.

With the kernels cached, drop the single-arch validation shortcut: CI no longer sets CUDA_FAST_BUILD/CUDA_ARCH, so every run builds the full CMAKE_CUDA_ARCHITECTURES set (release-safe on PR/push as well as publish) and relies on the warm cache for speed.

- build.sh: add -DCMAKE_CUDA_COMPILER_LAUNCHER=sccache, scoped to CUDA builds (GGML_CUDA in the cmake args), behind the existing probe. Broaden the mid-build retry trigger with "Compiler not supported" so that an sccache that cannot wrap nvcc falls back to an uncached green build instead of failing the job.
- publish.yml: remove CUDA_FAST_BUILD/CUDA_ARCH from the CUDA job and their DOCKCROSS_ARGS passthroughs; full arch every run.
- CLAUDE.md: document nvcc caching and the full-arch CI policy; CUDA_FAST_BUILD stays a local-dev-only knob.

Warm-run verification of nvcc cache hits is still pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PJGUpbfRCjbRcovCTq5v4u
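The build.sh changes described above could look roughly like the following. This is a minimal sketch, not the actual script: the function names `add_cuda_launcher` and `needs_uncached_retry` and the "probe succeeded" flag are hypothetical, and only the `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` flag, the `GGML_CUDA` detection, and the "Compiler not supported" signature come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names are illustrative and not taken from build.sh.

# Print the cmake arguments one per line, appending the nvcc launcher only when
# the build is a CUDA build (GGML_CUDA appears in the args) and sccache is usable.
# $1: "1" if the sccache probe succeeded (assumed to be computed elsewhere).
# Remaining arguments: the cmake args.
add_cuda_launcher() {
  local sccache_ok="$1"; shift
  local is_cuda=0 arg
  for arg in "$@"; do
    case "$arg" in
      *GGML_CUDA*) is_cuda=1 ;;   # substring match, as in "GGML_CUDA in the cmake args"
    esac
    printf '%s\n' "$arg"
  done
  if [ "$sccache_ok" = "1" ] && [ "$is_cuda" = "1" ]; then
    printf '%s\n' "-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache"
  fi
}

# Return 0 when the build log at $1 contains the signature that should trigger
# an uncached retry. Any pre-existing retry signatures are not shown here.
needs_uncached_retry() {
  grep -q -e "Compiler not supported" -- "$1"
}
```

A caller might read the printed arguments back into an array, run the build with its output saved to a log, and rerun without the launcher when `needs_uncached_retry` succeeds on that log. Note that a plain substring match also fires on `-DGGML_CUDA=OFF`; whether the real script guards against that is not stated in the PR.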
withCacheReuse(int) and withSlotId(int) threw IllegalArgumentException with a static message string, which SpotBugs flags as WEM_WEAK_EXCEPTION_MESSAGING and which fails the Build-and-analyze job. Include the offending value in each message so the exception carries dynamic context.

Pre-existing on the base branch; surfaced on this PR's CI. The SessionTest assertion uses containsString and still matches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PJGUpbfRCjbRcovCTq5v4u



Summary
- Wrap nvcc with sccache (`CMAKE_CUDA_COMPILER_LAUNCHER=sccache`) in CUDA builds, so per-architecture `.cu` device passes are cached alongside gcc C/C++ TUs. These passes were previously not cached due to sccache's limited nvcc support.
- Remove `CUDA_FAST_BUILD` and `CUDA_ARCH` from the `crosscompile-linux-x86_64-cuda` job. CI now always builds the full `CMAKE_CUDA_ARCHITECTURES` set on every run (PR/push/dispatch/publish), relying on the warm sccache cache for speed instead of a reduced arch set.
- Keep `CUDA_FAST_BUILD` as a local-dev knob: the env var remains in `build_cuda_linux.sh` for developers who want single-arch builds locally, but CI no longer uses it.
- Detect `GGML_CUDA` in cmake args and conditionally add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`. Include a "Compiler not supported" error signature in the mid-build retry fallback to handle cases where sccache cannot wrap nvcc.

Rationale
The ~70 min CUDA job's dominant cost is nvcc recompiling each `.cu` kernel once per architecture. Previously, only gcc C/C++ TUs were cached; nvcc passes were not. With sccache now wrapping nvcc, the per-arch device passes are cached over Depot, making warm runs much faster. This allows CI to always build the full arch set (release-safe everywhere) without the complexity of conditional single-arch builds for validation runs. The first (cold-cache) run still pays the full nvcc cost; subsequent warm runs benefit from the cache.

Test plan
- `sccache --show-stats` should show CUDA hits on the second build to confirm the speedup

Related issues / PRs
Closes the CUDA caching gap identified in the sccache rollout.
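
The warm-run check from the test plan could be scripted along these lines. This is a sketch under an assumption: it expects the text output of `sccache --show-stats` to contain a line of the form `Cache hits (CUDA)   <count>`, which should be confirmed against the sccache version in use. The helper name `cuda_cache_hits` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `sccache --show-stats` text on stdin and prints the
# CUDA hit count. Assumes a line shaped like "Cache hits (CUDA)   <count>";
# prints 0 when no such line is present.
cuda_cache_hits() {
  awk '/^Cache hits \(CUDA\)/ { n = $NF } END { print n + 0 }'
}
```

After the second (warm) build, `sccache --show-stats | cuda_cache_hits` printing a non-zero number would indicate that nvcc passes were served from the cache.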
Checklist
- `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`

https://claude.ai/code/session_01PJGUpbfRCjbRcovCTq5v4u