Exposing cuda profiling API through warp#1591
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (12)
💤 Files with no reviewable changes (2)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (7)
📝 WalkthroughWalkthroughAdds CUDA profiler control APIs to Warp’s native layer and Python surface, documents the capture-range workflow, adds a profiling example, and extends tests and suite registration. ChangesCUDA profiler support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 markdownlint-cli2 (0.22.1)CHANGELOG.mdmarkdownlint-cli2 wrapper config was not available before execution Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Around line 19-21: The new changelog bullet in CHANGELOG.md is missing the
required issue reference for a user-facing API change. Update the
wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range()
entry in the Unreleased section to include the appropriate GH-... reference in
the same style as nearby items, while keeping the wording in imperative present
tense and avoiding implementation details.
In `@docs/deep_dive/profiling.rst`:
- Around line 363-394: Clarify the profiling guidance so it states that all
participating CUDA work must be synchronized before stopping the profiler, not
just a single wp.synchronize_device() call. Update the wording around
wp.cuda_profiler_range() and wp.cuda_profiler_stop() to mention that
multi-device or cross-stream workloads must flush every relevant device/stream
before capture ends, while keeping the existing examples for Nsight Systems and
Nsight Compute aligned with this behavior.
In `@warp/examples/core/example_cuda_profiler.py`:
- Around line 136-162: The CUDA profiler example currently runs even when
args.device is CPU or CUDA is unavailable, but wp.cuda_profiler_range() only
makes sense for a CUDA capture. Update example_cuda_profiler.py around the
wp.ScopedDevice(args.device) / wp.cuda_profiler_range() flow to require a CUDA
device up front and fail fast with a clear error if the selected device is not
CUDA. Keep the warmup/profile loops inside the CUDA-only path so Example.step()
is not executed in a misleading CPU fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 91fcbf12-321d-4168-a371-683693b8c7a3
📒 Files selected for processing (12)
CHANGELOG.mddocs/api_reference/warp.rstdocs/deep_dive/profiling.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/examples/core/example_cuda_profiler.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/test_cuda_profiler.pywarp/tests/unittest_suites.py
Greptile SummaryThis PR exposes CUDA's profiler control API (
Confidence Score: 5/5Safe to merge — the change is additive, CPU-only stubs are present, and the new symbols cannot break existing callers. The implementation is thin and purely additive: two empty no-op stubs in the CPU path, two one-liner check_cuda wrappers in the CUDA path, and three Python functions that call them unconditionally. The pattern is identical to other CUDA-specific exports already in the codebase. No existing call sites are modified, and the new functions cannot be reached unless the caller explicitly imports and invokes them. No files require special attention. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant User as Python User Code
participant WP as warp Python API
participant Ctypes as ctypes binding
participant Native as warp.cu / warp.cpp stub
User->>WP: wp.cuda_profiler_start()
WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
Ctypes->>Native: wp_cuda_profiler_start()
alt CUDA build
Native->>Native: cudaProfilerStart()
else CPU-only build
Native->>Native: (no-op stub)
end
User->>WP: with wp.cuda_profiler_range():
WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
Ctypes->>Native: wp_cuda_profiler_start()
User->>User: kernel launches / simulation steps
User->>WP: wp.synchronize_device()
Note over User,WP: ensure async work finishes before stop
WP->>Ctypes: runtime.core.wp_cuda_profiler_stop()
Ctypes->>Native: wp_cuda_profiler_stop()
alt CUDA build
Native->>Native: cudaProfilerStop()
else CPU-only build
Native->>Native: (no-op stub)
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant User as Python User Code
participant WP as warp Python API
participant Ctypes as ctypes binding
participant Native as warp.cu / warp.cpp stub
User->>WP: wp.cuda_profiler_start()
WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
Ctypes->>Native: wp_cuda_profiler_start()
alt CUDA build
Native->>Native: cudaProfilerStart()
else CPU-only build
Native->>Native: (no-op stub)
end
User->>WP: with wp.cuda_profiler_range():
WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
Ctypes->>Native: wp_cuda_profiler_start()
User->>User: kernel launches / simulation steps
User->>WP: wp.synchronize_device()
Note over User,WP: ensure async work finishes before stop
WP->>Ctypes: runtime.core.wp_cuda_profiler_stop()
Ctypes->>Native: wp_cuda_profiler_stop()
alt CUDA build
Native->>Native: cudaProfilerStop()
else CPU-only build
Native->>Native: (no-op stub)
end
Reviews (3): Last reviewed commit: "Merge branch 'main' into fmeyer/added-cu..." | Re-trigger Greptile |
8d44869 to
fdc19a2
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
warp/tests/test_cuda_profiler.py (1)
69-87: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a real CPU-only no-op smoke test.
This file only hits the real binding on CUDA builds. The PR also promises that these APIs are safe no-ops on CPU-only builds, so a small
skipIf(wp.is_cuda_available(), ...)test that callswp.cuda_profiler_start(),wp.cuda_profiler_stop(), andwp.cuda_profiler_range()would catch stub/binding regressions on CPU-only wheels.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@warp/tests/test_cuda_profiler.py` around lines 69 - 87, Add a CPU-only smoke test in test_cuda_profiler that runs only when wp.is_cuda_available() is false and explicitly exercises wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range() to verify they are safe no-ops on non-CUDA builds. Keep the existing CUDA test as-is, and add the new no-op coverage in the same test class so regressions in the CPU stub or binding can be caught without requiring a GPU.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@warp/tests/test_cuda_profiler.py`:
- Around line 69-87: Add a CPU-only smoke test in test_cuda_profiler that runs
only when wp.is_cuda_available() is false and explicitly exercises
wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range()
to verify they are safe no-ops on non-CUDA builds. Keep the existing CUDA test
as-is, and add the new no-op coverage in the same test class so regressions in
the CPU stub or binding can be caught without requiring a GPU.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 3ab8c2ad-0a29-490f-b0da-1930e2a476a2
📒 Files selected for processing (12)
CHANGELOG.mddocs/api_reference/warp.rstdocs/deep_dive/profiling.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/examples/core/example_cuda_profiler.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/test_cuda_profiler.pywarp/tests/unittest_suites.py
✅ Files skipped from review due to trivial changes (4)
- warp/native/warp.h
- docs/deep_dive/profiling.rst
- docs/api_reference/warp.rst
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (7)
- warp/init.pyi
- warp/native/warp.cpp
- warp/tests/unittest_suites.py
- warp/native/warp.cu
- warp/examples/core/example_cuda_profiler.py
- warp/init.py
- warp/_src/context.py
Add wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and the wp.cuda_profiler_range() context manager, wrapping cudaProfilerStart/ cudaProfilerStop so users can restrict an external profiler's capture range (e.g. Nsight Systems --capture-range=cudaProfilerApi, Nsight Compute --profile-from-start off) to a region of interest. Includes native bindings (CUDA implementation plus CPU stubs), public API exports, profiling guide documentation, an N-body example, and unit tests covering the wrapper logic and the native binding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Felix Meyer <fmeyer@nvidia.com>
fdc19a2 to
022867e
Compare
Description
Exposes CUDA's profiler control API (
cudaProfilerStart/cudaProfilerStop) through Warp's public Python API. This lets users restrict an external profiler's data collection to a region of interest—for example, profiling only a few simulation steps after warm-up instead of capturing an entire run, including JIT compilation, allocations, and GPU clock ramp-up.The new entry points are:
wp.cuda_profiler_start()— begin profiler data collection (equivalent tocudaProfilerStart).wp.cuda_profiler_stop()— end profiler data collection (equivalent tocudaProfilerStop).wp.cuda_profiler_range()— a context manager that brackets a region with the two calls, stopping cleanly even if the body raises.The calls are process-global rather than tied to a device, and are no-ops on CPU-only builds. They only mark the region; the external profiler must be told to honor it (nsys profile --capture-range=cudaProfilerApi or ncu --profile-from-start off).
Changes
warp.cu/warp.cpp/warp.h): Addwp_cuda_profiler_start/wp_cuda_profiler_stopexports callingcudaProfilerStart/cudaProfilerStop(via<cuda_profiler_api.h>), with no-op stubs in the CPU-only path._src/context.py): Register the new ctypes signatures in Runtime.init and add thecuda_profiler_start,cuda_profiler_stop, andcuda_profiler_rangewrappers; re-export them throughwarp/__init__.pyandwarp/__init__.pyi.warp/examples/core/example_cuda_profiler.py— an N-body simulation that warms up, then brackets the profiled steps withwp.cuda_profiler_range().warp/tests/test_cuda_profiler.pyregistered inunittest_suites.py.CHANGELOG.md: Added an entry under Unreleased.Checklist
Unreleasedsection.Validation summary
warp/tests/test_cuda_profiler.py(TestCudaProfiler) covers the Python-level contract, since profiler toggling itself is only observable from an external tool:test_start_stop_invoke_core— mocks the native entry points and confirms wp.cuda_profiler_start()/wp.cuda_profiler_stop() each invoke their core binding exactly once.test_range_invokes_start_then_stop— confirms the context manager calls start on entry and defers stop until exit.test_range_stops_on_exception— confirms the finally block still stops profiling when the body raises.test_smoke_on_cuda(CUDA-only) — exercises the real native binding end-to-end, validating the ctypes signature and that the calls run without raising around a real kernel launch.The example was run manually under
nsys profile --capture-range=cudaProfilerApito confirm collection is limited to the bracketed region.New feature / enhancement
Summary by CodeRabbit
start,stop) and acuda_profiler_range()context manager to capture a scoped profiling region.