Exposing cuda profiling API through warp #1591

Closed
felixmey wants to merge 2 commits into NVIDIA:main from felixmey:fmeyer/added-cudaProfilingAPI

@felixmey felixmey commented Jun 26, 2026

Description

Exposes CUDA's profiler control API (cudaProfilerStart/cudaProfilerStop) through Warp's public Python API. This lets users restrict an external profiler's data collection to a region of interest—for example, profiling only a few simulation steps after warm-up instead of capturing an entire run, including JIT compilation, allocations, and GPU clock ramp-up.

The new entry points are:

  • wp.cuda_profiler_start() — begin profiler data collection (equivalent to cudaProfilerStart).
  • wp.cuda_profiler_stop() — end profiler data collection (equivalent to cudaProfilerStop).
  • wp.cuda_profiler_range() — a context manager that brackets a region with the two calls, stopping cleanly even if the body raises.

The calls are process-global rather than tied to a device, and are no-ops on CPU-only builds. They only mark the region; the external profiler must be told to honor it (nsys profile --capture-range=cudaProfilerApi or ncu --profile-from-start off).
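The bracketing behavior described for wp.cuda_profiler_range() can be illustrated with a small sketch. This is not the code from this PR: profiler_range, fake_start, and fake_stop are hypothetical names, and the start/stop calls are injected as plain callables so the snippet runs without Warp or a GPU.

```python
from contextlib import contextmanager


@contextmanager
def profiler_range(start, stop):
    """Bracket a region with start()/stop(), stopping even if the body raises."""
    start()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        stop()


# Stand-in callables (hypothetical) that record the call order.
events = []


def fake_start():
    events.append("start")


def fake_stop():
    events.append("stop")


try:
    with profiler_range(fake_start, fake_stop):
        events.append("body")
        raise RuntimeError("simulated failure inside the profiled region")
except RuntimeError:
    pass

print(events)  # ['start', 'body', 'stop']
```

The try/finally is what guarantees the stop call when the body raises; without it, an exception would leave collection running for the rest of the process.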

Changes

  • Native (warp.cu/warp.cpp/warp.h): Add wp_cuda_profiler_start/wp_cuda_profiler_stop exports calling cudaProfilerStart/cudaProfilerStop (via <cuda_profiler_api.h>), with no-op stubs in the CPU-only path.
  • Python API (_src/context.py): Register the new ctypes signatures in Runtime.init and add the cuda_profiler_start, cuda_profiler_stop, and cuda_profiler_range wrappers; re-export them through warp/__init__.py and warp/__init__.pyi.
  • Docs: New "Limiting the Profiler Capture Range" section in docs/deep_dive/profiling.rst (with Nsight Systems / Nsight Compute usage), an Nsight Compute cross-reference, and a "CUDA Profiler" entry in the API reference.
  • Example: warp/examples/core/example_cuda_profiler.py — an N-body simulation that warms up, then brackets the profiled steps with wp.cuda_profiler_range().
  • Tests: warp/tests/test_cuda_profiler.py registered in unittest_suites.py.
  • CHANGELOG.md: Added an entry under Unreleased.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • CHANGELOG.md is updated for any user-facing changes under the Unreleased section.

Validation summary

warp/tests/test_cuda_profiler.py (TestCudaProfiler) covers the Python-level contract, since profiler toggling itself is only observable from an external tool:

  • test_start_stop_invoke_core — mocks the native entry points and confirms wp.cuda_profiler_start()/wp.cuda_profiler_stop() each invoke their core binding exactly once.
  • test_range_invokes_start_then_stop — confirms the context manager calls start on entry and defers stop until exit.
  • test_range_stops_on_exception — confirms the finally block still stops profiling when the body raises.
  • test_smoke_on_cuda (CUDA-only) — exercises the real native binding end-to-end, validating the ctypes signature and that the calls run without raising around a real kernel launch.
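As a rough illustration of the mock-based checks listed above, the following sketch records call order on a mocked binding. It is an assumption-laden stand-in rather than the contents of test_cuda_profiler.py: make_range and the profiler_start/profiler_stop attribute names are hypothetical.

```python
from contextlib import contextmanager
from unittest import mock


def make_range(core):
    """Build a context manager around a (possibly mocked) native binding."""

    @contextmanager
    def profiler_range():
        core.profiler_start()
        try:
            yield
        finally:
            core.profiler_stop()

    return profiler_range


# A mock restricted to the two entry points; calls are recorded in order.
core = mock.Mock(spec=["profiler_start", "profiler_stop"])
profiler_range = make_range(core)

with profiler_range():
    # Inside the body only start has run; stop is deferred until exit.
    calls_inside = list(core.mock_calls)

calls_after = list(core.mock_calls)
print(calls_inside, calls_after)
```

Comparing the recorded calls inside the body and after exit checks both the ordering and the "exactly once" property in a single pass.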

The example was run manually under nsys profile --capture-range=cudaProfilerApi to confirm collection is limited to the bracketed region.

New feature / enhancement

import warp as wp

wp.init()

example = ...  # your simulation

# warm up outside the capture range
for _ in range(200):
    example.step()

# only this region is collected by an external profiler launched with
#   nsys profile --capture-range=cudaProfilerApi  (or  ncu --profile-from-start off)
with wp.cuda_profiler_range():
    for _ in range(100):
        example.step()
    wp.synchronize_device()  # include trailing async work before stop
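
The snippet above synchronizes one device before the range closes. When the profiled work spans several devices, each of them would need to be flushed before the stop call, otherwise trailing asynchronous work may fall outside the capture. A minimal sketch of that ordering follows, with the synchronize and stop calls passed in as plain callables so it runs without a GPU. The name stop_after_sync and the device labels are hypothetical and not part of this PR.

```python
def stop_after_sync(devices, synchronize, stop):
    """Flush every participating device, then stop profiler collection."""
    for device in devices:
        synchronize(device)
    # Only stop once all devices have finished their queued work.
    stop()


log = []
stop_after_sync(
    devices=["cuda:0", "cuda:1"],
    synchronize=lambda d: log.append(f"sync {d}"),
    stop=lambda: log.append("stop"),
)
print(log)  # ['sync cuda:0', 'sync cuda:1', 'stop']
```

With Warp, the arguments would presumably be wp.get_cuda_devices(), wp.synchronize_device, and the wp.cuda_profiler_stop proposed here. That pairing is an assumption and was not run against this branch.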

Summary by CodeRabbit

  • New Features
    • Added Python CUDA profiler controls (start, stop) and a cuda_profiler_range() context manager to capture a scoped profiling region.
    • Added a CUDA profiler-focused example that profiles a GPU workload within the capture range.
  • Documentation
    • Documented the CUDA profiler APIs and how to limit capture ranges, including recommended external profiler workflows.
  • Bug Fixes
    • Ensured profiling stops automatically on exit, including when an error occurs, and behaves as a safe no-op when CUDA isn’t available.
  • Tests
    • Added unit and smoke tests for profiler start/stop and range behavior.

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 76735206-5a96-410e-ab4e-212479e79ddd

📥 Commits

Reviewing files that changed from the base of the PR and between fdc19a2 and 8a68a45.

📒 Files selected for processing (12)
  • CHANGELOG.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/profiling.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/examples/core/example_cuda_profiler.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/test_cuda_profiler.py
  • warp/tests/unittest_suites.py
💤 Files with no reviewable changes (2)
  • warp/tests/unittest_suites.py
  • warp/tests/test_cuda_profiler.py
✅ Files skipped from review due to trivial changes (3)
  • docs/api_reference/warp.rst
  • CHANGELOG.md
  • docs/deep_dive/profiling.rst
🚧 Files skipped from review as they are similar to previous changes (7)
  • warp/native/warp.cpp
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/native/warp.h
  • warp/_src/context.py
  • warp/native/warp.cu
  • warp/examples/core/example_cuda_profiler.py

📝 Walkthrough

Walkthrough

Adds CUDA profiler control APIs to Warp’s native layer and Python surface, documents the capture-range workflow, adds a profiling example, and extends tests and suite registration.

Changes

CUDA profiler support

| Layer / File(s) | Summary |
|---|---|
| Native CUDA profiler entry points: warp/native/warp.h, warp/native/warp.cu, warp/native/warp.cpp | Declares wp_cuda_profiler_start() and wp_cuda_profiler_stop(), calls cudaProfilerStart()/cudaProfilerStop() with CUDA error checking, and adds non-CUDA no-op stubs. |
| Python profiler API and exports: warp/_src/context.py, warp/__init__.py, warp/__init__.pyi | Adds ctypes bindings, Python wrappers, a cuda_profiler_range() context manager, and top-level warp exports for the profiler controls. |
| Profiler documentation: CHANGELOG.md, docs/api_reference/warp.rst, docs/deep_dive/profiling.rst | Documents the new profiler APIs, adds API reference entries, and describes capture-range usage with external profilers. |
| CUDA profiler example: warp/examples/core/example_cuda_profiler.py | Adds the N-body example kernels, simulation setup, stepping logic, and a script entrypoint that profiles a timed run inside wp.cuda_profiler_range(). |
| Profiler tests and suite registration: warp/tests/test_cuda_profiler.py, warp/tests/unittest_suites.py | Adds unit tests for start/stop and range behavior, a CUDA smoke test, and suite registration for the new test class. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: exposing CUDA profiling APIs through Warp. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 markdownlint-cli2 (0.22.1)
CHANGELOG.md

markdownlint-cli2 wrapper config was not available before execution


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Around line 19-21: The new changelog bullet in CHANGELOG.md is missing the
required issue reference for a user-facing API change. Update the
wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range()
entry in the Unreleased section to include the appropriate GH-... reference in
the same style as nearby items, while keeping the wording in imperative present
tense and avoiding implementation details.

In `@docs/deep_dive/profiling.rst`:
- Around line 363-394: Clarify the profiling guidance so it states that all
participating CUDA work must be synchronized before stopping the profiler, not
just a single wp.synchronize_device() call. Update the wording around
wp.cuda_profiler_range() and wp.cuda_profiler_stop() to mention that
multi-device or cross-stream workloads must flush every relevant device/stream
before capture ends, while keeping the existing examples for Nsight Systems and
Nsight Compute aligned with this behavior.

In `@warp/examples/core/example_cuda_profiler.py`:
- Around line 136-162: The CUDA profiler example currently runs even when
args.device is CPU or CUDA is unavailable, but wp.cuda_profiler_range() only
makes sense for a CUDA capture. Update example_cuda_profiler.py around the
wp.ScopedDevice(args.device) / wp.cuda_profiler_range() flow to require a CUDA
device up front and fail fast with a clear error if the selected device is not
CUDA. Keep the warmup/profile loops inside the CUDA-only path so Example.step()
is not executed in a misleading CPU fallback.
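
The fail-fast check requested for the example could look roughly like this. It is a sketch under assumptions: require_cuda is a hypothetical helper, and the device objects are stand-ins carrying the is_cuda and alias attributes that Warp's device objects are assumed to expose.

```python
from types import SimpleNamespace


def require_cuda(device):
    """Return the device unchanged, or raise if it is not a CUDA device."""
    if not getattr(device, "is_cuda", False):
        alias = getattr(device, "alias", device)
        raise RuntimeError(
            f"This example profiles CUDA work and needs a CUDA device, got '{alias}'."
        )
    return device


# Stand-in device objects (hypothetical) so the sketch runs without Warp.
gpu = SimpleNamespace(alias="cuda:0", is_cuda=True)
cpu = SimpleNamespace(alias="cpu", is_cuda=False)

accepted = require_cuda(gpu)

message = None
try:
    require_cuda(cpu)
except RuntimeError as err:
    message = str(err)
    print(message)
```

Running the check before the warm-up loop keeps the example from stepping the simulation on a CPU fallback where the capture range has no effect.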

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 91fcbf12-321d-4168-a371-683693b8c7a3

📥 Commits

Reviewing files that changed from the base of the PR and between afbd4ed and 8d44869.

📒 Files selected for processing (12)
  • CHANGELOG.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/profiling.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/examples/core/example_cuda_profiler.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/test_cuda_profiler.py
  • warp/tests/unittest_suites.py

Comment thread CHANGELOG.md
Comment thread docs/deep_dive/profiling.rst Outdated
Comment thread warp/examples/core/example_cuda_profiler.py Outdated

greptile-apps Bot commented Jun 26, 2026

Greptile Summary

This PR exposes CUDA's profiler control API (cudaProfilerStart/cudaProfilerStop) through Warp's Python surface, enabling users to bracket a region of interest for external profilers like Nsight Systems and Nsight Compute. CPU-only stubs are included so the calls are safe no-ops on non-CUDA builds.

  • Adds wp_cuda_profiler_start/wp_cuda_profiler_stop to the native layer (warp.cu CUDA path, warp.cpp CPU-only stubs, warp.h declarations), mirroring the existing pattern for other CUDA-specific exports in this codebase.
  • Registers ctypes signatures and exposes three Python wrappers — cuda_profiler_start, cuda_profiler_stop, and the cuda_profiler_range context manager — re-exported through __init__.py and typed in __init__.pyi.
  • Adds tests (mock-based for the Python contract, a smoke test for CUDA devices, and a no-op test for CPU builds), an N-body example, and updated documentation.

Confidence Score: 5/5

Safe to merge — the change is additive, CPU-only stubs are present, and the new symbols cannot break existing callers.

The implementation is thin and purely additive: two empty no-op stubs in the CPU path, two one-liner check_cuda wrappers in the CUDA path, and three Python functions that call them unconditionally. The pattern is identical to other CUDA-specific exports already in the codebase. No existing call sites are modified, and the new functions cannot be reached unless the caller explicitly imports and invokes them.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| warp/_src/context.py | Registers ctypes argtypes/restype for the two new native entry points and adds three clean Python wrappers; follows existing patterns for no-argument CUDA functions. |
| warp/native/warp.cu | Adds #include <cuda_profiler_api.h> and two one-liner implementations using check_cuda, consistent with the existing CUDA function style in this file (no WP_API prefix, matching the pattern of wp_cuda_context_synchronize etc.). |
| warp/native/warp.cpp | Adds empty CPU-only stubs for wp_cuda_profiler_start and wp_cuda_profiler_stop with WP_API, making calls safe no-ops on non-CUDA builds. |
| warp/native/warp.h | Adds WP_API declarations for the two profiler control functions, consistent with the surrounding API surface. |
| warp/tests/test_cuda_profiler.py | Four tests cover the Python-level contract via mocks, a CUDA smoke test, and a CPU-only no-op path; skip conditions and assertions are correct. |
| warp/examples/core/example_cuda_profiler.py | Self-contained N-body example demonstrating the new API with warm-up, synchronize-before-stop, and clear CLI options. |
| warp/__init__.py | Re-exports the three new symbols with the standard as-X pattern under a clear category comment. |
| warp/__init__.pyi | Adds stub imports for the three new symbols; missing blank-line separator between timing constants and the new profiler block (cosmetic). |
| docs/deep_dive/profiling.rst | New section explains the capture range API with correct nsys/ncu command-line flags, synchronize advice, and cross-references. |
| docs/api_reference/warp.rst | Adds CUDA Profiler autosummary entries for the three new symbols under the correct section. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User as Python User Code
    participant WP as warp Python API
    participant Ctypes as ctypes binding
    participant Native as warp.cu / warp.cpp stub

    User->>WP: wp.cuda_profiler_start()
    WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
    Ctypes->>Native: wp_cuda_profiler_start()
    alt CUDA build
        Native->>Native: cudaProfilerStart()
    else CPU-only build
        Native->>Native: (no-op stub)
    end

    User->>WP: with wp.cuda_profiler_range():
    WP->>Ctypes: runtime.core.wp_cuda_profiler_start()
    Ctypes->>Native: wp_cuda_profiler_start()
    User->>User: kernel launches / simulation steps
    User->>WP: wp.synchronize_device()
    Note over User,WP: ensure async work finishes before stop
    WP->>Ctypes: runtime.core.wp_cuda_profiler_stop()
    Ctypes->>Native: wp_cuda_profiler_stop()
    alt CUDA build
        Native->>Native: cudaProfilerStop()
    else CPU-only build
        Native->>Native: (no-op stub)
    end

Reviews (3): Last reviewed commit: "Merge branch 'main' into fmeyer/added-cu..."

felixmey force-pushed the fmeyer/added-cudaProfilingAPI branch from 8d44869 to fdc19a2 on June 26, 2026 15:18

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
warp/tests/test_cuda_profiler.py (1)

69-87: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a real CPU-only no-op smoke test.

This file only hits the real binding on CUDA builds. The PR also promises that these APIs are safe no-ops on CPU-only builds, so a small skipIf(wp.is_cuda_available(), ...) test that calls wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range() would catch stub/binding regressions on CPU-only wheels.
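
A CPU-only smoke test along these lines might be structured as below. The sketch is not taken from the PR: CUDA_AVAILABLE stands in for wp.is_cuda_available(), and profiler_start/profiler_stop are hypothetical stand-ins for the CPU stubs so the snippet runs without Warp installed.

```python
import unittest

# Stand-in for wp.is_cuda_available(); False models a CPU-only build.
CUDA_AVAILABLE = False


def profiler_start():
    """Stand-in for the CPU-only stub: does nothing."""


def profiler_stop():
    """Stand-in for the CPU-only stub: does nothing."""


class TestProfilerNoOp(unittest.TestCase):
    @unittest.skipIf(CUDA_AVAILABLE, "CPU-only no-op check")
    def test_noop_on_cpu(self):
        # The calls should return without raising and without a value.
        self.assertIsNone(profiler_start())
        self.assertIsNone(profiler_stop())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestProfilerNoOp)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The skipIf condition is the inverse of the one a CUDA smoke test would use, so on any given build exactly one of the two paths exercises the real binding.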

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warp/tests/test_cuda_profiler.py` around lines 69 - 87, Add a CPU-only smoke
test in test_cuda_profiler that runs only when wp.is_cuda_available() is false
and explicitly exercises wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and
wp.cuda_profiler_range() to verify they are safe no-ops on non-CUDA builds. Keep
the existing CUDA test as-is, and add the new no-op coverage in the same test
class so regressions in the CPU stub or binding can be caught without requiring
a GPU.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@warp/tests/test_cuda_profiler.py`:
- Around line 69-87: Add a CPU-only smoke test in test_cuda_profiler that runs
only when wp.is_cuda_available() is false and explicitly exercises
wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and wp.cuda_profiler_range()
to verify they are safe no-ops on non-CUDA builds. Keep the existing CUDA test
as-is, and add the new no-op coverage in the same test class so regressions in
the CPU stub or binding can be caught without requiring a GPU.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 3ab8c2ad-0a29-490f-b0da-1930e2a476a2

📥 Commits

Reviewing files that changed from the base of the PR and between 8d44869 and fdc19a2.

📒 Files selected for processing (12)
  • CHANGELOG.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/profiling.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/examples/core/example_cuda_profiler.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/test_cuda_profiler.py
  • warp/tests/unittest_suites.py
✅ Files skipped from review due to trivial changes (4)
  • warp/native/warp.h
  • docs/deep_dive/profiling.rst
  • docs/api_reference/warp.rst
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • warp/__init__.pyi
  • warp/native/warp.cpp
  • warp/tests/unittest_suites.py
  • warp/native/warp.cu
  • warp/examples/core/example_cuda_profiler.py
  • warp/__init__.py
  • warp/_src/context.py

Add wp.cuda_profiler_start(), wp.cuda_profiler_stop(), and the
wp.cuda_profiler_range() context manager, wrapping cudaProfilerStart/
cudaProfilerStop so users can restrict an external profiler's capture
range (e.g. Nsight Systems --capture-range=cudaProfilerApi, Nsight
Compute --profile-from-start off) to a region of interest.

Includes native bindings (CUDA implementation plus CPU stubs), public
API exports, profiling guide documentation, an N-body example, and unit
tests covering the wrapper logic and the native binding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Felix Meyer <fmeyer@nvidia.com>
felixmey force-pushed the fmeyer/added-cudaProfilingAPI branch from fdc19a2 to 022867e on June 26, 2026 16:04
@felixmey felixmey closed this Jun 29, 2026