Skip to content

Implement parallel cuda::std::stable_partition#8238

Open
miscco wants to merge 1 commit intoNVIDIA:mainfrom
miscco:parallel_stable_partition
Open

Implement parallel cuda::std::stable_partition#8238
miscco wants to merge 1 commit intoNVIDIA:mainfrom
miscco:parallel_stable_partition

Conversation

@miscco
Copy link
Copy Markdown
Contributor

@miscco miscco commented Mar 31, 2026

This implements the stable partition algorithm for the cuda backend.

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes #8228

@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 31, 2026
@miscco miscco requested review from a team as code owners March 31, 2026 14:59
@miscco miscco requested review from elstehle and pciolkosz March 31, 2026 14:59
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 31, 2026
@miscco
Copy link
Copy Markdown
Contributor Author

miscco commented Mar 31, 2026

Performance looks good, except for small arrays:

['thrust_stable_partition.json', 'pstl_stable_partition.json']
# base

## [1] NVIDIA RTX A6000

|  T{ct}  |  Elements  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|-----------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^16    |     1     |  30.179 us |       4.76% |  29.136 us |       3.93% |  -1.043 us |  -3.46% |   SAME   |
|   I8    |    2^20    |     1     |  36.844 us |       4.72% |  32.081 us |      13.19% |  -4.763 us | -12.93% |   FAST   |
|   I8    |    2^24    |     1     | 152.954 us |       0.78% | 153.300 us |       1.66% |   0.346 us |   0.23% |   SAME   |
|   I8    |    2^28    |     1     |   1.963 ms |       1.11% |   2.104 ms |       0.34% | 140.343 us |   7.15% |   SLOW   |
|   I8    |    2^16    |   0.544   |  30.298 us |       5.31% |  29.278 us |       5.05% |  -1.020 us |  -3.37% |   SAME   |
|   I8    |    2^20    |   0.544   |  36.571 us |       3.90% |  31.378 us |      10.85% |  -5.193 us | -14.20% |   FAST   |
|   I8    |    2^24    |   0.544   | 172.664 us |       0.66% | 175.749 us |       1.66% |   3.085 us |   1.79% |   SLOW   |
|   I8    |    2^28    |   0.544   |   2.511 ms |       0.73% |   2.659 ms |       0.48% | 148.345 us |   5.91% |   SLOW   |
|   I8    |    2^16    |     0     |  30.130 us |       3.65% |  29.193 us |       6.98% |  -0.937 us |  -3.11% |   SAME   |
|   I8    |    2^20    |     0     |  37.782 us |       3.99% |  32.574 us |      11.30% |  -5.207 us | -13.78% |   FAST   |
|   I8    |    2^24    |     0     | 204.069 us |       0.86% | 208.020 us |       1.38% |   3.951 us |   1.94% |   SLOW   |
|   I8    |    2^28    |     0     |   2.996 ms |       0.70% |   3.169 ms |       0.52% | 173.603 us |   5.80% |   SLOW   |
|   I16   |    2^16    |     1     |  23.812 us |       2.61% |  22.819 us |       4.78% |  -0.993 us |  -4.17% |   FAST   |
|   I16   |    2^20    |     1     |  39.898 us |       4.19% |  37.668 us |       5.10% |  -2.230 us |  -5.59% |   FAST   |
|   I16   |    2^24    |     1     | 233.253 us |       3.25% | 230.850 us |       1.54% |  -2.403 us |  -1.03% |   SAME   |
|   I16   |    2^28    |     1     |   3.311 ms |       0.90% |   3.356 ms |       0.60% |  44.972 us |   1.36% |   SLOW   |
|   I16   |    2^16    |   0.544   |  30.447 us |       5.87% |  29.085 us |       6.86% |  -1.362 us |  -4.47% |   SAME   |
|   I16   |    2^20    |   0.544   |  40.506 us |       1.98% |  38.793 us |       5.41% |  -1.713 us |  -4.23% |   FAST   |
|   I16   |    2^24    |   0.544   | 279.278 us |       0.74% | 278.312 us |       1.27% |  -0.966 us |  -0.35% |   SAME   |
|   I16   |    2^28    |   0.544   |   4.186 ms |       0.75% |   4.235 ms |       0.54% |  48.690 us |   1.16% |   SLOW   |
|   I16   |    2^16    |     0     |  30.059 us |       2.64% |  29.251 us |       5.49% |  -0.808 us |  -2.69% |   FAST   |
|   I16   |    2^20    |     0     |  41.284 us |       3.57% |  39.863 us |       5.06% |  -1.421 us |  -3.44% |   SAME   |
|   I16   |    2^24    |     0     | 342.282 us |       0.74% | 341.046 us |       1.13% |  -1.237 us |  -0.36% |   SAME   |
|   I16   |    2^28    |     0     |   5.165 ms |       0.63% |   5.211 ms |       0.54% |  45.747 us |   0.89% |   SLOW   |
|   I32   |    2^16    |     1     |  24.073 us |       5.62% |  22.907 us |       4.82% |  -1.167 us |  -4.85% |   FAST   |
|   I32   |    2^20    |     1     |  41.817 us |       1.84% |  35.707 us |       3.13% |  -6.110 us | -14.61% |   FAST   |
|   I32   |    2^24    |     1     | 416.785 us |       0.28% | 413.820 us |       0.64% |  -2.964 us |  -0.71% |   FAST   |
|   I32   |    2^28    |     1     |   6.334 ms |       0.52% |   6.354 ms |       0.10% |  20.521 us |   0.32% |   SLOW   |
|   I32   |    2^16    |   0.544   |  30.664 us |       5.28% |  29.494 us |       3.76% |  -1.170 us |  -3.81% |   FAST   |
|   I32   |    2^20    |   0.544   |  49.686 us |       1.95% |  43.188 us |       2.76% |  -6.498 us | -13.08% |   FAST   |
|   I32   |    2^24    |   0.544   | 507.327 us |       0.27% | 504.586 us |       0.55% |  -2.741 us |  -0.54% |   FAST   |
|   I32   |    2^28    |   0.544   |   7.887 ms |       0.52% |   7.905 ms |       0.08% |  17.372 us |   0.22% |   SLOW   |
|   I32   |    2^16    |     0     |  30.682 us |       2.64% |  29.760 us |       5.21% |  -0.922 us |  -3.01% |   FAST   |
|   I32   |    2^20    |     0     |  51.045 us |       2.36% |  45.657 us |       4.13% |  -5.388 us | -10.55% |   FAST   |
|   I32   |    2^24    |     0     | 615.270 us |       0.28% | 612.040 us |       0.49% |  -3.230 us |  -0.53% |   FAST   |
|   I32   |    2^28    |     0     |   9.506 ms |       0.62% |   9.528 ms |       0.08% |  21.585 us |   0.23% |   SLOW   |
|   I64   |    2^16    |     1     |  25.723 us |       4.89% |  23.506 us |       5.13% |  -2.217 us |  -8.62% |   FAST   |
|   I64   |    2^20    |     1     |  71.080 us |       1.27% |  65.907 us |       3.15% |  -5.174 us |  -7.28% |   FAST   |
|   I64   |    2^24    |     1     | 812.423 us |       0.29% | 807.194 us |       0.34% |  -5.229 us |  -0.64% |   FAST   |
|   I64   |    2^28    |     1     |  12.649 ms |       0.33% |  12.642 ms |       0.05% |  -6.832 us |  -0.05% |   FAST   |
|   I64   |    2^16    |   0.544   |  32.716 us |       5.12% |  30.384 us |       2.90% |  -2.332 us |  -7.13% |   FAST   |
|   I64   |    2^20    |   0.544   |  82.335 us |       1.29% |  75.231 us |       2.64% |  -7.104 us |  -8.63% |   FAST   |
|   I64   |    2^24    |   0.544   | 992.800 us |       0.17% | 989.138 us |       2.36% |  -3.662 us |  -0.37% |   FAST   |
|   I64   |    2^28    |   0.544   |  15.517 ms |       0.37% |  15.513 ms |       0.05% |  -3.684 us |  -0.02% |   SAME   |
|   I64   |    2^16    |     0     |  32.778 us |       4.66% |  30.362 us |       2.90% |  -2.416 us |  -7.37% |   FAST   |
|   I64   |    2^20    |     0     |  98.944 us |       1.11% |  91.952 us |       2.50% |  -6.992 us |  -7.07% |   FAST   |
|   I64   |    2^24    |     0     |   1.208 ms |       0.15% |   1.203 ms |       0.24% |  -4.716 us |  -0.39% |   FAST   |
|   I64   |    2^28    |     0     |  18.956 ms |       0.28% |  18.955 ms |       0.11% |  -0.489 us |  -0.00% |   SAME   |
|  I128   |    2^16    |     1     |  28.586 us |       3.99% |  23.575 us |       5.07% |  -5.011 us | -17.53% |   FAST   |
|  I128   |    2^20    |     1     | 122.405 us |       2.29% | 115.386 us |       1.65% |  -7.018 us |  -5.73% |   FAST   |
|  I128   |    2^24    |     1     |   1.605 ms |       0.17% |   1.603 ms |       0.24% |  -2.180 us |  -0.14% |   SAME   |
|  I128   |    2^28    |     1     |  25.365 ms |       0.19% |  25.387 ms |       0.05% |  21.911 us |   0.09% |   SLOW   |
|  I128   |    2^16    |   0.544   |  35.593 us |       2.31% |  30.330 us |       3.44% |  -5.264 us | -14.79% |   FAST   |
|  I128   |    2^20    |   0.544   | 149.305 us |       0.95% | 141.124 us |      15.91% |  -8.181 us |  -5.48% |   FAST   |
|  I128   |    2^24    |   0.544   |   1.989 ms |       1.80% |   1.975 ms |       0.20% | -14.218 us |  -0.71% |   FAST   |
|  I128   |    2^28    |   0.544   |  31.338 ms |       0.11% |  31.356 ms |       0.04% |  18.070 us |   0.06% |   SLOW   |
|  I128   |    2^16    |     0     |  36.027 us |       2.74% |  30.034 us |       3.14% |  -5.993 us | -16.64% |   FAST   |
|  I128   |    2^20    |     0     | 179.388 us |       0.64% | 170.127 us |       2.39% |  -9.260 us |  -5.16% |   FAST   |
|  I128   |    2^24    |     0     |   2.428 ms |       0.87% |   2.423 ms |       0.17% |  -4.314 us |  -0.18% |   FAST   |
|  I128   |    2^28    |     0     |  38.433 ms |       0.05% |  38.486 ms |       0.03% |  52.482 us |   0.14% |   SLOW   |
|   F32   |    2^16    |     1     |  24.539 us |       3.85% |  22.561 us |       4.74% |  -1.978 us |  -8.06% |   FAST   |
|   F32   |    2^20    |     1     |  42.586 us |       1.84% |  36.096 us |       2.81% |  -6.490 us | -15.24% |   FAST   |
|   F32   |    2^24    |     1     | 426.293 us |       2.75% | 414.904 us |       0.98% | -11.389 us |  -2.67% |   FAST   |
|   F32   |    2^28    |     1     |   6.366 ms |       0.61% |   6.364 ms |       0.10% |  -1.972 us |  -0.03% |   SAME   |
|   F32   |    2^16    |   0.544   |  31.225 us |       4.68% |  30.282 us |       4.21% |  -0.944 us |  -3.02% |   SAME   |
|   F32   |    2^20    |   0.544   |  51.748 us |       2.23% |  45.713 us |       5.47% |  -6.035 us | -11.66% |   FAST   |
|   F32   |    2^24    |   0.544   | 611.373 us |       0.30% | 607.623 us |       0.46% |  -3.749 us |  -0.61% |   FAST   |
|   F32   |    2^28    |   0.544   |   9.369 ms |       0.53% |   9.365 ms |       0.07% |  -4.185 us |  -0.04% |   SAME   |
|   F32   |    2^16    |     0     |  31.584 us |       5.23% |  29.694 us |       4.41% |  -1.889 us |  -5.98% |   FAST   |
|   F32   |    2^20    |     0     |  51.791 us |       2.39% |  45.221 us |       2.24% |  -6.570 us | -12.69% |   FAST   |
|   F32   |    2^24    |     0     | 616.578 us |       0.27% | 613.071 us |       0.48% |  -3.507 us |  -0.57% |   FAST   |
|   F32   |    2^28    |     0     |   9.527 ms |       0.48% |   9.530 ms |       0.07% |   3.694 us |   0.04% |   SAME   |
|   F64   |    2^16    |     1     |  26.844 us |       7.06% |  23.035 us |       4.23% |  -3.809 us | -14.19% |   FAST   |
|   F64   |    2^20    |     1     |  71.937 us |       1.81% |  65.336 us |       3.80% |  -6.601 us |  -9.18% |   FAST   |
|   F64   |    2^24    |     1     | 812.408 us |       0.23% | 807.173 us |       0.34% |  -5.235 us |  -0.64% |   FAST   |
|   F64   |    2^28    |     1     |  12.658 ms |       0.33% |  12.657 ms |       0.05% |  -0.973 us |  -0.01% |   SAME   |
|   F64   |    2^16    |   0.544   |  33.476 us |       5.00% |  30.172 us |       3.47% |  -3.303 us |  -9.87% |   FAST   |
|   F64   |    2^20    |   0.544   |  97.826 us |       0.95% |  88.960 us |       2.66% |  -8.866 us |  -9.06% |   FAST   |
|   F64   |    2^24    |   0.544   |   1.187 ms |       0.23% |   1.183 ms |       0.29% |  -4.390 us |  -0.37% |   FAST   |
|   F64   |    2^28    |   0.544   |  18.462 ms |       0.34% |  18.454 ms |       0.05% |  -8.047 us |  -0.04% |   SAME   |
|   F64   |    2^16    |     0     |  33.291 us |       3.02% |  30.010 us |       4.02% |  -3.281 us |  -9.86% |   FAST   |
|   F64   |    2^20    |     0     |  99.584 us |       1.16% |  91.254 us |       2.79% |  -8.330 us |  -8.36% |   FAST   |
|   F64   |    2^24    |     0     |   1.208 ms |       0.46% |   1.204 ms |       0.27% |  -4.783 us |  -0.40% |   FAST   |
|   F64   |    2^28    |     0     |  18.965 ms |       0.30% |  18.968 ms |       0.04% |   3.226 us |   0.02% |   SAME   |

@github-actions

This comment has been minimized.

@miscco miscco force-pushed the parallel_stable_partition branch from 53663fa to ac0aaa4 Compare April 1, 2026 06:40
@github-actions

This comment has been minimized.

This implements the stable partition algorithm for the cuda backend.

* `std::stable_partition` see https://en.cppreference.com/w/cpp/algorithm/stable_partition.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is publicly available yet and implemented in a private internal header

Fixes NVIDIA#8228
@miscco miscco force-pushed the parallel_stable_partition branch from ac0aaa4 to 7c470f6 Compare April 8, 2026 08:23
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 26m: Pass: 100%/108 | Total: 1d 13h | Max: 1h 07m | Hits: 94%/281794

See results here.

@bernhardmgruber
Copy link
Copy Markdown
Contributor

Do we understand where the perf differences come from? I appreciate we are slower in some cases, but why are we not seeing the exact same perf as Thrust?

Comment on lines +92 to +101
// Partition cannot run inplace, so we need to first copy the input into temporary storage
_CCCL_TRY_CUDA_API(
CUB_NS_QUALIFIER::DeviceTransform::TransformIf,
"__pstl_cuda_stable_partition: kernel launch of cub::DeviceTransform::TransformIf failed",
tuple<_InputIterator>{__first},
__storage.template __get_ptr<1>(),
__count,
CUB_NS_QUALIFIER::detail::transform::always_true_predicate{},
identity{},
__stream.get());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Could we just call cuda::std::copy here instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid that because of overhead and also nesting NVTX ranges, the call is so lightweight

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we cannot call the full algorithm, because they are synchronous and we do not want to introduce a stream synchronization here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, fine!

const size_t __num_selected =
__dispatch(__policy, ::cuda::std::move(__first), ::cuda::std::move(__last), ::cuda::std::move(__pred));

return __result + static_cast<iter_difference_t<_InputIterator>>(__num_selected);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I would let __pstl_dispatch<__pstl_algorithm::__stable_partition, __execution_backend::__cuda>::operator() just return the iterator already instead of the number of selected items.

Comment on lines +41 to +42
template <class Policy>
void test_partition(const Policy& policy, thrust::device_vector<int>& input)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Please also add a test that covers the stable nature of the algorithm. For example, partitioning a vector of pairs based on their first value, and then check whether the order of the second value was retained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::stable_partition

2 participants