Implement parallel cuda::std::stable_partition #8238

miscco wants to merge 1 commit into NVIDIA:main

Conversation
Performance looks good, except for small arrays:
This implements the stable partition algorithm for the CUDA backend.

* `std::stable_partition`, see https://en.cppreference.com/w/cpp/algorithm/stable_partition.html

It provides tests and benchmarks similar to Thrust, plus some boilerplate for libcu++. The functionality is not publicly available yet and is implemented in a private internal header.

Fixes NVIDIA#8228
🥳 CI Workflow Results 🟩 Finished in 1h 26m: Pass: 100%/108 | Total: 1d 13h | Max: 1h 07m | Hits: 94%/281794. See results here.

Do we understand where the perf differences come from? I appreciate we are slower in some cases, but why are we not seeing the exact same perf as Thrust?
```cpp
// Partition cannot run inplace, so we need to first copy the input into temporary storage
_CCCL_TRY_CUDA_API(
  CUB_NS_QUALIFIER::DeviceTransform::TransformIf,
  "__pstl_cuda_stable_partition: kernel launch of cub::DeviceTransform::TransformIf failed",
  tuple<_InputIterator>{__first},
  __storage.template __get_ptr<1>(),
  __count,
  CUB_NS_QUALIFIER::detail::transform::always_true_predicate{},
  identity{},
  __stream.get());
```
Suggestion: Could we just call cuda::std::copy here instead?
I wanted to avoid that because of the overhead and also the nesting of NVTX ranges; the call is so lightweight.
Also, we cannot call the full algorithms, because they are synchronous and we do not want to introduce a stream synchronization here.
```cpp
const size_t __num_selected =
  __dispatch(__policy, ::cuda::std::move(__first), ::cuda::std::move(__last), ::cuda::std::move(__pred));

return __result + static_cast<iter_difference_t<_InputIterator>>(__num_selected);
```
Suggestion: I would let `__pstl_dispatch<__pstl_algorithm::__stable_partition, __execution_backend::__cuda>::operator()` just return the iterator instead of the number of selected items.
```cpp
template <class Policy>
void test_partition(const Policy& policy, thrust::device_vector<int>& input)
```
Important: Please also add a test that covers the stable nature of the algorithm. For example, partitioning a vector of pairs based on their first value, and then check whether the order of the second value was retained.