Support a custom comparison operator in `DeviceReduce::ArgMin|Max` by bernhardmgruber · Pull Request #8285 · NVIDIA/cccl

bernhardmgruber · 2026-04-02T23:09:01Z

bernhardmgruber · 2026-04-02T23:14:00Z

cub/cub/device/device_reduce.cuh

+    // TODO(bgruber): this constraint is not accurate, since the implementation will compare the value types of
+    // ExtremumOutIteratorT, which is wrong IMO
+    ::cuda::std::enable_if_t<::cuda::std::indirectly_comparable<InputIteratorT, InputIteratorT, CompareOpT>, int> = 0>


Instead of InputIteratorT we should use non_void_value_t<ExtremumOutIteratorT, it_value_t<InputIteratorT>>, but that just "feels" wrong here. But this is what the implementation does. What do the reviewers think?

I think the implementation should actually be changed to compare the input values, not the converted ones.

I do not follow why that constraint is wrong. We want to ensure that the input sequence is comparable with the passed operator. Why should we compare the value type of ExtremumOutIteratorT?

The reduction implementation does not call compare_op(d_in[i], d_in[j]), it calls something like:

    using input_value_t = it_value_t<InputIteratorT>;
    using accum_t       = non_void_value_t<ExtremumOutIteratorT, input_value_t>;
    accum_t a = d_in[i];
    accum_t b = d_in[j];
    compare_op(a, b);

So it performs a conversion of the input value to the output iterator's value_type before comparing. That can be a totally different type.

I think this is a bug itself, but outside the scope of this PR.
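
To make the concern concrete, here is a minimal host-only sketch of what converting before comparing can do to the result. It is not the CUB implementation: `arg_min_converted` and its `AccumT` parameter are hypothetical names standing in for the `accum_t` conversion described above, and the input data is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side analogue: every input is converted to AccumT
// *before* the comparison operator sees it.
template <typename AccumT, typename InputT, typename CompareOp>
std::size_t arg_min_converted(const std::vector<InputT>& in, CompareOp compare_op)
{
  assert(!in.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < in.size(); ++i)
  {
    const AccumT a = static_cast<AccumT>(in[i]);    // conversion happens first
    const AccumT b = static_cast<AccumT>(in[best]); // ... for both operands
    if (compare_op(a, b))
    {
      best = i;
    }
  }
  return best;
}
```

With inputs `{1.9, 1.2, 3.0}` and `std::less<>`, comparing as `double` picks index 1, while comparing after conversion to `int` truncates both 1.9 and 1.2 to 1, sees a tie, and keeps index 0.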

NaderAlAwar

Suggestion: the issue being closed mentions ArgMax as well in the title, but this PR only appears to add public custom-comparator overloads and test coverage for ArgMin. The internal refactor is more general, but DeviceReduce::ArgMax still seems to expose only the old no-comparator API. I would either create a separate issue for ArgMax or expose the custom comparator overload as well.

bernhardmgruber · 2026-04-04T20:32:33Z

Suggestion: the issue being closed mentions ArgMax as well in the title, but this PR only appears to add public custom-comparator overloads and test coverage for ArgMin.

So I temporarily added a new overload for ArgMax as well, but then I noticed that the implementation is actually identical to ArgMin(..., std::not_fn(compare_op), ...) and then I wondered whether a simple negation of the predicate deserves another public API overload.

Also, while ArgMin without a comparison operator defaults to std::less, what should ArgMax default to? std::max_element defaults to std::less to find the maximum, but maybe that's confusing to some users at first glance. I thought not adding ArgMax would just avoid the confusion.

Finally, I considered naming the new overload not ArgMin but something like ArgReduce, but that doesn't make any sense either, since the user does not specify the reduction, but the comparison predicate. Maybe we should just call the overload ArgExtremum to distinguish it from ArgMin. It generalizes both ArgMin and ArgMax. @NaderAlAwar let me know what you think!

I would either create a separate issue for ArgMax or expose the custom comparator overload as well.

As my last paragraph points out, the new overload actually generalizes over both, ArgMin and ArgMax, so no more work should be necessary. An ArgMax with a custom comparison is essentially calling ArgMin with that operator.
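
A standard-library analogue of this point, as a rough sketch only: a minimum search with a reversed comparator finds the same element as a maximum search with the default `std::less`. The helper names `arg_max_via_min` and `arg_max_direct` are hypothetical and nothing here calls CUB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Index of the maximum, found by running a *minimum* search with std::greater.
inline std::size_t arg_max_via_min(const std::vector<int>& v)
{
  assert(!v.empty());
  const auto it = std::min_element(v.begin(), v.end(), std::greater<>{});
  return static_cast<std::size_t>(std::distance(v.begin(), it));
}

// Index of the maximum as std::max_element finds it (default comparator: std::less).
inline std::size_t arg_max_direct(const std::vector<int>& v)
{
  assert(!v.empty());
  const auto it = std::max_element(v.begin(), v.end());
  return static_cast<std::size_t>(std::distance(v.begin(), it));
}
```

Both helpers return the first of several equal maxima, because both standard algorithms only move to a later element when the comparison is strictly true.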

NaderAlAwar · 2026-04-06T14:06:35Z

@bernhardmgruber those are good points, I hadn't considered that. Looking into this some more, since the standard library and Thrust already expose comparator overloads for both min_element and max_element, I think matching that symmetry here would be less surprising to users than only exposing a comparator overload on ArgMin. Since the public CUB API already has both ArgMin and ArgMax, I would expect custom-comparator support to be available on both as well.

My worry about ArgExtremum is that its name may be less familiar to users, which could lead them to avoid using it.

std::max_element defaults to std::less to find the maximum, but maybe that's confusing to some users at first glance.

I do agree that this is a little confusing. I don't feel too strongly about this either way since I have not used this extensively but my intuition would be to stay consistent with existing standards unless we believe they are broken in some way.

miscco · 2026-04-07T08:49:40Z

cub/cub/device/dispatch/dispatch_streaming_reduce.cuh

+  // Initial value for empty problems, according to documented contract
+  const auto empty_problem_extremum = static_cast<output_extremum_t>([] {
+    if constexpr (::cuda::std::is_same_v<ReductionOpT, arg_min>)
+    {
+      return ::cuda::std::numeric_limits<input_value_t>::max();
+    }
+    else if constexpr (::cuda::std::is_same_v<ReductionOpT, arg_max>)
+    {
+      return ::cuda::std::numeric_limits<input_value_t>::lowest();
+    }
+    else
+    {
+      return input_value_t{};
+    }
+  }());
+  auto initial_value = empty_problem_init_t<per_partition_accum_t>{{PerPartitionOffsetT{1}, empty_problem_extremum}};


I am really unhappy that we actually need an initial value

It's only needed for the case where the user passes num_items == 0.

I would love for us to change the implementation so that the legacy API without a comparison operator keeps returning a value, while the new API only returns indices, which in that case can just be 0.
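
For readers skimming the hunk above, this is a stripped-down host-only sketch of the same selection logic, assuming three hypothetical tag types in place of the real reduction operators. It only illustrates why the custom-comparator path has no natural sentinel and falls back to a value-initialized `T`.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Hypothetical tags standing in for the reduction operators in the diff.
struct arg_min_tag {};
struct arg_max_tag {};
struct custom_compare_tag {};

// Value reported as the "extremum" when num_items == 0.
template <typename ReductionTag, typename T>
constexpr T empty_problem_value()
{
  if constexpr (std::is_same_v<ReductionTag, arg_min_tag>)
  {
    return std::numeric_limits<T>::max(); // nothing can be smaller than "no input"
  }
  else if constexpr (std::is_same_v<ReductionTag, arg_max_tag>)
  {
    return std::numeric_limits<T>::lowest(); // nothing can be larger than "no input"
  }
  else
  {
    return T{}; // arbitrary comparator: no meaningful sentinel exists
  }
}
```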

miscco · 2026-04-07T08:51:57Z

cub/cub/device/device_reduce.cuh

+    // TODO(bgruber): this constraint is not accurate, since the implementation will compare the value types of
+    // ExtremumOutIteratorT, which is wrong IMO
+    ::cuda::std::enable_if_t<::cuda::std::indirectly_comparable<InputIteratorT, InputIteratorT, CompareOpT>, int> = 0>


I do not follow why that constraint is wrong. We want to ensure that the input sequence is comparable with the passed operator. Why should we compare the value type of ExtremumOutIteratorT?
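
As a rough illustration of what a constraint on the input iterator checks, here is a C++17 approximation built on `std::is_invocable_r_v` rather than the C++20 `indirectly_comparable` concept used in the diff. The names `inputs_comparable_v`, `arg_min_sketch`, `point` and `by_x` are all hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <type_traits>

// Hypothetical trait: can CompareOp be called on two dereferenced InputIt values?
template <typename InputIt, typename CompareOp>
inline constexpr bool inputs_comparable_v =
  std::is_invocable_r_v<bool,
                        CompareOp&,
                        typename std::iterator_traits<InputIt>::reference,
                        typename std::iterator_traits<InputIt>::reference>;

// Hypothetical entry point that is only viable when the inputs are comparable.
template <typename InputIt,
          typename CompareOp,
          std::enable_if_t<inputs_comparable_v<InputIt, CompareOp>, int> = 0>
int arg_min_sketch(InputIt first, int num_items, CompareOp compare_op)
{
  assert(num_items > 0);
  int best = 0;
  for (int i = 1; i < num_items; ++i)
  {
    if (compare_op(first[i], first[best]))
    {
      best = i;
    }
  }
  return best;
}

struct point
{
  int x;
};

// Comparator that only understands point, not int.
struct by_x
{
  bool operator()(const point& a, const point& b) const
  {
    return a.x < b.x;
  }
};
```

The sketch compares the input values directly, so the constraint and the comparison agree. If the values were first converted to another type, as described for the real implementation, a check on the input type alone would no longer describe what gets compared.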

miscco · 2026-04-07T08:54:35Z

cub/cub/device/device_reduce.cuh

+    cudaStream_t stream = 0)
+  {
+    return ArgMax(
+      d_temp_storage, temp_storage_bytes, d_in, d_max_out, d_index_out, num_items, ::cuda::std::less{}, stream);


Ditto: use a typed less.

miscco · 2026-04-07T08:55:51Z

cub/cub/thread/thread_operators.cuh

+// Less-than comparator for an index/value pair that compares values first, and indices when the values are equal
+template <typename ValueLessThen = ::cuda::std::less<>>
+struct arg_less : ValueLessThen
 {
-  /// Boolean max operator, preferring the item having the smaller offset in
-  /// case of ties
  template <typename T, typename OffsetT>
  _CCCL_HOST_DEVICE _CCCL_FORCEINLINE ::cuda::std::pair<OffsetT, T>
  operator()(const ::cuda::std::pair<OffsetT, T>& a, const ::cuda::std::pair<OffsetT, T>& b) const
  {
-    if ((b.second > a.second) || ((a.second == b.second) && (b.first < a.first)))
+    const auto& less = static_cast<const ValueLessThen&>(*this);
+    if (less(b.second, a.second) || (!less(a.second, b.second) && b.first < a.first))
    {
      return b;
    }

    return a;
  }


Important: Inheritance is almost always worse than making it a member.

But with a member, we are not getting EBCO. But maybe that's not so important here.

Hmm, I think if I move to a data member, aggregate init would no longer work with the deduction guide in C++17. This can be worked around of course. Do you insist on this change, or can I save myself 43s of typing?

we can keep it as is
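
A small host-only sketch of the trade-off discussed here, assuming `std::pair<OffsetT, T>` and `std::less<>` in place of the `::cuda::std` types. `arg_less_sketch` mirrors the functor in the hunk above, and `arg_less_member_sketch` is a hypothetical variant that stores the comparator as a data member instead.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <utility>

// Hypothetical host-only analogue of arg_less: the value comparator is a base class.
template <typename ValueLess = std::less<>>
struct arg_less_sketch : ValueLess
{
  // Returns whichever pair is "smaller": values first, then the smaller index on ties.
  template <typename OffsetT, typename T>
  std::pair<OffsetT, T> operator()(const std::pair<OffsetT, T>& a, const std::pair<OffsetT, T>& b) const
  {
    const auto& less = static_cast<const ValueLess&>(*this);
    if (less(b.second, a.second) || (!less(a.second, b.second) && b.first < a.first))
    {
      return b;
    }
    return a;
  }
};

// Hypothetical alternative: the comparator is a data member.
template <typename ValueLess = std::less<>>
struct arg_less_member_sketch
{
  ValueLess less;
};

// With an empty comparator, only the inheriting version is itself an empty class.
static_assert(std::is_empty_v<arg_less_sketch<>>, "base-class version stays empty");
static_assert(!std::is_empty_v<arg_less_member_sketch<>>, "member version has a data member");
```

With `std::less<>`, combining `{0, 3}` and `{1, 1}` yields `{1, 1}`, and two pairs with equal values resolve to the one with the smaller index.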

miscco · 2026-04-07T08:57:13Z

cub/cub/thread/thread_operators.cuh

+
+//! @brief Binary functor swapping the arguments to ``operator()`` before forwarding to an inner functor
+template <typename Predicate>
+struct swap_args : Predicate


Question: Why aren't we just using not_fn?

Because not_fn(less{}) is not the same as greater{}; it's greater_equal{}. It should not actually matter, since we are returning the first element that matches the predicate. But I felt swapping the arguments is more accurate.
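
The difference is easy to check on the host. The sketch below uses a hypothetical `swap_args_sketch` modeled on the functor in the hunk above and compares it with `std::not_fn` on equal operands. It says nothing about how the CUB reduction then breaks ties.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for swap_args: forwards to the inner predicate with
// the two arguments exchanged, so swap_args_sketch<less> behaves like greater.
template <typename Predicate>
struct swap_args_sketch : Predicate
{
  template <typename A, typename B>
  bool operator()(A&& a, B&& b) const
  {
    return static_cast<const Predicate&>(*this)(std::forward<B>(b), std::forward<A>(a));
  }
};
```

For equal operands the swapped predicate stays strict and returns false, while `std::not_fn(std::less<>{})` returns true, which is the behavior of `greater_equal`.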

bernhardmgruber · 2026-04-07T14:27:52Z

          Start  65: cub.test.device.reduce.lid_0.types_0
   59/177 Test  #65: cub.test.device.reduce.lid_0.types_0 ...........................   Passed  3285.96 sec

Seems a bit excessive: https://github.com/NVIDIA/cccl/actions/runs/24074951429/job/70228571613?pr=8285

Fixes: NVIDIA#6123

cub/test/catch2_test_device_reduce.cu

bernhardmgruber · 2026-04-08T10:35:37Z

Seems a bit excessive: https://github.com/NVIDIA/cccl/actions/runs/24074951429/job/70228571613?pr=8285

Solved

github-actions · 2026-04-08T12:20:35Z

🥳 CI Workflow Results

🟩 Finished in 1h 42m: Pass: 100%/269 | Total: 9d 10h | Max: 1h 37m | Hits: 62%/176731

bernhardmgruber requested review from a team as code owners April 2, 2026 23:09

bernhardmgruber requested a review from oleksandr-pavlyk April 2, 2026 23:09

github-project-automation bot added this to CCCL Apr 2, 2026

bernhardmgruber requested a review from srinivasyadav18 April 2, 2026 23:09

github-project-automation bot moved this to Todo in CCCL Apr 2, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 2, 2026

bernhardmgruber changed the title ~~Support a custom comparison predicate in DeviceReduce::ArgMin~~ Support a custom comparison operator in DeviceReduce::ArgMin Apr 2, 2026

bernhardmgruber commented Apr 2, 2026

NaderAlAwar approved these changes Apr 3, 2026

bernhardmgruber force-pushed the ref_argmin branch from 6814c46 to 77a24f4 Compare April 4, 2026 20:03

bernhardmgruber mentioned this pull request Apr 4, 2026

Port thrust::min|max_element to CUB #8291

Open

1 task

miscco reviewed Apr 7, 2026

bernhardmgruber changed the title ~~Support a custom comparison operator in DeviceReduce::ArgMin~~ Support a custom comparison operator in DeviceReduce::ArgMin|Max Apr 7, 2026

bernhardmgruber enabled auto-merge (squash) April 7, 2026 11:36

bernhardmgruber added 8 commits April 8, 2026 12:10

Refactor dispatch_streaming_arg_reduce

5e3732b

Support a custom comparison predicate in DeviceReduce::ArgMin

4c7b610

Fixes: NVIDIA#6123

fixes

ff1e7d7

Revert name

6b79112

static cast

84331c0

Unify ArgMin|Max paths

eddc1d8

MSVC

6e4e9f9

ArgMax

47a6436

bernhardmgruber added 3 commits April 8, 2026 12:10

Docs

41e7f09

:cuda::std::numeric_limits<input_value_t>::is_specialized

5da3700

fix bug

6eb17ff

bernhardmgruber commented Apr 8, 2026

cub/test/catch2_test_device_reduce.cu

Avoid capturing vector

ee9c2a4

bernhardmgruber force-pushed the ref_argmin branch from 8b840ad to ee9c2a4 Compare April 8, 2026 10:35

bernhardmgruber merged commit a003464 into NVIDIA:main Apr 8, 2026
287 of 289 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Apr 8, 2026

bernhardmgruber deleted the ref_argmin branch April 8, 2026 13:22

gonidelis pushed a commit to gonidelis/cccl that referenced this pull request Apr 8, 2026

Support a custom comparison operator in DeviceReduce::ArgMin|Max (N…

9f6ccb2

…VIDIA#8285) Fixes: NVIDIA#6123

3 participants
