README.md (21 additions, 15 deletions)
@@ -12,9 +12,13 @@
-_Kernel Launcher_ is a C++ library that enables dynamic compilation of _CUDA_ kernels at run time (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them in an easy type-safe way using C++ magic.
-On top of that, Kernel Launcher supports _capturing_ kernel launches, to enable tuning by [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the tuning results, known as _wisdom_ files, back into the application.
-The result: highly efficient GPU applications with maximum portability.
+**Kernel Launcher** is a C++ library for dynamically compiling _CUDA_ kernels at runtime (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them using C++ magic in a way that is type-safe, user-friendly, and with minimal boilerplate.
+
+
+On top of that, Kernel Launcher supports **tuning** the GPU kernels in your application.
+This is done by **capturing** kernel launches, replaying them with an **auto-tuning tool** such as [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the results, saved as **wisdom** files, during runtime kernel compilation.
+
+The result: **highly efficient** GPU applications with **maximum portability**.
@@ -25,11 +29,11 @@ Recommended installation is using CMake. See the [installation guide](https://ke
 ## Example
 
-There are many ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
+There are several ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
 
 ### Pragma-based API
-Below shows an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
+Below is an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
 
 **kernel.cu**
 ```cpp
@@ -51,7 +55,7 @@ __global__ void vector_add(int n, T *C, const T *A, const T *B) {
 #include "kernel_launcher.h"
 
 int main() {
-    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
+    // Initialize CUDA memory. This is outside the scope of Kernel Launcher.
     unsigned int n = 1000000;
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
@@ -61,7 +65,7 @@ int main() {
     // Launch the kernel! Again, the grid size and block size do not need to
     // be specified, they are calculated from the kernel specifications and
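For context, the pragma-based API annotates the kernel source itself; the kernel body is elided from this diff. The following is only a rough sketch of what such an annotated kernel might look like — the directive names and syntax are assumptions, not taken from this commit (the kernel signature comes from the hunk header above):

```cpp
// Hypothetical sketch of a pragma-annotated kernel; the directive names are
// assumed for illustration and should be checked against the documentation.
#pragma kernel tune(threads_per_block=32, 64, 128, 256, 512, 1024)
#pragma kernel problem_size(n)
#pragma kernel block_size(threads_per_block)
template<typename T>
__global__ void vector_add(int n, T *C, const T *A, const T *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```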
@@ … @@
 Below shows an example of the `KernelBuilder`-based API.
-This offers more flexiblity than the pragma-based API, but is also more verbose:
+This offers more flexibility than the pragma-based API, but is also more verbose:
 
 **kernel.cu**
 ```cpp
@@ -114,9 +118,9 @@ int main() {
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
 
-    // Launch the kernel! Note that kernel is compiled on the first call.
-    // The grid size and block size do not need to be specified, they are
-    // derived from the kernel specifications and run-time arguments.
+    // Launch the kernel! Note that the kernel is compiled on the first call.
+    // The grid size and block size do not need to be specified as they are
+    // derived from the kernel specifications and runtime arguments.
 
     vector_add_kernel(n, dev_C, dev_A, dev_B);
 }
 ```
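The hunks above only show the launch site; the builder setup and kernel definition fall outside this diff. As a hedged sketch of how the pieces might fit together (the `kl::Kernel` template signature and `compile` call are assumptions, not taken from this commit):

```cpp
// Hypothetical sketch of the code surrounding the launch shown above.
kl::KernelBuilder builder("vector_add", "kernel.cu");
/* declare tunables and set properties on the builder ... */

kl::Config config;
/* assign a value to each tunable parameter ... */

// Assumed shape: a typed kernel object, compiled once, then called like a function.
kl::Kernel<unsigned int, float*, const float*, const float*> vector_add_kernel;
vector_add_kernel.compile(builder, config);

vector_add_kernel(n, dev_C, dev_A, dev_B);
```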
@@ -136,12 +140,14 @@ If you use Kernel Launcher in your work, please cite the following publication:
 As BibTeX:
 
-```Latex
-@article{heldens2023kernellauncher,
+```latex
+@inproceedings{heldens2023kernellauncher,
   title={Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications},
   author={Heldens, Stijn and van Werkhoven, Ben},
-  journal={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IPDPS 2023},
-  year={2023}
+  booktitle={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023},
docs/examples/basic.rst (6 additions, 6 deletions)
@@ -49,7 +49,7 @@ Here, we define two tunable parameters: the number of threads per block and the
    :lineno-start: 15
 
 The values returned by ``tune`` are placeholder objects.
-These objects can be combined using C++ operators to create new expressions objects.
+These objects can be combined using C++ operators to create new expression objects.
 Note that ``elements_per_block`` does not actually contain a specific value;
 instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
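To make the expression mechanics concrete, here is a hedged sketch (not the literal basic.cpp code; the ``KernelBuilder`` constructor and ``tune`` signature are assumed from the surrounding text):

```cpp
// Hypothetical sketch: tunables are placeholders, and C++ operators on them
// build expression objects that are evaluated only at kernel instantiation.
kl::KernelBuilder builder("vector_add", "vector_add.cu");
auto threads_per_block = builder.tune("threads_per_block", {32, 64, 128, 256});
auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});

// No concrete value yet: an abstract expression for the product described above.
auto elements_per_block = threads_per_block * elements_per_thread;
```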
@@ -59,10 +59,10 @@ instead, it is an abstract expression that, upon kernel instantiation, is evalua
 
 Next, we define properties of the kernel such as block size and template arguments.
 These properties can take on expressions, as demonstrated above.
-The full list of properties is documented as :doc:`api/KernelBuilder`
+The full list of properties is documented as :doc:`api/KernelBuilder`.
 The following properties are supported:
 
-* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
+* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
 * ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
 * ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
 * ``template_args``: This property specifies template arguments, which can be type names and integral values.
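As a hedged illustration of the four properties listed above (a sketch assuming each property has a same-named setter on ``KernelBuilder``; ``kl::type_of`` is likewise an assumption):

```cpp
// Hypothetical sketch: one setter per property described in the list above.
builder.problem_size(kl::arg0);          // 1-D problem size, read from argument 0
builder.block_size(threads_per_block);   // block dimensions (x, y, z)
builder.grid_divisor(threads_per_block * elements_per_thread);  // elements per block
builder.template_args(kl::type_of<float>());  // template arguments: types or integrals
```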
@@ -76,7 +76,7 @@ The following properties are supported:
    :lineno-start: 26
 
 The configuration defines the values of the tunable parameters to be used for compilation.
-Here, the ``Config`` instance is constructed manually, but it could also be loaded from file or a tuning database.
+Here, the ``Config`` instance is constructed manually, but it could also be loaded from a file or a tuning database.
 
 .. literalinclude:: basic.cpp
    :lines: 31-33
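For reference, a manually constructed ``Config`` might look roughly like this (a sketch; the ``insert`` method name is an assumption, not confirmed by this diff):

```cpp
// Hypothetical sketch: assign one concrete value to each tunable parameter.
kl::Config config;
config.insert(threads_per_block, 128);
config.insert(elements_per_thread, 4);
```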
@@ -91,12 +91,12 @@ The ``Kernel`` instance should be stored, for example, in a class and only compi
 
 To launch the kernel, we simply call ``launch``.
 
-Alternatively, it is also possible to use the short-hand form::
+Alternatively, it is also possible to use the shorthand form::
 
     // Launch the kernel!
     vector_add_kernel(n, dev_C, dev_A, dev_B);
 
-To pass a CUDA stream use::
+To pass a CUDA stream, use::
 
     // Launch the kernel!
    vector_add_kernel(stream, n, dev_C, dev_A, dev_B);
docs/examples/wisdom.rst (7 additions, 7 deletions)
@@ -4,15 +4,15 @@
 Wisdom Files
 ============
 
-In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
+In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the *blueprint* for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
 
 However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
 To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).
 
 To use the Kernel Launcher's wisdom files, we need to run the application twice.
 First, we **capture** the kernels that we want to tune, and then we use Kernel Tuner to tune those kernels.
-Second, when we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
+Second, we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
 
 Let's see this in action.
@@ -34,7 +34,7 @@ main.cpp
 Code Explanation
 ----------------
 
-Notice how this example is similar to the previous example, with some minor differences such that ``kl::Kernel`` has been replaced by ``kl::WisdomKernel``.
+Notice how this example is similar to the previous example, with some minor differences, such as ``kl::Kernel`` being replaced by ``kl::WisdomKernel``.
 We now highlight the important lines of this code example.
 
 .. literalinclude:: wisdom.cpp
@@ -59,12 +59,12 @@ If no wisdom file can be found, the default configuration is used (in this examp
    :lines: 25-26
    :lineno-start: 25
 
-The following two lines of code set global configuration for the application.
+The following two lines of code set the global configuration for the application.
 
 The function ``set_global_wisdom_directory`` sets the directory where Kernel Launcher will search for wisdom files associated with a compiled kernel.
 In this example, the directory ``wisdom/`` is set as the wisdom directory, and Kernel Launcher will search for the file ``wisdom/vector_add_float.wisdom`` since ``vector_add_float`` is the tuning key.
 
-The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store resulting files when capturing a kernel launch.
+The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store the resulting files when capturing a kernel launch.
 
 .. literalinclude:: wisdom.cpp
    :lines: 28-30
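Concretely, the two calls discussed above might look like this (the function names and directory names come from the text; the ``kl`` namespace prefix is assumed):

```cpp
// Sketch of the two global-configuration calls described above.
kl::set_global_wisdom_directory("wisdom/");     // where wisdom files are searched for
kl::set_global_capture_directory("captures/");  // where capture files are written
```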
@@ -97,7 +97,7 @@ See :doc:`../env_vars` for an overview and description of additional environment
 Tune the kernel
 ---------------
 
-To tune the kernel, run the Python script ``tune.py`` in the directory ``python/`` which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
+To tune the kernel, run the Python script ``tune.py`` in the directory ``python/``, which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
 To view all available options, use ``--help``.
 For example, to spend 10 minutes tuning the kernel for the current GPU, use the following command::
@@ -109,7 +109,7 @@ To tune multiple kernels at once, use a wildcard::
 
 If everything goes well, the script should run for ten minutes and eventually generate a file ``wisdom/vector_add_float.wisdom`` containing the tuning results.
 Note that it is possible to tune the same kernel for different GPUs and problem sizes, and all results will be saved in the same wisdom file.
-After tuning, the files in the ``captures/`` directory can be removed safely.
+After tuning, the files in the ``captures/`` directory can be safely removed.