
Commit f925417

Improve grammar and spelling in documentation
1 parent 8746f8b commit f925417

3 files changed, +34 -28 lines changed

README.md

Lines changed: 21 additions & 15 deletions
@@ -12,9 +12,13 @@
 
 
 
-_Kernel Launcher_ is a C++ library that enables dynamic compilation of _CUDA_ kernels at run time (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them in an easy type-safe way using C++ magic.
-On top of that, Kernel Launcher supports _capturing_ kernel launches, to enable tuning by [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the tuning results, known as _wisdom_ files, back into the application.
-The result: highly efficient GPU applications with maximum portability.
+**Kernel Launcher** is a C++ library for dynamically compiling _CUDA_ kernels at runtime (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them using C++ magic in a way that is type-safe and user-friendly, with minimal boilerplate.
+
+
+On top of that, Kernel Launcher supports **tuning** the GPU kernels in your application.
+This is done by **capturing** kernel launches, replaying them with an **auto-tuning tool** such as [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the results, saved as **wisdom** files, during runtime kernel compilation.
+
+The result: **highly efficient** GPU applications with **maximum portability**.
 
 
 
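For context, the workflow this new introduction describes looks roughly as follows in code. This is a sketch pieced together from the pragma-based example later in this README; the `kl` alias for the `kernel_launcher` namespace is assumed from those examples.

```cpp
#include "kernel_launcher.h"

// Assumed alias for the library namespace, matching the README examples.
namespace kl = kernel_launcher;

int main() {
    // Allocate and fill GPU buffers; this is outside the scope of Kernel Launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Compile the annotated kernel at runtime with NVRTC (on first use) and
    // launch it. Grid and block sizes are derived from the kernel
    // specifications and the runtime arguments.
    kl::launch(
        kl::PragmaKernel("vector_add", "kernel.cu", {"float"}),
        n, dev_C, dev_A, dev_B
    );
}
```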
@@ -25,11 +29,11 @@ Recommended installation is using CMake. See the [installation guide](https://ke
 
 ## Example
 
-There are many ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
+There are several ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
 
 
 ### Pragma-based API
-Below shows an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
+Below is an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
 
 **kernel.cu**
 ```cpp
@@ -51,7 +55,7 @@ __global__ void vector_add(int n, T *C, const T *A, const T *B) {
 #include "kernel_launcher.h"
 
 int main() {
-    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
+    // Initialize CUDA memory. This is outside the scope of Kernel Launcher.
     unsigned int n = 1000000;
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
@@ -61,7 +65,7 @@ int main() {
 
     // Launch the kernel! Again, the grid size and block size do not need to
     // be specified, they are calculated from the kernel specifications and
-    // run-time arguments.
+    // runtime arguments.
     kl::launch(
         kl::PragmaKernel("vector_add", "kernel.cu", {"float"}),
         n, dev_C, dev_A, dev_B
@@ -73,7 +77,7 @@ int main() {
 
 ### Builder-based API
 Below shows an example of the `KernelBuilder`-based API.
-This offers more flexiblity than the pragma-based API, but is also more verbose:
+This offers more flexibility than the pragma-based API, but is also more verbose:
 
 **kernel.cu**
 ```cpp
@@ -114,9 +118,9 @@ int main() {
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
 
-    // Launch the kernel! Note that kernel is compiled on the first call.
-    // The grid size and block size do not need to be specified, they are
-    // derived from the kernel specifications and run-time arguments.
+    // Launch the kernel! Note that the kernel is compiled on the first call.
+    // The grid size and block size do not need to be specified as they are
+    // derived from the kernel specifications and runtime arguments.
     vector_add_kernel(n, dev_C, dev_A, dev_B);
 }
 ```
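The hunk above shows only the launch call; the construction of the builder, the configuration, and the kernel object is elided from this diff. Below is a hypothetical sketch of that missing glue, based on the `KernelBuilder` walkthrough in docs/examples/basic.rst; the `insert` and `compile` method names and the `kl::Kernel` template parameters are assumptions, not verbatim API.

```cpp
// Blueprint: kernel name and source file.
kl::KernelBuilder builder("vector_add", "kernel.cu");

// Declare a tunable parameter; `tune` returns a placeholder expression.
auto threads_per_block = builder.tune("threads_per_block", {32, 64, 128, 256});
builder.problem_size(kl::arg0).block_size(threads_per_block);

// Pick concrete values for the tunables (could also come from a wisdom file).
kl::Config config;
config.insert(threads_per_block, 128);

// Store the kernel instance and compile it only once; as the comment in the
// diff notes, compilation may happen lazily on the first call.
kl::Kernel<unsigned int, float*, const float*, const float*> vector_add_kernel;
vector_add_kernel.compile(builder, config);

// Launch, as shown in the hunk above.
vector_add_kernel(n, dev_C, dev_A, dev_B);
```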
@@ -136,12 +140,14 @@ If you use Kernel Launcher in your work, please cite the following publication:
 
 As BibTeX:
 
-```Latex
-@article{heldens2023kernellauncher,
+```latex
+@inproceedings{heldens2023kernellauncher,
   title={Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications},
   author={Heldens, Stijn and van Werkhoven, Ben},
-  journal={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IPDPS 2023},
-  year={2023}
+  booktitle={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023},
+  year={2023},
+  pages={744-753},
+  doi={10.1109/IPDPSW59300.2023.00126}
 }
 ```

docs/examples/basic.rst

Lines changed: 6 additions & 6 deletions
@@ -49,7 +49,7 @@ Here, we define two tunable parameters: the number of threads per block and the
    :lineno-start: 15
 
 The values returned by ``tune`` are placeholder objects.
-These objects can be combined using C++ operators to create new expressions objects.
+These objects can be combined using C++ operators to create new expression objects.
 Note that ``elements_per_block`` does not actually contain a specific value;
 instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
 
@@ -59,10 +59,10 @@ instead, it is an abstract expression that, upon kernel instantiation, is evalua
 
 Next, we define properties of the kernel such as block size and template arguments.
 These properties can take on expressions, as demonstrated above.
-The full list of properties is documented as :doc:`api/KernelBuilder`
+The full list of properties is documented at :doc:`api/KernelBuilder`.
 The following properties are supported:
 
-* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
+* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
 * ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
 * ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
 * ``template_args``: This property specifies template arguments, which can be type names and integral values.
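Putting the placeholder expressions and these four properties together, the setup might look like the sketch below. It assumes the ``builder`` from this example; the value lists are illustrative, and the chained method names mirror the property names above (as does the ``kl::type_of`` helper), which may differ slightly from the actual API.

```cpp
// Tunables are placeholders; combining them with C++ operators yields
// expression objects that are evaluated only at kernel instantiation.
auto threads_per_block = builder.tune("threads_per_block", {32, 64, 128, 256});
auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});
auto elements_per_block = threads_per_block * elements_per_thread;

builder
    .problem_size(kl::arg0)               // one-dimensional; size taken from kernel argument 0
    .block_size(threads_per_block)        // block dimensions (x, y, z)
    .grid_divisor(elements_per_block)     // elements processed per thread block
    .template_args(kl::type_of<float>()); // template arguments: type names or integral values
```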
@@ -76,7 +76,7 @@ The following properties are supported:
    :lineno-start: 26
 
 The configuration defines the values of the tunable parameters to be used for compilation.
-Here, the ``Config`` instance is constructed manually, but it could also be loaded from file or a tuning database.
+Here, the ``Config`` instance is constructed manually, but it could also be loaded from a file or a tuning database.
 
 .. literalinclude:: basic.cpp
    :lines: 31-33
@@ -91,12 +91,12 @@ The ``Kernel`` instance should be stored, for example, in a class and only compi
 
 To launch the kernel, we simply call ``launch``.
 
-Alternatively, it is also possible to use the short-hand form::
+Alternatively, it is also possible to use the shorthand form::
 
     // Launch the kernel!
    vector_add_kernel(n, dev_C, dev_A, dev_B);
 
-To pass a CUDA stream use::
+To pass a CUDA stream, use::
 
     // Launch the kernel!
    vector_add_kernel(stream, n, dev_C, dev_A, dev_B);

docs/examples/wisdom.rst

Lines changed: 7 additions & 7 deletions
@@ -4,15 +4,15 @@
 Wisdom Files
 ============
 
-In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
+In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the *blueprint* for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
 
 
 However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
 To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).
 
 To use the Kernel Launcher's wisdom files, we need to run the application twice.
 First, we **capture** the kernels that we want to tune, and then we use Kernel Tuner to tune those kernels.
-Second, when we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
+Second, we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
 
 Let's see this in action.
 
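Before the code explanation, here is a rough sketch of what the two runs share: the same ``WisdomKernel`` call site serves both the capture run and the tuned run. The constructor arguments and fallback behavior below are assumptions based on the explanation that follows, not verbatim from the example; ``n`` and the device pointers stand in for the application's own buffers.

```cpp
kl::KernelBuilder builder("vector_add", "kernel.cu");

// A WisdomKernel selects its configuration from a wisdom file when the
// kernel is compiled; if no wisdom file is found, a default is used.
// Constructor signature assumed: tuning key plus builder.
kl::WisdomKernel vector_add_kernel("vector_add_float", builder);

// Same call in both runs: during capture it records the launch for tuning;
// afterwards it runs with the tuned configuration from the wisdom file.
vector_add_kernel(n, dev_C, dev_A, dev_B);
```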
@@ -34,7 +34,7 @@ main.cpp
 Code Explanation
 ----------------
 
-Notice how this example is similar to the previous example, with some minor differences such that ``kl::Kernel`` has been replaced by ``kl::WisdomKernel``.
+Notice how this example is similar to the previous example, with some minor differences; most notably, ``kl::Kernel`` has been replaced by ``kl::WisdomKernel``.
 We now highlight the important lines of this code example.
 
 .. literalinclude:: wisdom.cpp
@@ -59,12 +59,12 @@ If no wisdom file can be found, the default configuration is used (in this examp
    :lines: 25-26
    :lineno-start: 25
 
-The following two lines of code set global configuration for the application.
+The following two lines of code set the global configuration for the application.
 
 The function ``set_global_wisdom_directory`` sets the directory where Kernel Launcher will search for wisdom files associated with a compiled kernel.
 In this example, the directory ``wisdom/`` is set as the wisdom directory, and Kernel Launcher will search for the file ``wisdom/vector_add_float.wisdom`` since ``vector_add_float`` is the tuning key.
 
-The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store resulting files when capturing a kernel launch.
+The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store the resulting files when capturing a kernel launch.
 
 .. literalinclude:: wisdom.cpp
    :lines: 28-30
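In code, the two settings discussed in this hunk amount to the following. The function names and directory names are taken from this page; the ``kl`` namespace qualification is assumed to match the rest of the API.

```cpp
// Where Kernel Launcher searches for tuned results, e.g.
// wisdom/vector_add_float.wisdom for the tuning key "vector_add_float".
kl::set_global_wisdom_directory("wisdom/");

// Where captured kernel launches are written for later tuning.
kl::set_global_capture_directory("captures/");
```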
@@ -97,7 +97,7 @@ See :doc:`../env_vars` for an overview and description of additional environment
 
 Tune the kernel
 ---------------
-To tune the kernel, run the Python script ``tune.py`` in the directory ``python/`` which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
+To tune the kernel, run the Python script ``tune.py`` in the directory ``python/``, which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
 To view all available options, use ``--help``.
 For example, to spend 10 minutes tuning the kernel for the current GPU, use the following command::
 
@@ -109,7 +109,7 @@ To tune multiple kernels at once, use a wildcard::
 
 If everything goes well, the script should run for ten minutes and eventually generate a file ``wisdom/vector_add_float.wisdom`` containing the tuning results.
 Note that it is possible to tune the same kernel for different GPUs and problem sizes, and all results will be saved in the same wisdom file.
-After tuning, the files in the ``captures/`` directory can be removed safely.
+After tuning, the files in the ``captures/`` directory can be safely removed.
 
0 commit comments