README.md (21 additions, 15 deletions)
@@ -12,9 +12,13 @@
-_Kernel Launcher_ is a C++ library that enables dynamic compilation of _CUDA_ kernels at run time (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them in an easy type-safe way using C++ magic.
-On top of that, Kernel Launcher supports _capturing_ kernel launches, to enable tuning by [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the tuning results, known as _wisdom_ files, back into the application.
-The result: highly efficient GPU applications with maximum portability.
+**Kernel Launcher** is a C++ library for dynamically compiling _CUDA_ kernels at runtime (using [NVRTC](https://docs.nvidia.com/cuda/nvrtc/index.html)) and launching them using C++ magic in a way that is type-safe, user-friendly, and with minimal boilerplate.
+
+
+On top of that, Kernel Launcher supports **tuning** the GPU kernels in your application.
+This is done by **capturing** kernel launches, replaying them with an **auto-tuning tool** such as [Kernel Tuner](https://github.com/KernelTuner/kernel_tuner), and importing the results, saved as **wisdom** files, during runtime kernel compilation.
+
+The result: **highly efficient** GPU applications with **maximum portability**.
@@ -25,11 +29,11 @@ Recommended installation is using CMake. See the [installation guide](https://ke
 ## Example
 
-There are many ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
+There are several ways of using Kernel Launcher. See the documentation for [examples](https://kerneltuner.github.io/kernel_launcher/example.html) or check out the [examples/](https://github.com/KernelTuner/kernel_launcher/tree/master/examples) directory.
 
 ### Pragma-based API
-Below shows an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
+Below is an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.
 
 **kernel.cu**
 ```cpp
@@ -51,7 +55,7 @@ __global__ void vector_add(int n, T *C, const T *A, const T *B) {
 #include "kernel_launcher.h"
 
 int main() {
-    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
+    // Initialize CUDA memory. This is outside the scope of Kernel Launcher.
     unsigned int n = 1000000;
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
@@ -61,7 +65,7 @@ int main() {
     // Launch the kernel! Again, the grid size and block size do not need to
     // be specified, they are calculated from the kernel specifications and
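For context, the pragma-based API annotates the kernel source itself; the kernel body is elided from this diff. The following is only a rough sketch of what such an annotated kernel might look like — the directive names and syntax are assumptions, not taken from this commit (the kernel signature comes from the hunk header above):

```cpp
// Hypothetical sketch of a pragma-annotated kernel; the directive names are
// assumed for illustration and should be checked against the documentation.
#pragma kernel tune(threads_per_block=32, 64, 128, 256, 512, 1024)
#pragma kernel problem_size(n)
#pragma kernel block_size(threads_per_block)
template<typename T>
__global__ void vector_add(int n, T *C, const T *A, const T *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```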
@@ … @@
 Below shows an example of the `KernelBuilder`-based API.
-This offers more flexiblity than the pragma-based API, but is also more verbose:
+This offers more flexibility than the pragma-based API, but is also more verbose:
 
 **kernel.cu**
 ```cpp
@@ -114,9 +118,9 @@ int main() {
     float *dev_A, *dev_B, *dev_C;
     /* cudaMalloc, cudaMemcpy, ... */
 
-    // Launch the kernel! Note that kernel is compiled on the first call.
-    // The grid size and block size do not need to be specified, they are
-    // derived from the kernel specifications and run-time arguments.
+    // Launch the kernel! Note that the kernel is compiled on the first call.
+    // The grid size and block size do not need to be specified as they are
+    // derived from the kernel specifications and runtime arguments.
 
     vector_add_kernel(n, dev_C, dev_A, dev_B);
 }
 ```
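The hunks above only show the launch site; the builder setup and kernel definition fall outside this diff. As a hedged sketch of how the pieces might fit together (the `kl::Kernel` template signature and `compile` call are assumptions, not taken from this commit):

```cpp
// Hypothetical sketch of the code surrounding the launch shown above.
kl::KernelBuilder builder("vector_add", "kernel.cu");
/* declare tunables and set properties on the builder ... */

kl::Config config;
/* assign a value to each tunable parameter ... */

// Assumed shape: a typed kernel object, compiled once, then called like a function.
kl::Kernel<unsigned int, float*, const float*, const float*> vector_add_kernel;
vector_add_kernel.compile(builder, config);

vector_add_kernel(n, dev_C, dev_A, dev_B);
```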
@@ -136,12 +140,14 @@ If you use Kernel Launcher in your work, please cite the following publication:
 As BibTeX:
 
-```Latex
-@article{heldens2023kernellauncher,
+```latex
+@inproceedings{heldens2023kernellauncher,
   title={Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications},
   author={Heldens, Stijn and van Werkhoven, Ben},
-  journal={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IPDPS 2023},
-  year={2023}
+  booktitle={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023},
docs/examples/basic.rst (6 additions, 6 deletions)
@@ -49,7 +49,7 @@ Here, we define two tunable parameters: the number of threads per block and the
    :lineno-start: 15
 
 The values returned by ``tune`` are placeholder objects.
-These objects can be combined using C++ operators to create new expressions objects.
+These objects can be combined using C++ operators to create new expression objects.
 Note that ``elements_per_block`` does not actually contain a specific value;
 instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
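To make the expression mechanics concrete, here is a hedged sketch (not the literal basic.cpp code; the ``KernelBuilder`` constructor and ``tune`` signature are assumed from the surrounding text):

```cpp
// Hypothetical sketch: tunables are placeholders, and C++ operators on them
// build expression objects that are evaluated only at kernel instantiation.
kl::KernelBuilder builder("vector_add", "vector_add.cu");
auto threads_per_block = builder.tune("threads_per_block", {32, 64, 128, 256});
auto elements_per_thread = builder.tune("elements_per_thread", {1, 2, 4, 8});

// No concrete value yet: an abstract expression for the product described above.
auto elements_per_block = threads_per_block * elements_per_thread;
```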
@@ -59,10 +59,10 @@ instead, it is an abstract expression that, upon kernel instantiation, is evalua
 
 Next, we define properties of the kernel such as block size and template arguments.
 These properties can take on expressions, as demonstrated above.
-The full list of properties is documented as :doc:`api/KernelBuilder`
+The full list of properties is documented as :doc:`api/KernelBuilder`.
 The following properties are supported:
 
-* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
+* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
 * ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
 * ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
 * ``template_args``: This property specifies template arguments, which can be type names and integral values.
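As a hedged illustration of the four properties listed above (a sketch assuming each property has a same-named setter on ``KernelBuilder``; ``kl::type_of`` is likewise an assumption):

```cpp
// Hypothetical sketch: one setter per property described in the list above.
builder.problem_size(kl::arg0);          // 1-D problem size, read from argument 0
builder.block_size(threads_per_block);   // block dimensions (x, y, z)
builder.grid_divisor(threads_per_block * elements_per_thread);  // elements per block
builder.template_args(kl::type_of<float>());  // template arguments: types or integrals
```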
@@ -76,7 +76,7 @@ The following properties are supported:
    :lineno-start: 26
 
 The configuration defines the values of the tunable parameters to be used for compilation.
-Here, the ``Config`` instance is constructed manually, but it could also be loaded from file or a tuning database.
+Here, the ``Config`` instance is constructed manually, but it could also be loaded from a file or a tuning database.
 
 .. literalinclude:: basic.cpp
    :lines: 31-33
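For reference, a manually constructed ``Config`` might look roughly like this (a sketch; the ``insert`` method name is an assumption, not confirmed by this diff):

```cpp
// Hypothetical sketch: assign one concrete value to each tunable parameter.
kl::Config config;
config.insert(threads_per_block, 128);
config.insert(elements_per_thread, 4);
```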
@@ -91,12 +91,12 @@ The ``Kernel`` instance should be stored, for example, in a class and only compi
 
 To launch the kernel, we simply call ``launch``.
 
-Alternatively, it is also possible to use the short-hand form::
+Alternatively, it is also possible to use the shorthand form::
 
     // Launch the kernel!
     vector_add_kernel(n, dev_C, dev_A, dev_B);
 
-To pass a CUDA stream use::
+To pass a CUDA stream, use::
 
     // Launch the kernel!
    vector_add_kernel(stream, n, dev_C, dev_A, dev_B);
docs/examples/wisdom.rst (7 additions, 7 deletions)
@@ -4,15 +4,15 @@
 Wisdom Files
 ============
 
-In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
+In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the *blueprint* for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).
 
 However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
 To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).
 
 To use the Kernel Launcher's wisdom files, we need to run the application twice.
 First, we **capture** the kernels that we want to tune, and then we use Kernel Tuner to tune those kernels.
-Second, when we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
+Second, we run the application again, but this time the kernel configuration is **selected** from the wisdom file that was generated during the tuning process.
 
 Let's see this in action.
@@ -34,7 +34,7 @@ main.cpp
 Code Explanation
 ----------------
 
-Notice how this example is similar to the previous example, with some minor differences such that ``kl::Kernel`` has been replaced by ``kl::WisdomKernel``.
+Notice how this example is similar to the previous example, with some minor differences, such as ``kl::Kernel`` being replaced by ``kl::WisdomKernel``.
 We now highlight the important lines of this code example.
 
 .. literalinclude:: wisdom.cpp
@@ -59,12 +59,12 @@ If no wisdom file can be found, the default configuration is used (in this examp
    :lines: 25-26
    :lineno-start: 25
 
-The following two lines of code set global configuration for the application.
+The following two lines of code set the global configuration for the application.
 
 The function ``set_global_wisdom_directory`` sets the directory where Kernel Launcher will search for wisdom files associated with a compiled kernel.
 In this example, the directory ``wisdom/`` is set as the wisdom directory, and Kernel Launcher will search for the file ``wisdom/vector_add_float.wisdom`` since ``vector_add_float`` is the tuning key.
 
-The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store resulting files when capturing a kernel launch.
+The function ``set_global_capture_directory`` sets the directory where Kernel Launcher will store the resulting files when capturing a kernel launch.
 
 .. literalinclude:: wisdom.cpp
    :lines: 28-30
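Concretely, the two calls discussed above might look like this (the function names and directory names come from the text; the ``kl`` namespace prefix is assumed):

```cpp
// Sketch of the two global-configuration calls described above.
kl::set_global_wisdom_directory("wisdom/");     // where wisdom files are searched for
kl::set_global_capture_directory("captures/");  // where capture files are written
```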
@@ -97,7 +97,7 @@ See :doc:`../env_vars` for an overview and description of additional environment
 Tune the kernel
 ---------------
 
-To tune the kernel, run the Python script ``tune.py`` in the directory ``python/`` which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
+To tune the kernel, run the Python script ``tune.py`` in the directory ``python/``, which uses `Kernel Tuner <https://kerneltuner.github.io/>`_ to tune the kernel.
 To view all available options, use ``--help``.
 For example, to spend 10 minutes tuning the kernel for the current GPU, use the following command::
@@ -109,7 +109,7 @@ To tune multiple kernels at once, use a wildcard::
 
 If everything goes well, the script should run for ten minutes and eventually generate a file ``wisdom/vector_add_float.wisdom`` containing the tuning results.
 Note that it is possible to tune the same kernel for different GPUs and problem sizes, and all results will be saved in the same wisdom file.
-After tuning, the files in the ``captures/`` directory can be removed safely.
+After tuning, the files in the ``captures/`` directory can be safely removed.