Rework runtime and add Mppa target #47
Conversation
PR #49 is required to run the CI on MacOS.
Update SDist requirements
The goal of the work on the runtime is to facilitate the future
integration of non-host targets. New features are also added, especially
for accelerators.
Create common interfaces that runtimes will need to implement.
This patch introduces a runtime base class named CommonRuntimeInterface
which is common to three derived kinds of runtimes:
- Host (no derived class for now)
- AcceleratorDevice for accelerators.
- EmbeddedDevice for external embedded processors.
Instances of AcceleratorDevice and EmbeddedDevice are called devices.
Unlike the current lazy runtime resolution, the concept of devices makes
it possible to handle multiple accelerators of the same class.
Apply the new runtime interfaces to the two existing runtimes:
- Create a HostRuntime singleton class that derives from CommonRuntimeInterface.
- Create a GPUDevice singleton (for now) class that derives from AcceleratorDevice to implement the GPU runtime.
Some method implementations are shared across the two, but adding a specific implementation for a particular runtime is now easier. The GPU target has been completely split from the Host target. To prevent code duplication due to this clear split between the Host and GPU runtimes, common code portions have been factorized into utils. With this rework of the runtimes, the call path has been simplified; confusing classes like Executor/Evaluator and functions like load_and_evaluate have been removed.
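A minimal sketch of how such a hierarchy might look in Python. The class names (CommonRuntimeInterface, AcceleratorDevice, EmbeddedDevice, HostRuntime, GPUDevice) come from the PR description; the `evaluate` method, its signature, and the singleton mechanics are illustrative assumptions, not the project's actual API:

```python
from abc import ABC, abstractmethod


class CommonRuntimeInterface(ABC):
    """Base interface every runtime implements (method is illustrative)."""

    @abstractmethod
    def evaluate(self, payload: str) -> str: ...


class AcceleratorDevice(CommonRuntimeInterface, ABC):
    """Runtimes for accelerators; instances are called devices."""


class EmbeddedDevice(CommonRuntimeInterface, ABC):
    """Runtimes for external embedded processors; instances are called devices."""


class HostRuntime(CommonRuntimeInterface):
    """Host runtime, a singleton deriving directly from the common interface."""

    _instance = None

    def __new__(cls) -> "HostRuntime":
        # Hypothetical singleton mechanism: reuse the one instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def evaluate(self, payload: str) -> str:
        return f"host:{payload}"


class GPUDevice(AcceleratorDevice):
    """GPU runtime implemented as an accelerator device."""

    def evaluate(self, payload: str) -> str:
        return f"gpu:{payload}"
```

With this shape, shared behavior lives in the base classes while each runtime only overrides what is specific to it.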
In the context of computation offloaded to an accelerator device, the user can specify where input/output tensors live when the evaluation begins. This makes it possible to simulate weight tensors being transferred ahead of time. This feature is only supported by the MLIR backend, by setting a "memref.on_device" attribute.
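The semantics can be illustrated with a toy model (not the project's API): tensors flagged as already living on the device are skipped when planning host-to-device transfers at evaluation time.

```python
def plan_transfers(tensors: list[tuple[str, bool]]) -> list[str]:
    """Given (name, on_device) pairs, return the tensors that still
    need to be copied from the host when evaluation begins."""
    return [name for name, on_device in tensors if not on_device]


# B and C are assumed already resident on the accelerator, e.g. weights
# transferred ahead of time; only A must be copied at evaluation time.
inputs = [("A", False), ("B", True), ("C", True)]
print(plan_transfers(inputs))  # prints ['A']
```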
Create a new Mppa compilation target for MLIR. This target is the first one to implement the ahead-of-time offloading of tensors onto the device. Create a new Mppa runtime derived from AcceleratorDevice. The runtime can be configured in various aspects (check the MppaConfig class). Execution is supported on ISS, Qemu, and hardware.
Note: in order to use the Mppa target, the Kalray Core Toolchain must be installed. mlir_sdist and mlir_mppa must also be installed.
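As a hedged sketch of what such a configuration object could look like: only the class name and the three execution modes (ISS, Qemu, hardware) come from the description above; the field names, defaults, and validation are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MppaConfig:
    """Hypothetical sketch of a Mppa runtime configuration."""

    # Execution backend: "iss", "qemu", or "hw" (from the PR description).
    execution: str = "iss"
    # Assumed field: where the Kalray Core Toolchain is installed.
    toolchain_dir: str = "/opt/kalray"

    def __post_init__(self) -> None:
        if self.execution not in ("iss", "qemu", "hw"):
            raise ValueError(f"unknown execution mode: {self.execution}")
```

A device could then be instantiated as, e.g., `MppaConfig(execution="qemu")` when hardware is not available.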
Rely on the kvxuks-catch pass to catch micro-kernels, replacing the transform-dialect-based vectorization.
@ElectrikSpace I do not understand the status of this review. The description still seems to depend on a review which was merged, and specifies to look only at the last commit. Though it seems that you rebased it already. But there are still 5 commits.

I didn't update the description because the status of the dependent PR was set to merged automatically, but I can remove it.
guillon left a comment
Thanks for the proposal, it's very cool.
In addition to inline comments:
- you may add to docs/develop/optional_backends.md, in the sdist section, possibly the way to test for mppa, as we do not have automation there yet; I can verify if I can run it on our local machine.
- also for the nvgpu, it seems that there are tests missing; can you add there also a section for this target and how to run it? I may try to run it on a grid 5000 machine with a GPU.
number: int,
min_repeat_ms: int,
cfunc: CFunc,
args_tuples: list[Any],
Please verify why it is args_tuples here while it is args in evaluate; the two differ.
from xtc.itf.runtime.common import CommonRuntimeInterface

class EmbeddedDevice(CommonRuntimeInterface, ABC):
    """
    ...

# TODO
#
from .HostRuntime import HostRuntime

__all__ = ["HostRuntime"]
You do not really need __all__ there, as the default is to expose all names except _*.
self._payload_name = payload_name
self._file_name = file_name
self._file_type = file_type
assert self._file_type == "shlib", "only support shlib for JIR Module"
Replace "JIR Module" with "GPU Module" in the assertion message.
def np_init(shape: tuple, dtype: str) -> numpy.typing.NDArray[Any]:
def np_init(shape: tuple, dtype: str, **attrs: Any) -> numpy.typing.NDArray[Any]:
Why do you need extra **attrs here?
mppa = MppaDevice()

I, J, K, dtype = 4, 8, 16, "float32"
a = O.tensor((I, K), dtype, name="A")               # A lives on the host
b = O.tensor((K, J), dtype, name="B", device=mppa)  # B lives on the accelerator

with O.graph(name="matmul") as gb:
    O.matmul(a, b, name="C", device=mppa)           # C must live on the accelerator
# CHECK-NEXT: module attributes {transform.with_named_sequence} {
# CHECK-NEXT: sdist.processor_mesh @processor_mesh from @memory_mesh = <["px"=1, "py"=1, "psx"=2, "psy"=8]>
# CHECK-NEXT: sdist.memory_mesh @memory_mesh = <["mx"=1, "my"=1]>
# CHECK-NEXT: func.func @matmul(%arg0: memref<4x16xf32> {llvm.noalias}, %arg1: memref<16x8xf32> {llvm.noalias, memref.on_device}, %arg2: memref<4x8xf32> {llvm.noalias, memref.on_device}) {
Is the buffer A transferred to the accelerator at some point, or just read directly from main memory?
[ `uname -s` = Darwin ] || env XTC_MLIR_TARGET=nvgpu lit -v tests/filecheck/backends tests/filecheck/mlir_loop tests/filecheck/evaluation

check-lit-mppa:
	env XTC_MLIR_TARGET=mppa lit -v -j 1 tests/filecheck/backends/target_mppa
I guess you should exclude darwin for this target, see above
This PR reworks the runtime to facilitate the support of new targets, especially accelerators.
It also proposes an integration of the Kalray Mppa target.