Intel Hardware Support; oneAPI/Level Zero/Intel NEO #24873
daedaevibin
started this conversation in
Ideas
Replies: 1 comment 2 replies
@daedaevibin I suggest using the Level Zero API in the SYCL backend.
llama.cpp prefers to support more users. Thank you!
I am looking to contribute C++ optimization work for Intel compute hardware here and want to gauge where the community's current pain points are.
I see two distinct paths forward and would be happy to tackle either (or both):
1. Level Zero (ze_api) backend: Bypass the SYCL runtime entirely by implementing a direct ggml-levelzero backend targeting the Intel NEO runtime. This would eliminate the multi-gigabyte oneAPI toolchain requirement for end users, lower driver-overhead latency, and bring deployment in line with the minimal, dependency-free philosophy of llama.cpp.
2. SYCL backend: Optimize the existing ggml-sycl.cpp host-side layer, specifically by reducing latency bottlenecks during the handoff from prompt processing to token generation, and by refining Unified Shared Memory (USM) allocation and device synchronization barriers over the underlying Level Zero driver.
Are there structural roadblocks that have kept a raw Level Zero backend from being pursued so far, or is a highly optimized SYCL layer the preferred trajectory for the core team? I have the dev environment ready to benchmark and trace execution directly.
Related: #23313 & #5277
Current Testing Environment
Here is the local hardware setup I am using for initial benchmarking, optimization, and implementation testing:
- 7.0.12-1-cachyos-eevdf-lto
- /dev/nvme1n1p2)
- xe driver (Mesa 26.1.2-arch3.1)
🧪 Call for Community Testers & Contributors
While this integrated setup is perfect for debugging low-level command queues, memory abstractions, and direct driver execution paths over the modern xe stack, testing across a wider matrix of Intel hardware is crucial.
If you have access to any of the following hardware and are willing to help test, benchmark, or collaborate on PR iterations, please reply to this thread:
- i915 kernel driver for older generation runtimes.
Any data points on latency, kernel compilation overhead, or throughput changes will be highly valuable once the code adjustments or backend prototypes go live!
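To keep reported numbers comparable across machines, here is a minimal sketch of a timing harness that testers could adapt. It is only an illustration: LatencyStats, percentile_sorted, summarize, and time_runs are hypothetical names, not part of llama.cpp or ggml, and the percentile uses the simple nearest-rank method. The untimed warm-up runs are there on the assumption that one-time costs such as kernel compilation should be kept out of the steady-state latency figures.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical helper types and names for illustration only;
// none of them exist in llama.cpp or ggml.
struct LatencyStats {
    double min_ms;
    double median_ms;
    double p95_ms;
    double max_ms;
};

// Nearest-rank percentile over a sorted, non-empty sample set.
inline double percentile_sorted(const std::vector<double>& sorted, double pct) {
    if (sorted.empty()) throw std::invalid_argument("no samples");
    const double n = static_cast<double>(sorted.size());
    double rank = std::ceil(pct / 100.0 * n);  // 1-based rank
    if (rank < 1.0) rank = 1.0;                // clamp into [1, n]
    if (rank > n) rank = n;
    return sorted[static_cast<std::size_t>(rank) - 1];
}

// Sorts a copy of the samples and reports min, median, p95, and max.
inline LatencyStats summarize(std::vector<double> samples_ms) {
    if (samples_ms.empty()) throw std::invalid_argument("no samples");
    std::sort(samples_ms.begin(), samples_ms.end());
    return LatencyStats{samples_ms.front(),
                        percentile_sorted(samples_ms, 50.0),
                        percentile_sorted(samples_ms, 95.0),
                        samples_ms.back()};
}

// Runs `work` `warmup` times untimed, then `iterations` times timed.
// Returns one wall-clock sample in milliseconds per timed run.
inline std::vector<double> time_runs(const std::function<void()>& work,
                                     int warmup, int iterations) {
    for (int i = 0; i < warmup; ++i) work();
    std::vector<double> samples_ms;
    if (iterations > 0) samples_ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples_ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return samples_ms;
}
```

A tester would wrap one prompt-processing call or one token-generation step in the callable passed to time_runs, feed the result to summarize, and post the four figures together with the device, kernel driver, and runtime version. Timing the very first run separately, without warm-up, gives a rough view of compilation overhead.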