
Conversation


@Hailey-Zh Hailey-Zh commented Jan 20, 2026

Summary

This PR introduces support for Fused Neighborhood Attention (FNA) optimized specifically for NPU architectures. The implementation focuses on memory efficiency and hardware affinity to prevent performance bottlenecks. Key modifications include:

Grid Dimension Refactoring: Adjusted the attention grid to a 2D structure. This optimizes thread-block mapping and prevents Unified Buffer (UB) overflow, ensuring each tile's working set fits within the NPU's local memory constraints.

NPU-Affinity Softmax: Refactored the Softmax tiling and grid dimensions to align with NPU compute unit sizes, maximizing throughput and reducing synchronization overhead.
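To make the "2D grid + UB-fit" idea above concrete, here is a minimal illustrative sketch. This is not the PR's code: `pick_2d_tile`, `grid_2d`, and the per-tile footprint model are hypothetical, and real UB budgeting on an NPU involves more buffers (e.g. softmax scratch) than modeled here.

```python
def pick_2d_tile(seq_len, head_dim, ub_bytes, dtype_bytes=2):
    """Pick a (tile_q, tile_kv) shape whose working set fits in the UB budget.

    Rough footprint model per tile (an assumption for illustration):
      Q tile:     tile_q  * head_dim
      K tile:     tile_kv * head_dim
      V tile:     tile_kv * head_dim
      score tile: tile_q  * tile_kv
    all counted in dtype_bytes-sized elements.
    """
    tile_q = tile_kv = min(seq_len, 128)  # start from a common upper bound

    def footprint(tq, tkv):
        elems = tq * head_dim + 2 * tkv * head_dim + tq * tkv
        return elems * dtype_bytes

    # Halve the larger dimension until the tile fits in the UB budget.
    while footprint(tile_q, tile_kv) > ub_bytes and (tile_q > 1 or tile_kv > 1):
        if tile_kv >= tile_q and tile_kv > 1:
            tile_kv //= 2
        else:
            tile_q //= 2
    return tile_q, tile_kv


def grid_2d(seq_len, num_heads, tile_q):
    """2D launch grid: one axis over query tiles, one over attention heads."""
    num_q_tiles = (seq_len + tile_q - 1) // tile_q
    return (num_q_tiles, num_heads)
```

For example, with `seq_len=4096`, `head_dim=128`, and a 64 KiB budget, the loop shrinks the tile from (128, 128) down to (64, 64), and `grid_2d` then maps query tiles and heads onto the two grid axes.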

Details

Testing Done

  • Hardware Type: NPU (Ascend 910B3)
  • Ran `make test` to ensure correctness
  • Ran `make checkstyle` to ensure code style
  • Ran `make test-convergence` to ensure convergence

@Tcc0403
Collaborator

Tcc0403 commented Jan 22, 2026

Thank you! Could you also attach the benchmark results and keep the comments in English?

@Hailey-Zh
Author

> Thank you! Could you also attach the benchmark results and keep the comments in English?

This is still a draft, and there are a few outstanding issues to resolve. I will include the benchmark results and switch all comments to English in the final version.

@lowdy1
Contributor

lowdy1 commented Feb 5, 2026

Since we’re currently focused on the Ascend CI and this kernel is still not functional, I was wondering whether you have the bandwidth to keep working on it. If you’d like, we could also help move it forward.

@Hailey-Zh
Author

> Since we’re currently focused on the Ascend CI and this kernel is still not functional, I was wondering whether you have the bandwidth to keep working on it. If you’d like, we could also help move it forward.

We'll keep working on it.
