Releases: sgl-project/sgl-kernel-npu
20251128
What's Changed
- Add internode test for deepep by @zuje123 in #193
- Support running normal-mode deepep on a single A2 machine by @luanyundu in #201
- [Test] Testing the generalization of fused moe by @kaniel-outis in #167
- Add whl packages to GitHub Release by @BourneSun0527 in #204
- Add two scripts by @DubhepPan in #119
- Support LongCat on A3 by @luanyundu in #182
- Calculate dispatch normal-mode input parameters on NPU instead of CPU by @lih827 in #177
- Add alloc_extend_kernel by @hw-csong in #196
- Modify deepep README_CN.md by @oagniqgnat in #187
- notify_dispatch kernel: change magic from int32_t to uint64_t by @zuje123 in #202
Full Changelog: 2025112...2025112
20251120
What's Changed
- Dispatch and combine support batch size 4096 on A2 by @ruiqiangworking in #173
- remove redundant check by @ruiqiangworking in #175
- Optimize deepep setup; include CANN version in package name by @zuje123 in #178
- deepep low_latency dispatch/combine support a single A2 server by @zuje123 in #176
- Add README files for mlapo and batch_transpose_matmul by @randgun in #104
- Support devices with different AICore counts (FusedDeepMoe operator) by @wangqiankun13 in #180
- Add triton decode attention kernels by @RuixuanZhang06 in #184
- fix cann version check by @hustmf in #188
- Update the HCCL_BUFFSIZE verification for moe by @goosj in #183
- Fix a bug in the transfer_kv op by @husf1130 in #194
- add_norm_bias and split_qkv_norm_rope for qwen3 by @chenxu140 in #157
- [Chore] Upgrade CANN to 8.3.RC1 by @iforgetmyname in #195
New Contributors
- @hustmf made their first contribution in #188
- @chenxu140 made their first contribution in #157
Full Changelog: 2025111...2025112
20251110
What's Changed
- Added custom low_latency operators for dispatch/combine in the A2 dec… by @oagniqgnat in #166
- deepep support internode api by @zuje123 in #169
- add layout to ops2 directory by @luanyundu in #171
- Modify the deep_ep README and add A2 operator performance data by @oagniqgnat in #168
- feat: add verify_tree_greedy_kernel triton kernel by @ranjiewen in #165
- optimize a2 layered combine kernel code by @ruiqiangworking in #172
- feat: tiny bugfix & performance optimization by @Yael-X in #170
New Contributors
- @ruiqiangworking made their first contribution in #172
Full Changelog: 2025110...2025111
20251106
What's Changed
- Add dependency on the moe header file of CANN by @DubhepPan in #152
- Support small batch sizes (bs = 1 or 2) by @wangyibo1005 in #150
- feat: adapt x86_64 compilation by @Yael-X in #143
- [DFX] Compatible with CANN 8.2 and CANN 8.3 by @kaniel-outis in #158
- add mla_preprocess test script by @LinyuanLi0046 in #153
- [DFX] adapt CANN 8.3 by @kaniel-outis in #159
- [bugfix] swiglu quant by @Liwansi in #162
- [New Ops] build tree efficient by @hw-csong in #161
- support shallow fused topk=-1 by @wangyibo1005 in #160
- support kvcacheio by @husf1130 in #163
- improve layout kernel on a2 by @luanyundu in #164
New Contributors
- @DubhepPan made their first contribution in #152
- @Liwansi made their first contribution in #162
- @hw-csong made their first contribution in #161
Full Changelog: 2025103...2025110
20251030
What's Changed
- add a2 dispatch layout and update its test by @luanyundu in #149
- support topk=-1 by @wangyibo1005 in #132
- add env to decide whether send out prefix sum or not by @luanyundu in #151
- refactor: make hiddenStateDim a class member in MlaTilingData, follow-up to closed PR #82 by @LinyuanLi0046 in #133
- support cachemode int8_nzcache with bf16 in mla_preprocess by @LinyuanLi0046 in #135
- add op transfer_kv_dim_exchange by @husf1130 in #148
- impl fused_swiglu_quant with group_list for deepep-low-latency by @xiaobaicxy in #155
- [Kernel] add Flash-Linear-Attention/layernorm_gated Triton op by @iforgetmyname in #154
New Contributors
- @LinyuanLi0046 made their first contribution in #133
- @husf1130 made their first contribution in #148
- @xiaobaicxy made their first contribution in #155
Full Changelog: 2025102...2025103
20251023
What's Changed
- Change the padding generation from randperm back to arange by @oagniqgnat in #140
- LoRA: moving kernels from vllm-ascend repo by @vlserov in #128
- Update README.md of DeepEp by @goosj in #144
Full Changelog: 2025102...2025102
20251022
What's Changed
- Update README.md: add performance of normal and low latency dispatch/combine by @oagniqgnat in #106
- Support debug info for build by @jia-rundong in #99
- Update README by @oagniqgnat in #115
- Synchronous fusion moe by @kaniel-outis in #108
- Fix the severe performance degradation issue of the top9 dispatch in normal mode compared to top8. by @oagniqgnat in #117
- feat: add moe fused operator test draft by @Yael-X in #120
- mlapo fit different hidden state dim by @Todobe in #82
- Not use download.pytorch.org by @jia-rundong in #121
- EPLB for fused_deep_moe by @wangyibo1005 in #116
- [FusedDeepMoe] Support EPLB by @kaniel-outis in #118
- Support different token hidden sizes and gmm hidden sizes [FusedDeepMoe Operator] by @wangqiankun13 in #123
- Remove leftover unused code [FusedDeepMoe Operator] by @wangqiankun13 in #129
- update qwen3-next performance kernels by @iforgetmyname in #130
- [Bugfix] Remove unused code that causes split failure in Qwen3-Next by @iforgetmyname in #142
New Contributors
- @Todobe made their first contribution in #82
- @wangqiankun13 made their first contribution in #123
Full Changelog: 2025092...2025102