[Qwen3-next] radix cache v2 for qwen3-next #14792

hanming-lu · 2025-12-10T06:27:49Z

Motivation

Currently, qwen3-next supports neither the overlap scheduler nor branching-point caching.

Modifications

1. Support the overlap scheduler for qwen3-next.
2. Support branching-point caching for qwen3-next.
3. Tested for (ps = 1, ps > 1) x (non-spec dec, sd topk = 1, sd topk > 1). All combinations work except ps > 1 with sd topk > 1, which is not supported on main yet.
4. Enable 1) and 2) with --enable-mamba-radix-cache-v2.
5. Better memory allocation for SSM spec dec: instead of coupling the spec dec intermediate state size with the total number of SSM states, couple it with the max running requests.
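To make the memory-allocation change concrete, here is a minimal sizing sketch. It assumes the intermediate buffer scales as bytes per state times draft tokens times the number of slots; the class, field, and function names are hypothetical and do not come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecStateSizing:
    """Hypothetical sizing inputs; none of these names come from the PR."""

    bytes_per_state: int       # bytes needed for one SSM state
    draft_tokens: int          # speculative draft tokens per request
    total_ssm_states: int      # all state slots in the pool (cached + running)
    max_running_requests: int  # upper bound on concurrently running requests


def intermediate_bytes_coupled_to_pool(s: SpecStateSizing) -> int:
    # Old coupling: one intermediate slot per SSM state in the whole pool.
    return s.bytes_per_state * s.draft_tokens * s.total_ssm_states


def intermediate_bytes_coupled_to_running(s: SpecStateSizing) -> int:
    # New coupling: one intermediate slot per request that can run at once.
    return s.bytes_per_state * s.draft_tokens * s.max_running_requests


sizing = SpecStateSizing(
    bytes_per_state=1024,
    draft_tokens=4,
    total_ssm_states=1000,
    max_running_requests=100,
)
print(intermediate_bytes_coupled_to_pool(sizing))
print(intermediate_bytes_coupled_to_running(sizing))
```

Under these made-up numbers the second figure is smaller by the ratio of pool size to max running requests, which is the saving the last item describes.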

Accuracy Tests

Added mamba radix cache KL tests for prefill and decode.
The tests cover both --enable-mamba-radix-cache-v2 on and off.

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-10T06:27:52Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

python/sglang/srt/managers/scheduler_output_processor_mixin.py

hanming-lu · 2025-12-11T01:30:40Z

/tag-and-rerun-ci

yizhang2077 · 2025-12-10T14:40:13Z

python/sglang/srt/managers/schedule_batch.py

+                mask = (req.extend_input_len // FLA_CHUNK_SIZE) * FLA_CHUNK_SIZE > 0
+                mamba_track_mask_cpu.append(mask)
+                mamba_track_indices_cpu.append(
+                    req.mamba_ping_pong_track_buffer[req.mamba_next_track_idx].item()


Will this cause a device/host sync here?

Will it break the overlap scheduler?

I checked the profile. Even before this change, some functions already trigger a sync, such as alloc_for_extend(). The new one (the one marked with the arrow) is small compared to the existing ones, so it has no impact.
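For reference, the mask in the quoted diff only asks whether the extend length contains at least one full chunk. The sketch below restates it in plain Python; the value 64 for FLA_CHUNK_SIZE is an assumption for illustration, and the helper names are hypothetical. The `.item()` call in the diff is a separate matter: on a device tensor it copies a scalar back to the host, which is the sync being discussed.

```python
FLA_CHUNK_SIZE = 64  # assumed value, for illustration only


def chunk_aligned_len(n: int, chunk: int = FLA_CHUNK_SIZE) -> int:
    """Round n down to the nearest multiple of chunk."""
    return (n // chunk) * chunk


def should_track(extend_input_len: int, chunk: int = FLA_CHUNK_SIZE) -> bool:
    """Same expression as the diff: true once one full chunk is present."""
    return chunk_aligned_len(extend_input_len, chunk) > 0


# For non-negative lengths the mask is equivalent to a simple comparison.
for n in range(0, 4 * FLA_CHUNK_SIZE):
    assert should_track(n) == (n >= FLA_CHUNK_SIZE)
```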

python/sglang/srt/managers/schedule_batch.py


yizhang2077 · 2025-12-11T09:46:45Z

python/sglang/srt/server_args.py

    swa_full_tokens_ratio: float = 0.8
    disable_hybrid_swa_memory: bool = False
    radix_eviction_policy: str = "lru"
+    mamba_track_interval: int = 256


Duplicate definition.

yizhang2077 · 2025-12-11T09:58:35Z

test/srt/models/test_qwen3_next_models.py

+            cached_tokens = result["meta_info"]["cached_tokens"]
+            if cache_hit:
+                assert (
+                    cached_tokens > 0


I think the tree in the test can have a more complex shape, and we can predict cached_tokens directly.
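One way to make cached_tokens predictable in a test is to compute the longest shared prefix between the new prompt and the prompts already served. This is only a sketch: the function names and the `align` parameter are hypothetical, and the real cache may match at coarser points (for example tracked or branching positions), so the prediction would need to follow the actual caching rule.

```python
from typing import Iterable, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def predicted_cached_tokens(
    served_prompts: Iterable[Sequence[int]],
    new_prompt: Sequence[int],
    align: int = 1,  # hypothetical granularity at which prefixes are cached
) -> int:
    """Longest shared prefix with any served prompt, rounded down to align."""
    best = max(
        (common_prefix_len(p, new_prompt) for p in served_prompts), default=0
    )
    return (best // align) * align


served = [[1, 2, 3, 4, 5], [1, 2, 9]]
print(predicted_cached_tokens(served, [1, 2, 3, 4, 7, 8]))
```

A test could then assert equality against this prediction instead of only checking cached_tokens > 0.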

yizhang2077 · 2025-12-11T11:25:48Z

python/sglang/srt/managers/schedule_batch.py

        self.extend_logprob_start_lens = [r.extend_logprob_start_len for r in reqs]
        self.extend_input_logprob_token_ids = extend_input_logprob_token_ids

+        if get_global_server_args().enable_mamba_radix_cache_v2:


I think this can also be wrapped in _mamba_radix_cache_v2_prepare_for_extend

yizhang2077 · 2025-12-11T11:32:23Z

python/sglang/srt/mem_cache/mamba_radix_cache.py


        # copy mamba state to req local space if cow is true
        if cow_mamba and last_node.mamba_value is not None:
+            assert req.req_pool_idx is None  # req_pool_idx is uninitialed


remove this assertion

yizhang2077 · 2025-12-11T11:37:28Z

python/sglang/srt/mem_cache/mamba_radix_cache.py

+        # does not have a mamba value.
+        if len(value) > best_value_len:
+            fla_chunk_aligned_seqlen = (
+                sum(len(v) for v in value) // FLA_CHUNK_SIZE


Why don't we use (len(value) // FLA_CHUNK_SIZE) * FLA_CHUNK_SIZE directly here?
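The two expressions in this question differ only if value is a list of segments rather than a flat token sequence, which the quoted `sum(len(v) for v in value)` suggests but the excerpt does not confirm. Under that assumption, and with an assumed FLA_CHUNK_SIZE of 64, a small sketch shows the gap:

```python
FLA_CHUNK_SIZE = 64  # assumed value, for illustration only


def chunk_aligned(n: int, chunk: int = FLA_CHUNK_SIZE) -> int:
    """Round n down to the nearest multiple of chunk."""
    return (n // chunk) * chunk


# Hypothetical matched value: two segments holding 100 and 50 token indices.
value = [list(range(100)), list(range(50))]

num_segments = len(value)               # counts segments
num_tokens = sum(len(v) for v in value)  # counts tokens across segments

print(chunk_aligned(num_segments))
print(chunk_aligned(num_tokens))
```

If value were a single flat sequence, len(value) would already be the token count and the simpler expression would suffice.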

yizhang2077 · 2025-12-11T11:42:13Z

python/sglang/srt/managers/schedule_batch.py

+                    # to retrieve its state from h. Adding 1 will give us the correct index in h,
+                    # otherwise the calculation will retrieve the state from the last_recurrent_state,
+                    # which is not correct.
+                    mamba_track_seqlen = req.mamba_branching_seqlen + 1


Why do we need the +1 at a branching point when we do not need it at a non-branching point?

yizhang2077 · 2025-12-11T11:59:54Z

python/sglang/srt/managers/schedule_batch.py

                self.last_node,
                self.last_host_node,
                self.host_hit_length,
+                self.mamba_branching_seqlen,


I think this will cause an error for the extra input/output when the radix cache is disabled.

hanming-lu added 7 commits December 8, 2025 20:36

wip

2f74618

wip

311f582

radix cache file done

7c8fa41

memory_pool done

0fd76d2

forward batch info and cuda graph runner done

11c08a6

model runner done

7f6b177

kernel done; hybrid linear backend done; eagle done; output process done

554f655

github-actions bot added the npu label Dec 10, 2025

hanming-lu added 3 commits December 10, 2025 06:32

schedule batch done; can start testing

42f35b3

TestQwen3Next passing

c63ff7e

add branching test; need to fix branching

cdd6123

yizhang2077 reviewed Dec 10, 2025

python/sglang/srt/managers/scheduler_output_processor_mixin.py (outdated)

hanming-lu and others added 4 commits December 10, 2025 21:15

qwen3-next fully working

5083e92

address comment

58c53c5

Merge branch 'main' into hanming/mamba-cache-p1

6e8764c

ready for ci; qwen3-next, kimi linear, and nemotron tests passed

70e42af

hanming-lu changed the title ~~[Qwen3-next] Prefix cache for qwen3-next~~ [Qwen3-next] radix cache v2 for qwen3-next Dec 11, 2025

Merge branch 'main' into hanming/mamba-cache-p1

4f87145

hanming-lu marked this pull request as ready for review December 11, 2025 01:30

hanming-lu requested review from Fridge003, Ying1123, hebiao064, hnyls2002, ispobock, merrymercy, xiezhq-hermann and zhyncs as code owners December 11, 2025 01:30

github-actions bot added the run-ci label Dec 11, 2025

yizhang2077 reviewed Dec 11, 2025

runsuite

8e731fa

yizhang2077 reviewed Dec 11, 2025

python/sglang/srt/managers/schedule_batch.py

hanming-lu and others added 7 commits December 11, 2025 05:28

comment

15261df

Merge branch 'main' into hanming/mamba-cache-p1

2cb48a3

default v1

480232e

fix

49c826d

fix

7d1c8ac

Merge branch 'main' into hanming/mamba-cache-p1

8f6507b

Merge branch 'hanming/mamba-cache-p1' of https://github.com/sgl-proje…

feb9930

…ct/sglang into hanming/mamba-cache-p1

yizhang2077 reviewed Dec 11, 2025
