[Qwen3-next] radix cache v2 for qwen3-next #14792
/tag-and-rerun-ci
mask = (req.extend_input_len // FLA_CHUNK_SIZE) * FLA_CHUNK_SIZE > 0
mamba_track_mask_cpu.append(mask)
mamba_track_indices_cpu.append(
    req.mamba_ping_pong_track_buffer[req.mamba_next_track_idx].item()
Will this cause a device/host sync here?
yes
Will it break the overlap schedule?
swa_full_tokens_ratio: float = 0.8
disable_hybrid_swa_memory: bool = False
radix_eviction_policy: str = "lru"
mamba_track_interval: int = 256
Duplicate definition.
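For context on why a duplicate matters: assuming the args class is a Python dataclass (an assumption here; `DemoArgs` below is a hypothetical stand-in, not the project's class), a repeated field does not raise an error. The later default silently wins, so the two definitions can drift apart unnoticed. A minimal sketch:

```python
from dataclasses import dataclass, fields


# Hypothetical stand-in for a server-args dataclass; not the project's class.
@dataclass
class DemoArgs:
    radix_eviction_policy: str = "lru"
    mamba_track_interval: int = 256
    disable_hybrid_swa_memory: bool = False
    # Accidental second definition: no error is raised, the later default wins.
    mamba_track_interval: int = 128


field_names = [f.name for f in fields(DemoArgs)]
# The field is registered only once, despite being written twice.
duplicate_count = field_names.count("mamba_track_interval")
# The default that instances actually receive comes from the last definition.
effective_default = DemoArgs().mamba_track_interval
```

Keeping a single definition avoids the case where someone edits the first default and sees no change in behavior.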
cached_tokens = result["meta_info"]["cached_tokens"]
if cache_hit:
    assert (
        cached_tokens > 0
I think the tree shape in the test could be more complex, and we can predict cached_tokens directly.
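A rough sketch of how a test could compute the expected `cached_tokens` instead of only asserting that it is positive. This is a toy model: it assumes a hit equals the longest token prefix shared with any earlier request, rounded down to a tracking interval, and it ignores whether a Mamba state actually exists at that point. The helper names and the interval of 4 are hypothetical.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def expected_cached_tokens(
    prompt: Sequence[int], history: List[Sequence[int]], interval: int
) -> int:
    """Toy prediction: best shared prefix, rounded down to the interval."""
    best = max((common_prefix_len(prompt, h) for h in history), default=0)
    return (best // interval) * interval


# Earlier requests that form a small tree with two branch points.
history = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 20, 21, 22, 23],
    [1, 2, 30, 31],
]

# Shares 9 tokens with the first request, rounded down to 8.
hit_deep = expected_cached_tokens([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], history, 4)
# Shares 6 tokens with the second request, rounded down to 4.
hit_branch = expected_cached_tokens([1, 2, 3, 4, 20, 21, 50], history, 4)
# Shares nothing with any earlier request.
miss = expected_cached_tokens([7, 8], history, 4)
```

A test built this way could compare the server-reported value against the prediction for each request, once the real rounding and state-availability rules are substituted for the toy ones.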
self.extend_logprob_start_lens = [r.extend_logprob_start_len for r in reqs]
self.extend_input_logprob_token_ids = extend_input_logprob_token_ids

if get_global_server_args().enable_mamba_radix_cache_v2:
I think this can also be wrapped in _mamba_radix_cache_v2_prepare_for_extend
# copy mamba state to req local space if cow is true
if cow_mamba and last_node.mamba_value is not None:
    assert req.req_pool_idx is None  # req_pool_idx is uninitialized
remove this assertion
# does not have a mamba value.
if len(value) > best_value_len:
    fla_chunk_aligned_seqlen = (
        sum(len(v) for v in value) // FLA_CHUNK_SIZE
Why don't we use (len(value) // FLA_CHUNK_SIZE) * FLA_CHUNK_SIZE directly here?
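One possible reading of this question, sketched under the assumption that `value` is a list of per-node index segments rather than a flat sequence: `len(value)` would then count segments, while the summed form counts tokens, and the two round to different chunk boundaries. `FLA_CHUNK_SIZE = 64` and the helper name `chunk_aligned_len` are illustrative assumptions, not taken from the PR.

```python
FLA_CHUNK_SIZE = 64  # assumed value, for illustration only


def chunk_aligned_len(num_tokens: int, chunk_size: int = FLA_CHUNK_SIZE) -> int:
    """Round num_tokens down to a multiple of chunk_size."""
    return (num_tokens // chunk_size) * chunk_size


# Assumption: `value` holds one index segment per matched tree node.
value = [list(range(100)), list(range(50)), list(range(30))]

# 180 tokens spread across 3 segments.
total_tokens = sum(len(v) for v in value)
# Aligning the token count keeps two full chunks.
aligned_by_tokens = chunk_aligned_len(total_tokens)
# Aligning len(value) uses the segment count (3) and rounds to zero.
aligned_by_segments = chunk_aligned_len(len(value))
```

If `value` is in fact a single flat sequence, the two forms agree and the simpler expression would do.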
# to retrieve its state from h. Adding 1 will give us the correct index in h,
# otherwise the calculation will retrieve the state from the last_recurrent_state,
# which is not correct.
mamba_track_seqlen = req.mamba_branching_seqlen + 1
Why do we need the +1 at a branching point but not at a non-branching point?
self.last_node,
self.last_host_node,
self.host_hit_length,
self.mamba_branching_seqlen,
I think that when the radix cache is disabled, this will cause an error because of the extra input/output.

Motivation
Modifications
--enable-mamba-radix-cache-v2
Accuracy Tests
--enable-mamba-radix-cache-v2 on and off
Benchmarking and Profiling
Checklist