[Feature] support qwen3.5 chat template #1657
hhaAndroid wants to merge 7 commits into InternLM:main from
Conversation
@claude review
Claude: ## Summary

Adds Qwen3.5 VL chat template support with a custom tokenization path (

Issues: Critical, Warning, Nit

Verdict: REQUEST_CHANGES
@claude review
if chat_template.default_system is not None:
    if self.messages[0]["role"] == "system":
        self.messages[0]["content"] = chat_template.default_system
    else:
        self.messages.insert(0, {"role": "system", "content": chat_template.default_system})
Claude: Warning: tokenize() still mutates self.messages in place — this was flagged in the previous review and remains unresolved.
When chat_template.default_system is not None and messages[0] is not a system message, self.messages.insert(0, ...) permanently modifies the object's state. If tokenize() is called a second time the inserted system message would already be there and its content would be overwritten rather than inserted again — but the original caller's data has been silently altered.
Suggested fix — operate on a shallow copy:
messages = [m.copy() for m in self.messages]
if chat_template.default_system is not None:
    if messages[0]["role"] == "system":
        messages[0]["content"] = chat_template.default_system
    else:
        messages.insert(0, {"role": "system", "content": chat_template.default_system})
token_ids, label_ids = qwen35_tokenize_fn_fastspeed(
    messages, tokenizer, self.tools, add_vision_id=add_vision_id, return_labels=True
)
Claude: ## Re-review Summary

Thanks for the updates. Several issues from the first review have been addressed.

Remaining Issues: Warning, Nit

Verdict: REQUEST_CHANGES — the in-place mutation bug in
@claude review
encoded = tokenizer(
    result,
    return_offsets_mapping=True,
    add_special_tokens=False,
)
input_ids = encoded["input_ids"]
Claude: Warning: The except Exception catch is overly broad. Per CLAUDE.md: "Never use bare except: — always catch specific exception types." While this is Exception rather than bare except, it still silently swallows errors like TypeError, ValueError, or KeyError that could indicate real bugs in the data or tokenizer usage.
If the intent is to handle tokenizers that don't support return_offsets_mapping, catch the specific exception that raises (e.g., NotImplementedError or TypeError):
    encoded = tokenizer(
        result,
        return_offsets_mapping=True,
        add_special_tokens=False,
    )
    input_ids = encoded["input_ids"]
except (NotImplementedError, TypeError):
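A self-contained sketch of the narrowed fallback (stub tokenizer classes are illustrative, not the real HF classes): only the failure modes that mean "offsets unsupported" trigger the fallback path, while a `KeyError` or `ValueError` from bad data still surfaces as a real bug.

```python
class FastStubTokenizer:
    """Stand-in for a fast tokenizer that supports offset mappings."""

    def __call__(self, text, return_offsets_mapping=False, add_special_tokens=False):
        ids = list(range(len(text.split())))
        out = {"input_ids": ids}
        if return_offsets_mapping:
            out["offset_mapping"] = [(i, i + 1) for i in ids]
        return out


class SlowStubTokenizer:
    """Stand-in for a slow tokenizer: the kwarg does not exist, so passing
    return_offsets_mapping raises TypeError."""

    def __call__(self, text, add_special_tokens=False):
        return {"input_ids": list(range(len(text.split())))}


def encode_with_offsets(tokenizer, text):
    try:
        encoded = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
        return encoded["input_ids"], encoded["offset_mapping"]
    except (NotImplementedError, TypeError):
        # Narrow catch: fall back only when offsets are unsupported.
        encoded = tokenizer(text, add_special_tokens=False)
        return encoded["input_ids"], None


print(encode_with_offsets(FastStubTokenizer(), "a b c"))
print(encode_with_offsets(SlowStubTokenizer(), "a b"))
```

Callers can then branch on `offsets is None` instead of guessing which tokenizer variant they were handed.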
last_query_index = i

# ── Main loop ────────────────────────────────────────────────────────
for idx, message in enumerate(messages):
Claude: Nit: The system message at idx=0 is already rendered in the pre-loop block (lines 141-155), but this line renders it again with do_vision_count=True. The result content is discarded (no branch handles role=="system" and is_first), but _render has a side effect — it increments image_count/video_count via nonlocal.
If a system message ever contains multimodal content (images/videos), these counters would be incremented spuriously, causing wrong counts for subsequent messages.
Consider skipping the render for the already-handled system message:
if role == "system" and is_first:
    continue
content = _render(message["content"], True).strip()
Claude: ## Re-review (Round 3) Summary

Most issues from previous reviews have been addressed — nice work on the cleanup across patches 2-5. The copyright header, unused function, debug prints, and key-extraction simplification are all resolved.

Remaining Issues: Warning, Nit

Verdict: COMMENT — the broad exception catch and missing type hints are worth addressing, but neither is a blocking correctness issue given the current test coverage. The phantom counter increment is theoretical (system messages are text-only in practice) but easy to fix.
As jinja chat templates grow more complex, especially in agent scenarios, the traditional approach of parsing the jinja template ourselves and hand-crafting the loss logic no longer scales and is increasingly hard to maintain.

XTuner therefore adopts a new scheme: instead of parsing the jinja template itself, it first calls tokenizer.apply_chat_template and only adds targeted loss code on top, yielding one generic solution. The new scheme requires a change in data format; the more generic format is compatible with both think and non-think modes, and tool calls are also easy to support.
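The idea above can be sketched as follows. This is an illustrative simplification, not XTuner's actual implementation: the stub tokenizer stands in for a real HF tokenizer (same `apply_chat_template` method name), and the prefix-diffing trick assumes the template renders conversation prefixes consistently, which real templates with generation prompts may not guarantee.

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy loss


class StubTokenizer:
    """Toy stand-in; a real HF tokenizer exposes the same method name."""

    def apply_chat_template(self, messages, tokenize=True):
        ids = []
        for m in messages:
            # Two tokens per message: a role-marker token and a content-length token.
            ids.extend([0, len(m["content"])])
        return ids


def build_labels(tokenizer, messages):
    """Let the template engine do the rendering; only the loss mask is ours."""
    token_ids, labels = [], []
    prev_len = 0
    for i in range(len(messages)):
        # Render the conversation prefix up to and including message i.
        ids = tokenizer.apply_chat_template(messages[: i + 1], tokenize=True)
        new_ids = ids[prev_len:]
        if messages[i]["role"] == "assistant":
            labels.extend(new_ids)  # train only on assistant tokens
        else:
            labels.extend([IGNORE_INDEX] * len(new_ids))
        token_ids.extend(new_ids)
        prev_len = len(ids)
    return token_ids, labels


tok = StubTokenizer()
msgs = [{"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"}]
print(build_labels(tok, msgs))  # ([0, 2, 0, 5], [-100, -100, 0, 5])
```

Because the template itself is never re-parsed, the same loss-masking code works unchanged whether the template emits think blocks, plain replies, or tool-call markup.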