Skip to content

[Feature] support qwen3.5 chat template#1657

Open
hhaAndroid wants to merge 7 commits intoInternLM:mainfrom
hhaAndroid:qwen35_orig_jinja
Open

[Feature] support qwen3.5 chat template#1657
hhaAndroid wants to merge 7 commits intoInternLM:mainfrom
hhaAndroid:qwen35_orig_jinja

Conversation

@hhaAndroid
Copy link
Copy Markdown
Collaborator

@hhaAndroid hhaAndroid commented Apr 7, 2026

随着 jinja 的 chat template 的复杂度逐渐增加,特别是 Agent 场合,传统的自己解析 jinja 模板并针对性设计 loss 的做法越来越无法满足需求,可维护性越来越差。

因此 XTuner 采用全新方案,不再自己解析 jinja 模板,而是优先调用 tokenizer.apply_chat_template 并针对性新增 loss 代码,从而实现一套通用的方案。

采用了新的方案后,数据格式需要改动,采用更通用的格式可以兼容 think 和 非 think 模式,同时 toolcall 也可以方便支持。

@hhaAndroid
Copy link
Copy Markdown
Collaborator Author

@claude review

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 8, 2026

Claude: ## Summary

Adds Qwen3.5 VL chat template support with a custom tokenization path (Qwen35ChatMessages, qwen35_tokenize_fn_fastspeed), including reasoning/thinking block handling, tool call formatting (Qwen3.5 XML style), and multimodal (image/video) token replacement. Also extends base_mllm_tokenize_fn to accept the new "image" / "video" content type keys alongside existing "image_url" / "video_url".

Issues

Critical

  • qwen35_chat.py:297-299 — Assertion message is a plain string, not an f-string. {len(text)} will be printed literally on failure instead of showing the actual lengths.

Warning

  • qwen35_chat.py:408-414tokenize() mutates self.messages in place (inserts/replaces system message). Calling it twice on the same object will produce incorrect results.
  • qwen35_chat.py:396Qwen35ChatMessages extends BaseModel directly instead of BaseMessages, breaking the interface contract for the messages module.
  • qwen35_chat.py — Multiple public functions missing return type annotations (project requires type hints per CLAUDE.md).
  • qwen35_chat.py:295qwen35_process_text_and_loss_mask appears unused and duplicates logic already in qwen35_tokenize_fn_fastspeed.

Nit

  • Missing copyright header in qwen35_chat.py
  • Debug print statement left in test_qwen35_vl_tokenize_fn.py:249
  • base_mllm_tokenize_fn.py — repeated key-extraction pattern for "image_url"/"image" and "video_url"/"video" could be simplified

Verdict

REQUEST_CHANGES

@hhaAndroid
Copy link
Copy Markdown
Collaborator Author

@claude review

Comment on lines +397 to +401
if chat_template.default_system is not None:
if self.messages[0]["role"] == "system":
self.messages[0]["content"] = chat_template.default_system
else:
self.messages.insert(0, {"role": "system", "content": chat_template.default_system})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Warning: tokenize() still mutates self.messages in place — this was flagged in the previous review and remains unresolved.

When chat_template.default_system is not None and messages[0] is not a system message, self.messages.insert(0, ...) permanently modifies the object's state. If tokenize() is called a second time the inserted system message would already be there and its content would be overwritten rather than inserted again — but the original caller's data has been silently altered.

Suggested fix — operate on a shallow copy:

messages = [m.copy() for m in self.messages]
if chat_template.default_system is not None:
    if messages[0]["role"] == "system":
        messages[0]["content"] = chat_template.default_system
    else:
        messages.insert(0, {"role": "system", "content": chat_template.default_system})

token_ids, label_ids = qwen35_tokenize_fn_fastspeed(
    messages, tokenizer, self.tools, add_vision_id=add_vision_id, return_labels=True
)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude 不需要考虑这种情况

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 8, 2026

Claude: ## Re-review Summary

Thanks for the updates. Several issues from the first review have been addressed:

  • Fixed: Copyright header added (patch 4)
  • Fixed: Unused qwen35_process_text_and_loss_mask removed (patch 4)
  • Fixed: Assertion f-string bug resolved (function removed)
  • Fixed: base_mllm_tokenize_fn.py key-extraction simplified (patch 4)
  • Acknowledged: Qwen35ChatMessages not extending BaseMessages — comment added explaining the intentional design decision

Remaining Issues

Warning

  • tokenize() mutates self.messages in place (posted inline comment). When default_system is set, the method inserts/replaces system messages on the original list. Work on a copy instead.

Nit

  • Debug print at test_qwen35_vl_tokenize_fn.py:249print(pixel_values_xtuner.size(), image_grid_thw_xtuner, text.count('seconds>'), 'xxx') should be removed.
  • Missing return type hints on public functionsget_offset_mapping, render_content, qwen35_tokenize_fn_fastspeed, qwen35_tokenize_fn_slowspeed still lack annotations per project standards.

Verdict

REQUEST_CHANGES — the in-place mutation bug in tokenize() should be fixed before merge.

@hhaAndroid
Copy link
Copy Markdown
Collaborator Author

@claude review

Comment on lines +275 to +280
encoded = tokenizer(
result,
return_offsets_mapping=True,
add_special_tokens=False,
)
input_ids = encoded["input_ids"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Warning: The except Exception catch is overly broad. Per CLAUDE.md: "Never use bare except: — always catch specific exception types." While this is Exception rather than bare except, it still silently swallows errors like TypeError, ValueError, or KeyError that could indicate real bugs in the data or tokenizer usage.

If the intent is to handle tokenizers that don't support return_offsets_mapping, catch the specific exception that raises (e.g., NotImplementedError or TypeError):

Suggested change
encoded = tokenizer(
result,
return_offsets_mapping=True,
add_special_tokens=False,
)
input_ids = encoded["input_ids"]
except (NotImplementedError, TypeError):

last_query_index = i

# ── 主循环 ────────────────────────────────────────────────────────────
for idx, message in enumerate(messages):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Nit: The system message at idx=0 is already rendered in the pre-loop block (lines 141-155), but this line renders it again with do_vision_count=True. The result content is discarded (no branch handles role=="system" and is_first), but _render has a side effect — it increments image_count/video_count via nonlocal.

If a system message ever contains multimodal content (images/videos), these counters would be incremented spuriously, causing wrong counts for subsequent messages.

Consider skipping the render for the already-handled system message:

if role == "system" and is_first:
    continue
content = _render(message["content"], True).strip()

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 8, 2026

Claude: ## Re-review (Round 3) Summary

Most issues from previous reviews have been addressed — nice work on the cleanup across patches 2-5. The copyright header, unused function, debug prints, and key-extraction simplification are all resolved.

Remaining Issues

Warning

  • qwen35_chat.py:275-280except Exception is too broad; should catch the specific exception type for tokenizers that don't support return_offsets_mapping.
  • qwen35_chat.py — Public functions still missing return type annotations (carried over from prior review).

Nit

  • qwen35_chat.py:175 — First system message is rendered twice (pre-loop + main loop) with different do_vision_count flags, causing phantom side effects on image/video counters.

Verdict

COMMENT — the broad exception catch and missing type hints are worth addressing, but neither is a blocking correctness issue given the current test coverage. The phantom counter increment is theoretical (system messages are text-only in practice) but easy to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant