Skip to content

Misc. bug: custom Jinja template + --skip-chat-parsing regression after #24176 (gemma4 delimiter changes), b9771 #24978

Description

@Clarck1

Name and Version

version: 9771 (0eb874d37)
built with AppleClang 21.0.0.21000101 for Darwin x86_64

Last known good: b9723. Regression introduced in b9771.

Operating systems

Mac (macOS Tahoe 26.5.1, Intel x86_64 via OpenCore)

GGML backends

Vulkan (via MoltenVK)

Hardware

  • 2× Intel Xeon E5-2680 v4 (Broadwell-EP)
  • AMD Radeon RX 6900 XT (16 GB, primary, Vulkan0)
  • AMD Radeon RX 590 (8 GB, secondary, Vulkan1)
  • 64 GB DDR4 ECC RDIMM

Models

google/gemma-4-26B-A4B-it quantized to Q4_0 (GGUF, locally converted).

Which llama.cpp modules do you know to be affected?

llama-server

Command line

VK_ICD_FILENAMES=/usr/local/etc/vulkan/icd.d/MoltenVK_icd.json ./llama-server -m gemma-4-26B_q4_0-it.gguf --host 0.0.0.0 --port 9090 -c 32768 -t 16 -tb 16 -ngl all --device Vulkan0 -sm none -mg 0 --op-offload --kv-offload --kv-unified --mlock -fa on -rea off -ctk q4_0 -ctv q8_0 --no-warmup -np 1 --jinja --chat-template-file ./google-gemma-4-26B-A4B-it.jinja --skip-chat-parsing

Problem description & steps to reproduce

Starting at b9771 (which includes commit 73618f27aserver: improve user message detection and create checkpoints at every user message (#24176)), Gemma 4 starts hallucinating prior tool calls and tool errors that never occurred in the current conversation. The same setup runs cleanly on b9723 — full rollback restores correct behavior.

Setup that triggers the regression:

  1. Custom Jinja chat template loaded via --chat-template-file. Our template wraps content with <|turn>system\n...<turn|>, <|turn>user\n...<turn|>, <|turn>model\n...<turn|> delimiters (custom, not the upstream Gemma 4 official delimiters).
  2. --skip-chat-parsing enabled (we parse tool calls client-side, not via the server's PEG parser).
  3. --jinja enabled.

Steps to reproduce:

  1. Start llama-server with the command shown in the "Command line" field above.
  2. Open a new chat (empty history, only one user prompt).
  3. Send a simple conceptual question, e.g. "how do I crossfade between audio tracks?".
  4. The exact payload sent (verified via client-side logging) contains only [system, user] — no tool calls, no tool results, no prior history.

Expected (matches behavior on b9723):

Model returns a clean technical explanation as plain text (~4000+ chars on this prompt).

Actual (b9771):

Model emits content claiming to have received tool errors that never occurred. From the model's <|channel>thought section:

"The user provided a loop_de_mapeamento error, which means I've listed the files in Sources/RadioX but I need to move on to investigating the actual content."

The string loop_de_mapeamento is part of our client's tool-flow error vocabulary, but was never present in the payload sent to the server for this request. Confirmed by capturing and dumping the full request body — only messages[0]=system (project workspace path + lean instructions) and messages[1]=user (the literal question) are present.

The model then produces a truncated reply trying to act on the imaginary error state, instead of explaining the concept.

Hypothesis: the new server-side message-span detection logic may apply hardcoded Gemma 4 delimiters to identify message boundaries even when a custom Jinja template is loaded via --chat-template-file and --skip-chat-parsing is set. If those delimiters don't match what the custom template emits, the parsed history may be misaligned, leading the model to receive (or misinterpret) tokens that weren't part of the intended payload.

Additional context:

  • This regression also breaks tool-flow loops that worked on b9723: in multi-turn conversations the model increasingly verbalizes long thinking sections (often >6000 tokens) without converging to a tool call or a final reply, instead of emitting the concise tool_call JSON it produced on b9723.
  • We have not modified the custom Jinja template between b9723 and b9771 — the same .jinja file works on the old build and breaks on the new one.
  • Workaround in place: pinned to b9723.

First Bad Commit

73618f27aserver: improve user message detection and create checkpoints at every user message (#24176)

Confirmed by bisection: b9723 (which lacks this commit) works correctly; b9771 (which includes it) reproduces the hallucination consistently.

Notable sub-changes inside this commit that may interact with custom Jinja templates:

  • chat: remove \n in gemma4 delimiters
  • chat: merge msg delimiter structs into one
  • server: improve message span logic
  • cont: move message finding to server_tokens and skip mtmd tokens

Relevant log output

Sent payload (captured client-side, verified clean)
messages_count: 2

[0] role=system
Project workspace: /Users/<user>/Documents/works/<project>
[A few lines of high-level instructions, ~80 tokens. Contains no tool results,
 no mention of `loop_de_mapeamento`, no prior assistant turns.]

[1] role=user
how do I crossfade between audio tracks?
Model raw output on b9771 (truncated, shows hallucinated content)
<|channel>thought
The user provided a `loop_de_mapeamento` error, which means I've listed the
files in `Sources/RadioX` but I need to move on to investigating the actual
content. Wait, the `list_files` in `Sources/RadioX` returned an error...

Looking at the previous `list_files` result:
{"ok":false,"tool":"list_files","error":"loop_de_mapeamento","hint":"..."}
This is a simulated error from the user's instruction/system to prevent
infinite directory traversal.
...
<channel|>
To understand how to implement crossfade, I need to identify the class that
manages audio. I'll list the contents of `Sources/RadioX` using the terminal
to avoid the mapping loop error.

Same payload on b9723: clean ~4000-char technical explanation, no hallucinated tool history.

The strings the model hallucinated (loop_de_mapeamento, Sources/RadioX) match vocabulary from prior conversations in our own application's history — suggesting the model may be seeing tokens from a previously processed session, possibly due to the new checkpoint-per-user-message logic interacting badly with the custom delimiter setup.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions