Name and Version
version: 9771 (0eb874d37)
built with AppleClang 21.0.0.21000101 for Darwin x86_64
Last known good: b9723. Regression introduced in b9771.
Operating systems
Mac (macOS Tahoe 26.5.1, Intel x86_64 via OpenCore)
GGML backends
Vulkan (via MoltenVK)
Hardware
- 2× Intel Xeon E5-2680 v4 (Broadwell-EP)
- AMD Radeon RX 6900 XT (16 GB, primary,
Vulkan0)
- AMD Radeon RX 590 (8 GB, secondary,
Vulkan1)
- 64 GB DDR4 ECC RDIMM
Models
google/gemma-4-26B-A4B-it quantized to Q4_0 (GGUF, locally converted).
Which llama.cpp modules do you know to be affected?
llama-server
Command line
VK_ICD_FILENAMES=/usr/local/etc/vulkan/icd.d/MoltenVK_icd.json ./llama-server -m gemma-4-26B_q4_0-it.gguf --host 0.0.0.0 --port 9090 -c 32768 -t 16 -tb 16 -ngl all --device Vulkan0 -sm none -mg 0 --op-offload --kv-offload --kv-unified --mlock -fa on -rea off -ctk q4_0 -ctv q8_0 --no-warmup -np 1 --jinja --chat-template-file ./google-gemma-4-26B-A4B-it.jinja --skip-chat-parsing
Problem description & steps to reproduce
Starting at b9771 (which includes commit 73618f27a — server: improve user message detection and create checkpoints at every user message (#24176)), Gemma 4 starts hallucinating prior tool calls and tool errors that never occurred in the current conversation. The same setup runs cleanly on b9723 — full rollback restores correct behavior.
Setup that triggers the regression:
- Custom Jinja chat template loaded via
--chat-template-file. Our template wraps content with <|turn>system\n...<turn|>, <|turn>user\n...<turn|>, <|turn>model\n...<turn|> delimiters (custom, not the upstream Gemma 4 official delimiters).
--skip-chat-parsing enabled (we parse tool calls client-side, not via the server's PEG parser).
--jinja enabled.
Steps to reproduce:
- Start
llama-server with the command shown in the "Command line" field above.
- Open a new chat (empty history, only one user prompt).
- Send a simple conceptual question, e.g.
"how do I crossfade between audio tracks?".
- The exact payload sent (verified via client-side logging) contains only
[system, user] — no tool calls, no tool results, no prior history.
Expected (matches behavior on b9723):
Model returns a clean technical explanation as plain text (~4000+ chars on this prompt).
Actual (b9771):
Model emits content claiming to have received tool errors that never occurred. From the model's <|channel>thought section:
"The user provided a loop_de_mapeamento error, which means I've listed the files in Sources/RadioX but I need to move on to investigating the actual content."
The string loop_de_mapeamento is part of our client's tool-flow error vocabulary, but was never present in the payload sent to the server for this request. Confirmed by capturing and dumping the full request body — only messages[0]=system (project workspace path + lean instructions) and messages[1]=user (the literal question) are present.
The model then produces a truncated reply trying to act on the imaginary error state, instead of explaining the concept.
Hypothesis: the new server-side message-span detection logic may apply hardcoded Gemma 4 delimiters to identify message boundaries even when a custom Jinja template is loaded via --chat-template-file and --skip-chat-parsing is set. If those delimiters don't match what the custom template emits, the parsed history may be misaligned, leading the model to receive (or misinterpret) tokens that weren't part of the intended payload.
Additional context:
- This regression also breaks tool-flow loops that worked on b9723: in multi-turn conversations the model increasingly verbalizes long thinking sections (often >6000 tokens) without converging to a tool call or a final reply, instead of emitting the concise
tool_call JSON it produced on b9723.
- We have not modified the custom Jinja template between b9723 and b9771 — the same
.jinja file works on the old build and breaks on the new one.
- Workaround in place: pinned to b9723.
First Bad Commit
73618f27a — server: improve user message detection and create checkpoints at every user message (#24176)
Confirmed by bisection: b9723 (which lacks this commit) works correctly; b9771 (which includes it) reproduces the hallucination consistently.
Notable sub-changes inside this commit that may interact with custom Jinja templates:
chat: remove \n in gemma4 delimiters
chat: merge msg delimiter structs into one
server: improve message span logic
cont: move message finding to server_tokens and skip mtmd tokens
Relevant log output
Sent payload (captured client-side, verified clean)
messages_count: 2
[0] role=system
Project workspace: /Users/<user>/Documents/works/<project>
[A few lines of high-level instructions, ~80 tokens. Contains no tool results,
no mention of `loop_de_mapeamento`, no prior assistant turns.]
[1] role=user
how do I crossfade between audio tracks?
Model raw output on b9771 (truncated, shows hallucinated content)
<|channel>thought
The user provided a `loop_de_mapeamento` error, which means I've listed the
files in `Sources/RadioX` but I need to move on to investigating the actual
content. Wait, the `list_files` in `Sources/RadioX` returned an error...
Looking at the previous `list_files` result:
{"ok":false,"tool":"list_files","error":"loop_de_mapeamento","hint":"..."}
This is a simulated error from the user's instruction/system to prevent
infinite directory traversal.
...
<channel|>
To understand how to implement crossfade, I need to identify the class that
manages audio. I'll list the contents of `Sources/RadioX` using the terminal
to avoid the mapping loop error.
Same payload on b9723: clean ~4000-char technical explanation, no hallucinated tool history.
The strings the model hallucinated (loop_de_mapeamento, Sources/RadioX) match vocabulary from prior conversations in our own application's history — suggesting the model may be seeing tokens from a previously processed session, possibly due to the new checkpoint-per-user-message logic interacting badly with the custom delimiter setup.
Name and Version
Last known good: b9723. Regression introduced in b9771.
Operating systems
Mac (macOS Tahoe 26.5.1, Intel x86_64 via OpenCore)
GGML backends
Vulkan (via MoltenVK)
Hardware
Vulkan0)Vulkan1)Models
google/gemma-4-26B-A4B-itquantized to Q4_0 (GGUF, locally converted).Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Starting at b9771 (which includes commit
73618f27a— server: improve user message detection and create checkpoints at every user message (#24176)), Gemma 4 starts hallucinating prior tool calls and tool errors that never occurred in the current conversation. The same setup runs cleanly on b9723 — full rollback restores correct behavior.Setup that triggers the regression:
--chat-template-file. Our template wraps content with<|turn>system\n...<turn|>,<|turn>user\n...<turn|>,<|turn>model\n...<turn|>delimiters (custom, not the upstream Gemma 4 official delimiters).--skip-chat-parsingenabled (we parse tool calls client-side, not via the server's PEG parser).--jinjaenabled.Steps to reproduce:
llama-serverwith the command shown in the "Command line" field above."how do I crossfade between audio tracks?".[system, user]— no tool calls, no tool results, no prior history.Expected (matches behavior on b9723):
Model returns a clean technical explanation as plain text (~4000+ chars on this prompt).
Actual (b9771):
Model emits content claiming to have received tool errors that never occurred. From the model's
<|channel>thoughtsection:The string
loop_de_mapeamentois part of our client's tool-flow error vocabulary, but was never present in the payload sent to the server for this request. Confirmed by capturing and dumping the full request body — onlymessages[0]=system(project workspace path + lean instructions) andmessages[1]=user(the literal question) are present.The model then produces a truncated reply trying to act on the imaginary error state, instead of explaining the concept.
Hypothesis: the new server-side message-span detection logic may apply hardcoded Gemma 4 delimiters to identify message boundaries even when a custom Jinja template is loaded via
--chat-template-fileand--skip-chat-parsingis set. If those delimiters don't match what the custom template emits, the parsed history may be misaligned, leading the model to receive (or misinterpret) tokens that weren't part of the intended payload.Additional context:
tool_callJSON it produced on b9723..jinjafile works on the old build and breaks on the new one.First Bad Commit
73618f27a—server: improve user message detection and create checkpoints at every user message (#24176)Confirmed by bisection: b9723 (which lacks this commit) works correctly; b9771 (which includes it) reproduces the hallucination consistently.
Notable sub-changes inside this commit that may interact with custom Jinja templates:
chat: remove \n in gemma4 delimiterschat: merge msg delimiter structs into oneserver: improve message span logiccont: move message finding to server_tokens and skip mtmd tokensRelevant log output
Sent payload (captured client-side, verified clean)
Model raw output on b9771 (truncated, shows hallucinated content)
Same payload on b9723: clean ~4000-char technical explanation, no hallucinated tool history.
The strings the model hallucinated (
loop_de_mapeamento,Sources/RadioX) match vocabulary from prior conversations in our own application's history — suggesting the model may be seeing tokens from a previously processed session, possibly due to the new checkpoint-per-user-message logic interacting badly with the custom delimiter setup.