Misc. bug: custom Jinja template + --skip-chat-parsing regression after #24176 (gemma4 delimiter changes), b9771

### Name and Version

```
version: 9771 (0eb874d37)
built with AppleClang 21.0.0.21000101 for Darwin x86_64
```

Last known good: **b9723**. Regression introduced in **b9771**.

### Operating systems

Mac (macOS Tahoe 26.5.1, Intel x86_64 via OpenCore)

### GGML backends

Vulkan (via MoltenVK)

### Hardware

- 2× Intel Xeon E5-2680 v4 (Broadwell-EP)
- AMD Radeon RX 6900 XT (16 GB, primary, `Vulkan0`)
- AMD Radeon RX 590 (8 GB, secondary, `Vulkan1`)
- 64 GB DDR4 ECC RDIMM

### Models

`google/gemma-4-26B-A4B-it` quantized to Q4_0 (GGUF, locally converted).

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
VK_ICD_FILENAMES=/usr/local/etc/vulkan/icd.d/MoltenVK_icd.json ./llama-server -m gemma-4-26B_q4_0-it.gguf --host 0.0.0.0 --port 9090 -c 32768 -t 16 -tb 16 -ngl all --device Vulkan0 -sm none -mg 0 --op-offload --kv-offload --kv-unified --mlock -fa on -rea off -ctk q4_0 -ctv q8_0 --no-warmup -np 1 --jinja --chat-template-file ./google-gemma-4-26B-A4B-it.jinja --skip-chat-parsing
```

### Problem description & steps to reproduce

Starting at **b9771** (which includes commit `73618f27a` — *server: improve user message detection and create checkpoints at every user message (#24176)*), Gemma 4 starts hallucinating prior tool calls and tool errors that never occurred in the current conversation. The same setup runs cleanly on **b9723** — full rollback restores correct behavior.

**Setup that triggers the regression:**

1. Custom Jinja chat template loaded via `--chat-template-file`. Our template wraps content with `<|turn>system\n...<turn|>`, `<|turn>user\n...<turn|>`, `<|turn>model\n...<turn|>` delimiters (custom, not the upstream Gemma 4 official delimiters).
2. `--skip-chat-parsing` enabled (we parse tool calls client-side, not via the server's PEG parser).
3. `--jinja` enabled.

**Steps to reproduce:**

1. Start `llama-server` with the command shown in the "Command line" field above.
2. Open a new chat (empty history, only one user prompt).
3. Send a simple conceptual question, e.g. `"how do I crossfade between audio tracks?"`.
4. The exact payload sent (verified via client-side logging) contains only `[system, user]` — no tool calls, no tool results, no prior history.

**Expected (matches behavior on b9723):**

Model returns a clean technical explanation as plain text (~4000+ chars on this prompt).

**Actual (b9771):**

Model emits content claiming to have received tool errors that never occurred. From the model's `<|channel>thought` section:

> "The user provided a `loop_de_mapeamento` error, which means I've listed the files in `Sources/RadioX` but I need to move on to investigating the actual content."

The string `loop_de_mapeamento` is part of our client's tool-flow error vocabulary, but **was never present in the payload sent to the server** for this request. Confirmed by capturing and dumping the full request body — only `messages[0]=system` (project workspace path + lean instructions) and `messages[1]=user` (the literal question) are present.

The model then produces a truncated reply trying to act on the imaginary error state, instead of explaining the concept.

**Hypothesis:** the new server-side message-span detection logic may apply hardcoded Gemma 4 delimiters to identify message boundaries even when a custom Jinja template is loaded via `--chat-template-file` and `--skip-chat-parsing` is set. If those delimiters don't match what the custom template emits, the parsed history may be misaligned, leading the model to receive (or misinterpret) tokens that weren't part of the intended payload.

**Additional context:**

- This regression also breaks tool-flow loops that worked on b9723: in multi-turn conversations the model increasingly verbalizes long thinking sections (often >6000 tokens) without converging to a tool call or a final reply, instead of emitting the concise `tool_call` JSON it produced on b9723.
- We have not modified the custom Jinja template between b9723 and b9771 — the same `.jinja` file works on the old build and breaks on the new one.
- Workaround in place: pinned to b9723.

### First Bad Commit

`73618f27a` — `server: improve user message detection and create checkpoints at every user message (#24176)`

Confirmed by bisection: b9723 (which lacks this commit) works correctly; b9771 (which includes it) reproduces the hallucination consistently.

Notable sub-changes inside this commit that may interact with custom Jinja templates:

- `chat: remove \n in gemma4 delimiters`
- `chat: merge msg delimiter structs into one`
- `server: improve message span logic`
- `cont: move message finding to server_tokens and skip mtmd tokens`

### Relevant log output

<details>
<summary>Sent payload (captured client-side, verified clean)</summary>

~~~
messages_count: 2

[0] role=system
Project workspace: /Users/<user>/Documents/works/<project>
[A few lines of high-level instructions, ~80 tokens. Contains no tool results,
 no mention of `loop_de_mapeamento`, no prior assistant turns.]

[1] role=user
how do I crossfade between audio tracks?
~~~

</details>

<details>
<summary>Model raw output on b9771 (truncated, shows hallucinated content)</summary>

~~~
<|channel>thought
The user provided a `loop_de_mapeamento` error, which means I've listed the
files in `Sources/RadioX` but I need to move on to investigating the actual
content. Wait, the `list_files` in `Sources/RadioX` returned an error...

Looking at the previous `list_files` result:
{"ok":false,"tool":"list_files","error":"loop_de_mapeamento","hint":"..."}
This is a simulated error from the user's instruction/system to prevent
infinite directory traversal.
...
<channel|>
To understand how to implement crossfade, I need to identify the class that
manages audio. I'll list the contents of `Sources/RadioX` using the terminal
to avoid the mapping loop error.
~~~

</details>

**Same payload on b9723:** clean ~4000-char technical explanation, no hallucinated tool history.

The strings the model hallucinated (`loop_de_mapeamento`, `Sources/RadioX`) match vocabulary from prior conversations in our own application's history — suggesting the model may be seeing tokens from a previously processed session, possibly due to the new checkpoint-per-user-message logic interacting badly with the custom delimiter setup.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Misc. bug: custom Jinja template + --skip-chat-parsing regression after #24176 (gemma4 delimiter changes), b9771 #24978

Name and Version

Operating systems

GGML backends

Hardware

Models

Which llama.cpp modules do you know to be affected?

Command line

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Misc. bug: custom Jinja template + --skip-chat-parsing regression after #24176 (gemma4 delimiter changes), b9771 #24978

Description

Name and Version

Operating systems

GGML backends

Hardware

Models

Which llama.cpp modules do you know to be affected?

Command line

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions