[Bug] Embedding tokenizer mismatch causes silent chunking failures for non-OpenAI models

# **🐞 Bug Report**

## **Describe the bug**

The embedding client uses `tiktoken` (`cl100k_base`) to count tokens for chunk-size decisions, regardless of the configured embedding model. When a non-OpenAI model is used (e.g. `BAAI/bge-m3`, which uses an XLM-RoBERTa tokenizer), `cl100k_base` underestimates token counts by 30–40% on non-English text. Messages that the client considers "within limits" actually exceed the model's token limit, causing the embedding provider to return HTTP 400. The reconciler retries for \~3 hours (20 attempts × 10-min backoff), then marks the `MessageEmbedding` row as `sync_state='failed'` — permanently dropping it from vector search.

---

### **Is this a regression?**

Yes — introduced (or rather, exposed) by plastic-labs/honcho#578 / plastic-labs/honcho#678, which made the embedding model name and base URL configurable. Before that change, the hardcoded model was always an OpenAI model (`text-embedding-3-small`), so `cl100k_base` was a correct match. Once operators switched to `bge-m3` or other non-OpenAI embedders, the tokenizer mismatch became a silent data-loss bug.

---

### **To Reproduce**

1. Configure a non-OpenAI embedding model:

   ```
   EMBEDDING__MODEL_CONFIG__MODEL=baai/bge-m3
   EMBEDDING__MODEL_CONFIG__BASE_URL=https://your-provider/v1
   ```
2. Create a message with \~23,000+ characters of non-English (e.g. Russian) or technical text. `cl100k_base` estimates \~6,300 tokens (< 8,192 limit) → no chunking.
3. The embedding provider receives the text, counts \~8,400 real tokens (> 8,192 limit) → HTTP 400.
4. Reconciler retries 20 times over \~3.3 hours, then marks the `MessageEmbedding` as `failed`.
5. The message is silently excluded from all vector similarity searches (`WHERE embedding IS NOT NULL`).

**Measured data:**


| Message chars | cl100k tokens | bge-m3 tokens | Delta |
| -- | -- | -- | -- |
| 23,327 | 6,313 | 8,379 | +32.7% |
| 25,000 | 6,316 | 8,972 | +42.1% |

---

### **Expected behaviour**

When the embedding model is changed, the token counter used for chunk-size decisions should match that model's actual tokenizer. Text that exceeds the model's real token limit should be split into chunks before being sent to the provider, not silently dropped after 20 failed retries.

---

### **Your environment**

* OS: Linux (Docker)
* Honcho Server Version: current `main` (post-plastic-labs/honcho#678)
* Embedding model: `baai/bge-m3` via OpenAI-compatible endpoint

---

### **Additional context**

**Root cause:** `src/embedding_client.py` line \~182–185:

```python
try:
    encoding = tiktoken.encoding_for_model(self.model)
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")
```

`tiktoken.encoding_for_model("baai/bge-m3")` raises `KeyError`, so it always falls back to `cl100k_base`. This encoding is used in `prepare_chunks()` and `simple_batch_embed()` to decide whether text fits within `max_embedding_tokens`.

**Proposed solution:** Add a configurable `tokenizer` field to the embedding model config:

```python
# src/config.py — ConfiguredEmbeddingModelSettings
tokenizer: str | None = None  # env: EMBEDDING__MODEL_CONFIG__TOKENIZER
```

Supported values:

* `None` — current auto-detect behaviour (backwards-compatible default)
* `hf:BAAI/bge-m3` — HuggingFace tokenizer via optional `tokenizers` package
* `file:/path/to/tokenizer.json` — local HF tokenizer JSON
* `tiktoken:o200k_base` — explicit tiktoken encoding name

Implementation outline:

* Add `TokenizerLike` protocol (`encode` / `decode`) in `embedding_client.py`
* Add `_HuggingFaceTokenizer` adapter wrapping `tokenizers.Tokenizer`
* Add `_resolve_tokenizer(model, spec)` factory
* Replace hardcoded `tiktoken` call with factory
* Add `tokenizers` as optional dependency in `pyproject.toml`

This is fully backwards-compatible: when `tokenizer` is unset (default), current behaviour is preserved. Operators who switch to non-OpenAI models can specify the correct tokenizer via env var.

**Related issues:**

* plastic-labs/honcho#578 — made model name / base URL configurable (but didn't address tokenizer)
* plastic-labs/honcho#625 — dimension mismatch (related but different: vector dims, not token count)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] Embedding tokenizer mismatch causes silent chunking failures for non-OpenAI models #827

🐞 Bug Report

Describe the bug

Is this a regression?

To Reproduce

Expected behaviour

Your environment

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug] Embedding tokenizer mismatch causes silent chunking failures for non-OpenAI models #827

Description

🐞 Bug Report

Describe the bug

Is this a regression?

To Reproduce

Expected behaviour

Your environment

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions