🐞 Bug Report
Describe the bug
The embedding client uses tiktoken (cl100k_base) to count tokens for chunk-size decisions, regardless of the configured embedding model. When a non-OpenAI model is used (e.g. BAAI/bge-m3, which uses an XLM-RoBERTa tokenizer), cl100k_base underestimates token counts by 30–40% on non-English text. Messages that the client considers "within limits" actually exceed the model's token limit, so the embedding provider returns HTTP 400. The reconciler retries for ~3.3 hours (20 attempts × 10-min backoff), then marks the MessageEmbedding row as sync_state='failed', permanently dropping it from vector search.
Is this a regression?
Yes — introduced (or rather, exposed) by #578 / #678, which made the embedding model name and base URL configurable. Before that change, the hardcoded model was always an OpenAI model (text-embedding-3-small), so cl100k_base was a correct match. Once operators switched to bge-m3 or other non-OpenAI embedders, the tokenizer mismatch became a silent data-loss bug.
To Reproduce
- Configure a non-OpenAI embedding model:
      EMBEDDING__MODEL_CONFIG__MODEL=baai/bge-m3
      EMBEDDING__MODEL_CONFIG__BASE_URL=https://your-provider/v1
- Create a message with ~23,000+ characters of non-English (e.g. Russian) or technical text. cl100k_base estimates ~6,300 tokens (< 8,192 limit) → no chunking.
- The embedding provider receives the text, counts ~8,400 real tokens (> 8,192 limit) → HTTP 400.
- Reconciler retries 20 times over ~3.3 hours, then marks the MessageEmbedding as failed.
- The message is silently excluded from all vector similarity searches (WHERE embedding IS NOT NULL).
Measured data:
| Message chars | cl100k tokens | bge-m3 tokens | Delta |
|---|---|---|---|
| 23,327 | 6,313 | 8,379 | +32.7% |
| 25,000 | 6,316 | 8,972 | +42.1% |
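As a quick sanity check on the Delta column, the arithmetic can be reproduced in a few lines. This is only a sketch over the figures quoted in this report (the helper name `undercount_pct` is made up for illustration); it does not run either tokenizer.

```python
# Figures copied from the measured data; 8,192 is the token limit quoted in this report.
TOKEN_LIMIT = 8192
ROWS = [
    # (message chars, cl100k_base tokens, bge-m3 tokens)
    (23_327, 6_313, 8_379),
    (25_000, 6_316, 8_972),
]


def undercount_pct(estimated: int, actual: int) -> float:
    """Percentage by which the real token count exceeds the client's estimate."""
    return (actual - estimated) / estimated * 100


for chars, cl100k, bge in ROWS:
    print(
        f"{chars:>6} chars: +{undercount_pct(cl100k, bge):.1f}% | "
        f"client decides to chunk: {cl100k > TOKEN_LIMIT} | "
        f"provider rejects: {bge > TOKEN_LIMIT}"
    )
```

For both rows the client's estimate stays under the limit while the real count exceeds it, which is exactly the window in which the HTTP 400 occurs.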
Expected behaviour
When the embedding model is changed, the token counter used for chunk-size decisions should match that model's actual tokenizer. Text that exceeds the model's real token limit should be split into chunks before being sent to the provider, not silently dropped after 20 failed retries.
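A minimal sketch of the expected behaviour, assuming the tokenizer is passed in as an object with `encode` and `decode` methods. The names `split_by_tokens` and `CharTokenizer` are hypothetical and are not taken from the Honcho codebase; the real `prepare_chunks()` may split differently.

```python
from typing import Protocol


class TokenizerLike(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


def split_by_tokens(text: str, tokenizer: TokenizerLike, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens tokens, as counted by `tokenizer`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenizer.encode(text)
    if len(tokens) <= max_tokens:
        return [text]  # fits according to the model's own tokenizer
    return [
        tokenizer.decode(tokens[start : start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


class CharTokenizer:
    """Toy stand-in (one token per character), used only to exercise the splitter."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: list[int]) -> str:
        return "".join(chr(t) for t in tokens)


print(split_by_tokens("abcdefghij", CharTokenizer(), max_tokens=4))
```

With a real subword tokenizer, decoding a slice and re-encoding it can shift the count slightly at chunk boundaries, so a production version would probably want some headroom below the hard limit.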
Your environment
- OS: Linux (Docker)
- Honcho Server Version: current main (post-plastic-labs/honcho#678)
- Embedding model: baai/bge-m3 via OpenAI-compatible endpoint
Additional context
Root cause: src/embedding_client.py line ~182–185:
    try:
        encoding = tiktoken.encoding_for_model(self.model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
tiktoken.encoding_for_model("baai/bge-m3") raises KeyError, so it always falls back to cl100k_base. This encoding is used in prepare_chunks() and simple_batch_embed() to decide whether text fits within max_embedding_tokens.
Proposed solution: Add a configurable tokenizer field to the embedding model config:
    # src/config.py: ConfiguredEmbeddingModelSettings
    tokenizer: str | None = None  # env: EMBEDDING__MODEL_CONFIG__TOKENIZER
Supported values:
- None: current auto-detect behaviour (backwards-compatible default)
- hf:BAAI/bge-m3: HuggingFace tokenizer via optional tokenizers package
- file:/path/to/tokenizer.json: local HF tokenizer JSON
- tiktoken:o200k_base: explicit tiktoken encoding name
Implementation outline:
- Add TokenizerLike protocol (encode / decode) in embedding_client.py
- Add _HuggingFaceTokenizer adapter wrapping tokenizers.Tokenizer
- Add _resolve_tokenizer(model, spec) factory
- Replace the hardcoded tiktoken call with the factory
- Add tokenizers as an optional dependency in pyproject.toml
This is fully backwards-compatible: when tokenizer is unset (default), current behaviour is preserved. Operators who switch to non-OpenAI models can specify the correct tokenizer via env var.
Related issues: