
[Bug] Embedding tokenizer mismatch causes silent chunking failures for non-OpenAI models #827

Description

@longirun

🐞 Bug Report

Describe the bug

The embedding client uses tiktoken (cl100k_base) to count tokens for chunk-size decisions, regardless of the configured embedding model. When a non-OpenAI model is used (e.g. BAAI/bge-m3, which uses an XLM-RoBERTa tokenizer), cl100k_base undercounts: on non-English text the model's real tokenizer produces roughly 30–40% more tokens than cl100k_base reports. Messages that the client considers "within limits" therefore exceed the model's token limit, and the embedding provider returns HTTP 400. The reconciler retries for ~3.3 hours (20 attempts × 10-min backoff), then marks the MessageEmbedding row as sync_state='failed', permanently dropping it from vector search.


Is this a regression?

Yes — introduced (or rather, exposed) by #578 / #678, which made the embedding model name and base URL configurable. Before that change, the hardcoded model was always an OpenAI model (text-embedding-3-small), so cl100k_base was a correct match. Once operators switched to bge-m3 or other non-OpenAI embedders, the tokenizer mismatch became a silent data-loss bug.


To Reproduce

  1. Configure a non-OpenAI embedding model:

    EMBEDDING__MODEL_CONFIG__MODEL=baai/bge-m3
    EMBEDDING__MODEL_CONFIG__BASE_URL=https://your-provider/v1
    
  2. Create a message with ~23,000+ characters of non-English (e.g. Russian) or technical text. cl100k_base estimates ~6,300 tokens (< 8,192 limit) → no chunking.

  3. The embedding provider receives the text, counts ~8,400 real tokens (> 8,192 limit) → HTTP 400.

  4. Reconciler retries 20 times over ~3.3 hours, then marks the MessageEmbedding as failed.

  5. The message is silently excluded from all vector similarity searches (WHERE embedding IS NOT NULL).

Measured data:

Message chars   cl100k tokens   bge-m3 tokens   Delta
23,327          6,313           8,379           +32.7%
25,000          6,316           8,972           +42.1%
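
The Delta column appears to be the bge-m3 count expressed relative to the cl100k_base count. A small check of that arithmetic, using only the figures reported above and the 8,192 limit from the reproduction steps (the helper name `delta_pct` is made up for this illustration):

```python
# Figures copied from the measured data above; 8,192 is the limit quoted in
# the reproduction steps. Nothing here is re-measured.
LIMIT = 8192
rows = [
    # (message chars, cl100k tokens, bge-m3 tokens)
    (23_327, 6_313, 8_379),
    (25_000, 6_316, 8_972),
]


def delta_pct(cl100k: int, actual: int) -> float:
    """Extra tokens from the real tokenizer, as a percentage of the cl100k count."""
    return round((actual - cl100k) / cl100k * 100, 1)


for chars, cl100k, bge in rows:
    # In both rows the cl100k estimate fits the limit while the real count does not.
    print(chars, delta_pct(cl100k, bge), cl100k <= LIMIT, bge <= LIMIT)
```

Both rows pass the client-side check and fail the provider-side one, which is the gap described in steps 2 and 3.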

Expected behaviour

When the embedding model is changed, the token counter used for chunk-size decisions should match that model's actual tokenizer. Text that exceeds the model's real token limit should be split into chunks before being sent to the provider, not silently dropped after 20 failed retries.
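
As a rough illustration of the expected behaviour, here is a minimal sketch of token-based splitting that works with whatever tokenizer the model actually uses. The function `split_by_tokens` and its `encode` / `decode` parameters are hypothetical and are not taken from the Honcho codebase; the existing `prepare_chunks()` may be structured quite differently.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int,
) -> List[str]:
    """Split text into pieces of at most max_tokens tokens each.

    encode/decode must come from the tokenizer of the configured embedding
    model; counting with a different tokenizer reintroduces the mismatch.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ids = encode(text)
    if len(ids) <= max_tokens:
        return [text]  # already fits, send as-is
    # Cut the token sequence into fixed-size windows and turn each back into text.
    return [decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]
```

With real tokenizers, decoding a slice and re-encoding it does not always give exactly the same token count, and some tokenizers add special tokens on encode, so leaving a little headroom below the provider's limit is probably sensible.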


Your environment

  • OS: Linux (Docker)
  • Honcho Server Version: current main (post-plastic-labs/honcho#678)
  • Embedding model: baai/bge-m3 via OpenAI-compatible endpoint

Additional context

Root cause: src/embedding_client.py line ~182–185:

try:
    encoding = tiktoken.encoding_for_model(self.model)
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")

tiktoken.encoding_for_model("baai/bge-m3") raises KeyError, so it always falls back to cl100k_base. This encoding is used in prepare_chunks() and simple_batch_embed() to decide whether text fits within max_embedding_tokens.

Proposed solution: Add a configurable tokenizer field to the embedding model config:

# src/config.py — ConfiguredEmbeddingModelSettings
tokenizer: str | None = None  # env: EMBEDDING__MODEL_CONFIG__TOKENIZER

Supported values:

  • None — current auto-detect behaviour (backwards-compatible default)
  • hf:BAAI/bge-m3 — HuggingFace tokenizer via optional tokenizers package
  • file:/path/to/tokenizer.json — local HF tokenizer JSON
  • tiktoken:o200k_base — explicit tiktoken encoding name
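
One possible way to parse these values, sketched with the standard library only. `TokenizerSpec` and `parse_tokenizer_spec` are illustrative names, and treating an empty string like an unset value is an assumption rather than part of the proposal.

```python
from typing import NamedTuple, Optional


class TokenizerSpec(NamedTuple):
    kind: str             # "auto", "hf", "file", or "tiktoken"
    value: Optional[str]  # repo id, file path, or encoding name; None for "auto"


_KNOWN_KINDS = ("hf", "file", "tiktoken")


def parse_tokenizer_spec(spec: Optional[str]) -> TokenizerSpec:
    """Parse a '<kind>:<value>' tokenizer setting into its two parts."""
    if spec is None or not spec.strip():
        # Assumption: unset or empty means "keep the current auto-detect behaviour".
        return TokenizerSpec("auto", None)
    # Split on the first colon only, so file paths keep any later colons.
    kind, sep, value = spec.strip().partition(":")
    if not sep or kind not in _KNOWN_KINDS or not value:
        raise ValueError(f"unsupported tokenizer spec: {spec!r}")
    return TokenizerSpec(kind, value)
```

Rejecting unknown prefixes at load time surfaces a misconfiguration at startup instead of as a failed embedding hours later.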

Implementation outline:

  • Add TokenizerLike protocol (encode / decode) in embedding_client.py
  • Add _HuggingFaceTokenizer adapter wrapping tokenizers.Tokenizer
  • Add _resolve_tokenizer(model, spec) factory
  • Replace hardcoded tiktoken call with factory
  • Add tokenizers as optional dependency in pyproject.toml

This is fully backwards-compatible: when tokenizer is unset (default), current behaviour is preserved. Operators who switch to non-OpenAI models can specify the correct tokenizer via env var.
