Pluggable LLM backend — program consumers to the LLM interface #692

@jaylfc

Would you be open to a small refactor that lets qmd run against an alternative LLM backend without forking the engine?

Right now the engine hardcodes the concrete LlamaCpp class in a few places — LLMSessionManager, the store's getLlm, and createStore — even though an LLM interface already exists and LlamaCpp implements LLM. So the interface is there, but nothing can actually be substituted for it.

I'd like to program those consumers to the LLM interface and add a single injection seam, keeping LlamaCpp as the default everywhere (no behaviour change for existing users). Concretely:

  1. LLMSessionManager / getSessionManager / store helpers hold and resolve LLM instead of LlamaCpp.
  2. A getDefaultLLM() / setDefaultLLM() pair (falls back to the existing LlamaCpp singleton), plus a createStore({ llm }) per-store override.
  3. Add the members consumers already use to the LLM interface: embedBatch, the embed/generate/rerankModelName getters, and intent? on expandQuery. (Side note: LlamaCpp.expandQuery already accepts intent, but the LLM interface doesn't declare it — the store passes it today, so the interface is currently narrower than both the implementation and its callers.)
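To make the three points above concrete, here is a minimal sketch of the shape I have in mind. The member signatures are simplified guesses rather than qmd's real ones, and `StubLlamaCpp` is a hypothetical stand-in for the existing LlamaCpp singleton so the snippet runs on its own.

```typescript
// Simplified, assumed shape of the widened interface (not the real signatures).
interface LLM {
  readonly embedModelName: string;
  embedBatch(texts: string[]): Promise<number[][]>;
  expandQuery(query: string, options?: { intent?: string }): Promise<string[]>;
}

// Hypothetical stand-in for the existing LlamaCpp singleton.
class StubLlamaCpp implements LLM {
  readonly embedModelName = "llama-cpp-default";
  async embedBatch(texts: string[]): Promise<number[][]> {
    return texts.map((t) => [t.length]);
  }
  async expandQuery(query: string): Promise<string[]> {
    return [query];
  }
}

let builtinLLM: LLM | null = null; // lazily created default backend
let overrideLLM: LLM | null = null; // process-wide injected backend

// Passing null restores the built-in default.
function setDefaultLLM(llm: LLM | null): void {
  overrideLLM = llm;
}

// Falls back to the built-in backend when nothing was injected.
function getDefaultLLM(): LLM {
  if (overrideLLM) return overrideLLM;
  builtinLLM ??= new StubLlamaCpp();
  return builtinLLM;
}

// The per-store override wins over the process-wide default.
function createStore(options: { llm?: LLM } = {}) {
  return {
    getLlm: (): LLM => options.llm ?? getDefaultLLM(),
  };
}

// A remote backend only has to satisfy the interface.
const remoteBackend: LLM = {
  embedModelName: "remote-embedder",
  embedBatch: async (texts) => texts.map(() => [0]),
  expandQuery: async (query, options) =>
    options?.intent ? [query, options.intent] : [query],
};

const defaultStore = createStore();
const remoteStore = createStore({ llm: remoteBackend });
console.log(defaultStore.getLlm().embedModelName);
console.log(remoteStore.getLlm().embedModelName);
```

Resolution order is per-store `llm`, then whatever `setDefaultLLM` installed, then the built-in backend, so existing callers that pass nothing keep getting LlamaCpp.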

The one design question I'd want your read on: token-level methods. chunkDocumentByTokens calls tokenize/detokenize for token-aware truncation, but those have no equivalent on a non-llama.cpp backend (e.g. a remote model server with no local tokenizer). Rather than force every backend to implement them, I'd declare tokenize/detokenize/countTokens optional on the interface and have truncation fall back to character-based handling when they're absent. When the backend can tokenize (llama.cpp), the path is unchanged. This keeps the contract honest for any backend, not just a remote one.
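A rough sketch of the optional-capability fallback, to show what I mean. Everything here is illustrative: `truncateToTokens`, `TokenCapable`, and the 4-characters-per-token estimate are assumptions of mine, not existing qmd code, and the methods are synchronous only to keep the snippet short (the real ones may well be async).

```typescript
// Token-level methods are optional: a backend declares them only if it can tokenize.
interface TokenCapable {
  tokenize?(text: string): number[];
  detokenize?(tokens: number[]): string;
}

// Assumed rough average used only by the character fallback.
const APPROX_CHARS_PER_TOKEN = 4;

function truncateToTokens(
  llm: TokenCapable,
  text: string,
  maxTokens: number,
): string {
  if (llm.tokenize && llm.detokenize) {
    // Token-aware path: same behaviour as when a tokenizer is always present.
    const tokens = llm.tokenize(text);
    if (tokens.length <= maxTokens) return text;
    return llm.detokenize(tokens.slice(0, maxTokens));
  }
  // Character-based fallback for backends with no local tokenizer.
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}

// Toy word-level tokenizer standing in for a tokenizer-capable backend.
const vocab: string[] = [];
const wordBackend: TokenCapable = {
  tokenize: (text) =>
    text.split(" ").map((word) => {
      let id = vocab.indexOf(word);
      if (id === -1) {
        id = vocab.length;
        vocab.push(word);
      }
      return id;
    }),
  detokenize: (tokens) => tokens.map((id) => vocab[id]).join(" "),
};

// A backend with no tokenizer at all, e.g. a remote model server.
const tokenizerlessBackend: TokenCapable = {};

console.log(truncateToTokens(wordBackend, "one two three four", 2));
console.log(truncateToTokens(tokenizerlessBackend, "abcdefghijkl", 2));
```

The fallback is deliberately conservative about nothing: it is an estimate, so a backend that cares about exact budgets should implement the token methods.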

Motivation on my end is running qmd as a thin client against a small server that does the embedding/reranking on dedicated hardware — but the decoupling stands on its own (testability, substitutability) regardless of that use case.

I have it working against current main (v2.5.3) — ~120/40 lines across 3 files, test:types + build green, plus the char-fallback truncation exercised by an injected tokenizer-less backend. Reference branch: https://github.com/jaylfc/qmd/tree/feat/pluggable-llm-backend (first commit is the decoupling; the others are my remote backend built on top, not part of the ask).

Happy to open the PR if the direction looks good, or adjust the approach (especially the optional-capability question) first.
