Pluggable LLM backend — program consumers to the LLM interface

Would you be open to a small refactor that lets qmd run against an alternative `LLM` backend without forking the engine?

Right now the engine hardcodes the concrete `LlamaCpp` class in a few places — `LLMSessionManager`, the store's `getLlm`, and `createStore` — even though an `LLM` interface already exists and `LlamaCpp implements LLM`. So the interface is there, but nothing can actually be substituted for it.

I'd like to program those consumers to the `LLM` interface and add a single injection seam, keeping `LlamaCpp` as the default everywhere (no behaviour change for existing users). Concretely:

1. `LLMSessionManager` / `getSessionManager` / store helpers hold and resolve `LLM` instead of `LlamaCpp`.
2. A `getDefaultLLM()` / `setDefaultLLM()` pair (falls back to the existing `LlamaCpp` singleton), plus a `createStore({ llm })` per-store override.
3. Add the members consumers already use to the `LLM` interface: `embedBatch`, the `embedModelName` / `generateModelName` / `rerankModelName` getters, and `intent?` on `expandQuery`. (Side note: `LlamaCpp.expandQuery` already accepts `intent` and the store passes it today, but the `LLM` interface doesn't declare it, so the interface is currently narrower than both the implementation and its callers.)
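To make points 1 and 2 concrete, here is a rough sketch of what the injection seam could look like. It is illustrative only and not the code on the branch: the `LLM` interface is trimmed to a few members, the `LlamaCpp` class is a stand-in with placeholder behaviour, and the `Store` shape and the `null` reset on `setDefaultLLM` are assumptions made for the example.

```typescript
// Trimmed-down, hypothetical version of the LLM interface; the real one has more members.
interface LLM {
  readonly embedModelName: string;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}

// Stand-in for the existing default backend (placeholder behaviour only).
class LlamaCpp implements LLM {
  readonly embedModelName = "local-embed-model"; // placeholder name
  async embed(text: string): Promise<number[]> {
    return [text.length];
  }
  async embedBatch(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map((t) => this.embed(t)));
  }
}

let llamaCppSingleton: LlamaCpp | null = null; // existing lazy singleton
let defaultOverride: LLM | null = null;        // process-wide override, if any

// Resolve the process-wide default: the override if set, else the LlamaCpp singleton.
function getDefaultLLM(): LLM {
  if (defaultOverride) return defaultOverride;
  if (!llamaCppSingleton) llamaCppSingleton = new LlamaCpp();
  return llamaCppSingleton;
}

// Swap the process-wide default; passing null restores the LlamaCpp fallback.
function setDefaultLLM(llm: LLM | null): void {
  defaultOverride = llm;
}

// Hypothetical minimal store shape: only the part that resolves the backend.
interface Store {
  getLlm(): LLM;
}

// A per-store `llm` takes precedence over the process-wide default.
function createStore(options: { llm?: LLM } = {}): Store {
  return { getLlm: () => options.llm ?? getDefaultLLM() };
}
```

With this shape, `createStore()` behaves as it does today, while `createStore({ llm: myBackend })` scopes a different backend to a single store without touching the global default.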

The one design question I'd want your read on: **token-level methods.** `chunkDocumentByTokens` calls `tokenize`/`detokenize` for token-aware truncation, but those have no equivalent on a non-llama.cpp backend (e.g. a remote model server with no local tokenizer). Rather than force every backend to implement them, I'd declare `tokenize`/`detokenize`/`countTokens` **optional** on the interface and have truncation fall back to character-based handling when they're absent. When the backend *can* tokenize (llama.cpp), the path is unchanged. This keeps the contract honest for any backend, not just a remote one.
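A minimal sketch of the optional-capability idea, to show the shape of the fallback rather than the actual implementation. The method signatures (numeric token IDs, async methods), the helper name `truncateToTokens`, and the 4-characters-per-token estimate are all assumptions made for the example; only the `tokenize`/`detokenize` names come from the existing code.

```typescript
// Only the token-level members are shown; both are optional on the interface.
// Signatures are assumed for illustration.
interface LLM {
  tokenize?(text: string): Promise<number[]>;
  detokenize?(tokens: number[]): Promise<string>;
}

// Assumed rough average used only when the backend cannot tokenize.
const APPROX_CHARS_PER_TOKEN = 4;

// Hypothetical helper: truncate `text` to at most `maxTokens` tokens when the
// backend can tokenize, otherwise to an approximate character budget.
async function truncateToTokens(
  llm: LLM,
  text: string,
  maxTokens: number,
): Promise<string> {
  if (llm.tokenize && llm.detokenize) {
    // Token-aware path: unchanged for backends with a local tokenizer.
    const tokens = await llm.tokenize(text);
    if (tokens.length <= maxTokens) return text;
    return llm.detokenize(tokens.slice(0, maxTokens));
  }
  // Character-based fallback for tokenizer-less backends.
  return text.slice(0, maxTokens * APPROX_CHARS_PER_TOKEN);
}
```

The capability check lives in one place, so a backend opts in simply by implementing both methods; callers never need to know which path ran.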

Motivation on my end is running qmd as a thin client against a small server that does the embedding/reranking on dedicated hardware — but the decoupling stands on its own (testability, substitutability) regardless of that use case.

I have it working against current `main` (v2.5.3): ~120/40 lines across 3 files, with `test:types` and `build` green, and the character-fallback truncation exercised through an injected backend that has no tokenizer. Reference branch: https://github.com/jaylfc/qmd/tree/feat/pluggable-llm-backend (the first commit is the decoupling; the others are my remote backend built on top and are not part of the ask).

Happy to open the PR if the direction looks good, or adjust the approach (especially the optional-capability question) first.


Pluggable LLM backend — program consumers to the LLM interface #692
