Title: Support for multiple concurrent model instances in the server example #24947

BYSF-1 · 2026-06-23T14:26:58Z

Hi Georgi,

I'm building an AI workflow engine (similar to n8n but AI-native), using llama.cpp as the inference backend via node-llama-cpp. First, thank you for this incredible project; it's the backbone of the entire local LLM ecosystem.

I'm running into a structural limitation that I believe is worth discussing at the llama.cpp level.

The problem: single-model instance in server mode

In a workflow, different nodes need different models. For example, node A does deep reasoning (DeepSeek-R1 8B) and node B does fast summarization (Qwen 1.5B). Currently, the server example supports only one model at a time, so switching means:

1. Unload model A (~2-3 seconds)
2. Load model B (~10-30 seconds for 8B on CPU)
3. Node B starts executing

Worse: when the workflow DAG has parallelism, node C might need model A while node B is using model B. With a single-model server, the two cannot run concurrently.

What I'm suggesting

A "multi-slot" server mode where multiple models can be loaded simultaneously and routed by model name:

# Proposed: start server with multiple model slots
./llama-server \
  --model-slot deepseek=deepseek-r1-8b-q4.gguf \
  --model-slot qwen=qwen2.5-1.5b-q4.gguf

# Then requests route automatically:
curl /v1/chat/completions -d '{"model":"deepseek", ...}'  # → slot 0
curl /v1/chat/completions -d '{"model":"qwen", ...}'      # → slot 1

Each slot manages its own context, but they share the same HTTP server and port.
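
To make the routing idea concrete, here is a minimal Python sketch of dispatch by model name. It is only an illustration of the proposal, not llama.cpp code: `ModelSlot` and `SlotRouter` are hypothetical names, and the `NAME=PATH` strings simply mirror the proposed `--model-slot` arguments.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSlot:
    """One loaded model with its own context (hypothetical structure)."""
    index: int
    name: str
    path: str
    handled: list = field(default_factory=list)  # requests routed to this slot


class SlotRouter:
    """Picks a slot from the "model" field of an incoming request."""

    def __init__(self, specs):
        # Each spec mirrors one proposed --model-slot NAME=PATH argument.
        self.slots = {}
        for index, spec in enumerate(specs):
            name, sep, path = spec.partition("=")
            if not sep or not name or not path:
                raise ValueError(f"expected NAME=PATH, got {spec!r}")
            if name in self.slots:
                raise ValueError(f"duplicate slot name {name!r}")
            self.slots[name] = ModelSlot(index, name, path)

    def route(self, request):
        # A missing or unknown model name is an error, not a silent fallback.
        name = request.get("model")
        if name not in self.slots:
            raise KeyError(f"no slot loaded for model {name!r}")
        slot = self.slots[name]
        slot.handled.append(request)
        return slot


router = SlotRouter([
    "deepseek=deepseek-r1-8b-q4.gguf",
    "qwen=qwen2.5-1.5b-q4.gguf",
])
deepseek_slot = router.route({"model": "deepseek"})
qwen_slot = router.route({"model": "qwen"})
```

Slot indices follow the order of the arguments, so `deepseek` lands on slot 0 and `qwen` on slot 1, matching the example requests. Whether an unknown model name should be rejected or fall back to a default slot is a design choice this sketch leaves open by raising an error.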

Why this matters beyond my use case

- Agent orchestration (browser-use, CrewAI) often dispatches subtasks to different models
- RAG pipelines: embedding model + generation model running side by side
- Speculative decoding already loads a draft model, so this is the natural generalization of that concept

I understand the memory constraints. This isn't about forcing everyone to load 10 models. It's about giving developers who have the memory the ability to use it productively.

If this is out of scope for the server example, even a lighter approach would help: an API to query "what model is currently loaded" and a "warm-up" endpoint that accepts a model URI and starts loading in the background, returning a promise/future that the client can poll. This would at least allow workflow engines to pre-load the next model while the current one is still generating.
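
The lighter approach could behave roughly like the Python sketch below, which models a background load that a client polls. Everything here is an assumption for illustration: `WarmupManager`, its method names, and the `"loading"` / `"ready"` / `"failed"` states are hypothetical, and `fake_load` stands in for real model loading.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class WarmupManager:
    """Loads models in the background and reports progress (hypothetical)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._jobs = {}
        self._next_id = 0
        self.current_model = None  # answers "what model is currently loaded"

    def warm_up(self, uri):
        # Start loading without blocking the caller; return a pollable id.
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = self._pool.submit(self._load_fn, uri)
        return job_id

    def status(self, job_id):
        future = self._jobs[job_id]
        if not future.done():
            return "loading"
        return "failed" if future.exception() else "ready"

    def activate(self, job_id):
        # Swap in a finished model; refuse if it is not ready yet.
        if self.status(job_id) != "ready":
            raise RuntimeError("model is not ready")
        self.current_model = self._jobs[job_id].result()
        return self.current_model


gate = threading.Event()


def fake_load(uri):
    gate.wait(timeout=5)  # stands in for a slow model load
    return f"loaded:{uri}"


manager = WarmupManager(fake_load)
job = manager.warm_up("qwen2.5-1.5b-q4.gguf")
status_while_loading = manager.status(job)  # current node keeps generating here
gate.set()                                  # the simulated load finishes
while manager.status(job) == "loading":     # client-side polling loop
    time.sleep(0.01)
manager.activate(job)
```

In a workflow engine, the call to `warm_up` would be issued as soon as the scheduler knows which model the next node needs, so the load overlaps with the current node's generation instead of following it.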

Happy to discuss further or contribute if there's interest.

Replies: 0 comments