Hi Georgi,
I'm building an AI workflow engine (similar to n8n but AI-native), using llama.cpp as the inference backend via node-llama-cpp. First, thank you for this incredible project: it's the backbone of the entire local LLM ecosystem.
I'm running into a structural limitation that I believe is worth discussing at the llama.cpp level.
The problem: single-model instance in server mode
In a workflow, different nodes need different models — for example, node A does deep reasoning (DeepSeek-R1 8B), node B does fast summarization (Qwen 1.5B). Currently, the server example only supports one model at a time. Switching means:
1. Unload model A (~2-3 seconds)
2. Load model B (~10-30 seconds for 8B on CPU)
3. Node B starts executing
Worse: when the workflow DAG has parallelism, node C might need model A while node B is using model B. With a single-model server, this is impossible.
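To make the switching cost concrete, here is a rough sketch that counts how many model swaps a sequential run of nodes triggers on a single-model server. The helper names `countSwaps` and `swapOverheadSeconds` are hypothetical, and treating every swap as the same unload + load cost is a simplifying assumption, since load time depends on model size.

```typescript
// Hypothetical helper: counts how many times a single-model server must
// swap models when nodes run one after another in the given order.
function countSwaps(modelPerNode: string[]): number {
  let swaps = 0;
  let loaded: string | undefined = undefined; // nothing loaded at start
  for (const model of modelPerNode) {
    if (loaded !== undefined && loaded !== model) {
      swaps += 1; // unload the current model, then load the next one
    }
    loaded = model;
  }
  return swaps;
}

// Rough overhead in seconds, assuming every swap costs the same
// unload + load time (a simplification: load time depends on model size).
function swapOverheadSeconds(
  modelPerNode: string[],
  unloadSeconds: number,
  loadSeconds: number,
): number {
  return countSwaps(modelPerNode) * (unloadSeconds + loadSeconds);
}
```

For a workflow that alternates `["deepseek", "qwen", "deepseek", "qwen"]`, this gives three swaps. With the low-end figures above (2 s unload, 10 s load) that is 36 seconds spent only on switching, not counting the initial load.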
What I'm suggesting
A "multi-slot" server mode where multiple models can be loaded simultaneously and routed by model name:
```
# Proposed: start server with multiple model slots
./llama-server \
  --model-slot deepseek=deepseek-r1-8b-q4.gguf \
  --model-slot qwen=qwen2.5-1.5b-q4.gguf

# Then requests route automatically:
curl /v1/chat/completions -d '{"model":"deepseek", ...}'   # → slot 0
curl /v1/chat/completions -d '{"model":"qwen", ...}'       # → slot 1
```
Each slot manages its own context, but they share the same HTTP server and port.
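As a minimal sketch of the routing idea, the lookup from the request's `model` field to a slot could look like the following. This is not llama.cpp code: `SlotRouter`, `ModelSlot`, and the `name=path` spec format are hypothetical and only mirror the proposed `--model-slot` flag.

```typescript
// Hypothetical slot table: the names mirror the proposed --model-slot flags.
interface ModelSlot {
  index: number;
  name: string;
  path: string;
}

class SlotRouter {
  private slots = new Map<string, ModelSlot>();

  // Registers a slot from a "name=path" spec, as in the proposed CLI flag.
  addSlot(spec: string): ModelSlot {
    const eq = spec.indexOf("=");
    if (eq <= 0 || eq === spec.length - 1) {
      throw new Error(`invalid slot spec: ${spec}`);
    }
    const name = spec.slice(0, eq);
    if (this.slots.has(name)) {
      throw new Error(`duplicate slot name: ${name}`);
    }
    const slot: ModelSlot = {
      index: this.slots.size, // slots are numbered in registration order
      name,
      path: spec.slice(eq + 1),
    };
    this.slots.set(name, slot);
    return slot;
  }

  // Resolves the "model" field of a request body to a registered slot.
  route(request: { model: string }): ModelSlot {
    const slot = this.slots.get(request.model);
    if (slot === undefined) {
      throw new Error(`unknown model: ${request.model}`);
    }
    return slot;
  }
}
```

Each slot would still own its context; the router only decides which slot a request is handed to, so the HTTP server and port stay shared.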
Why this matters beyond my use case
- Agent orchestration (browser-use, CrewAI) often dispatches subtasks to different models
- RAG pipelines: embedding model + generation model running side by side
- Speculative decoding already loads a draft model; this is the natural generalization of that concept
I understand the memory constraints — this isn't about forcing everyone to load 10 models. It's about giving developers who have the memory the ability to use it productively.
If this is out of scope for the server example, even a lighter approach would help: an API to query "what model is currently loaded" and a "warm-up" endpoint that accepts a model URI and starts loading in the background, returning a promise/future that the client can poll. This would at least allow workflow engines to pre-load the next model while the current one is still generating.
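The lighter approach could be sketched as a small job registry on the server side. Everything here is an assumption for illustration: `WarmupRegistry`, the three states, and the idea that one endpoint calls `start` while another calls `status` are not existing llama.cpp APIs, and the background loader itself is not shown.

```typescript
type WarmupState = "loading" | "ready" | "failed";

interface WarmupJob {
  id: number;
  modelUri: string;
  state: WarmupState;
}

// Hypothetical in-memory registry a server could keep behind a warm-up endpoint.
class WarmupRegistry {
  private jobs = new Map<number, WarmupJob>();
  private nextId = 1;

  // Would back the warm-up request: records the job and returns immediately.
  start(modelUri: string): WarmupJob {
    const job: WarmupJob = { id: this.nextId++, modelUri, state: "loading" };
    this.jobs.set(job.id, job);
    return { ...job }; // hand out a snapshot, not the internal record
  }

  // Called by the (not shown) background loader when loading ends.
  finish(id: number, ok: boolean): void {
    const job = this.jobs.get(id);
    if (job === undefined) throw new Error(`unknown job: ${id}`);
    if (job.state !== "loading") throw new Error(`job ${id} already finished`);
    job.state = ok ? "ready" : "failed";
  }

  // Would back the polling request: returns a snapshot of the job, if any.
  status(id: number): WarmupJob | undefined {
    const job = this.jobs.get(id);
    return job === undefined ? undefined : { ...job };
  }
}
```

A workflow engine would call the warm-up request while the current node is still generating, keep the returned `id`, and poll the status until it reads `"ready"` before dispatching the next node.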
Happy to discuss further or contribute if there's interest.