llama-server Shared Alias for multiple instances of the same model in Router Mode #22823

jaycenornin · 2026-05-07T21:01:26Z


I have two 16GB GPUs and a Q4 of Gemma 4 just barely fits on one with a tiny context. Any usable context overflows, and, for reasons I don't fully understand, it performs far better if I --override-tensor to my CPU than if I offload the exact same tensors to the second GPU. What's more, if I let llama.cpp split it between GPUs automatically, my token performance tanks with just one parallel slot going. (I'm still fairly naive to all of this, only been at it a few weeks, so I'm sure I'm ignorant of a lot. But I'm not here to troubleshoot performance on this post.)

To get the best performance with parallelism, it stands to reason that I should run two instances of the model, one on each GPU. This keeps compute from crossing the PCIe bus between GPUs, so -np on each instance performs better, while increasing my total compute budget to two GPUs.

Just one problem: Right now llama-server will treat each instance as its own separate model with its own separate id and alias.
This means I have to configure my clients (e.g. Pi) with two distinct model ids, and any given client or agent will always use the same model instance even if that instance is overloaded and the other instance is sitting idle.

It would be awesome if two instances of the same model could share an alias and llama-server could intelligently route between them.

In my head, the implementation of this feature would look like this:
model_presets.ini

[gemma-4_GPU0_-26B-A4B-it-Claude-Opus-Distill.q4_k_m]
model = D:\Models\teichai_gemma-4-26B-A4B-it-Claude-Opus-Distill_v2Updated.q4_k_m.gguf
device = Vulkan0
alias = gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m

[gemma-4_GPU1_-26B-A4B-it-Claude-Opus-Distill.q4_k_m]
model = D:\Models\teichai_gemma-4-26B-A4B-it-Claude-Opus-Distill_v2Updated.q4_k_m.gguf
device = Vulkan1
alias = gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m

By specifying the same alias for both instances, I can configure my clients with a single model id "gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m".

When llama-server receives a request for model/alias "gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m", it:

1. looks up the alias and finds every model instance using that alias
2. checks if the request is already associated with one of those instances (e.g. keep a conversation on the same instance if possible)
3. checks the current load of each instance
4. proxies the request to either its already-associated instance or to the least-busy instance if not already associated
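
The routing steps above can be sketched roughly as follows. This is only an illustration in Python, not llama-server code: the `Instance` and `AliasRouter` names, the `busy_slots` counter, and the idea of a `conversation_id` key for stickiness are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Instance:
    """One running model instance (hypothetical structure)."""
    name: str            # e.g. the preset section name
    alias: str           # shared alias that clients request
    busy_slots: int = 0  # number of slots currently processing a request


class AliasRouter:
    """Picks an instance for a request addressed to a shared alias."""

    def __init__(self, instances: List[Instance]) -> None:
        # Step 1: index every instance by its alias.
        self.by_alias: Dict[str, List[Instance]] = {}
        for inst in instances:
            self.by_alias.setdefault(inst.alias, []).append(inst)
        # (alias, conversation id) -> name of the instance that served it.
        self.sticky: Dict[Tuple[str, str], str] = {}

    def pick(self, alias: str, conversation_id: Optional[str] = None) -> Instance:
        candidates = self.by_alias.get(alias)
        if not candidates:
            raise KeyError(f"no instance registered for alias {alias!r}")

        # Step 2: reuse the instance already associated with this conversation.
        if conversation_id is not None:
            previous = self.sticky.get((alias, conversation_id))
            for inst in candidates:
                if inst.name == previous:
                    return inst

        # Steps 3 and 4: otherwise route to the least-busy instance.
        chosen = min(candidates, key=lambda inst: inst.busy_slots)
        if conversation_id is not None:
            self.sticky[(alias, conversation_id)] = chosen.name
        return chosen
```

A real implementation would also need some way to identify a conversation (the sketch just takes an opaque id) and to expire stale associations.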

Requests for /metrics?model=gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m would return a JSON array of the metrics from each running instance, and likewise for /props etc. A request for an alias when no model with that alias is loaded would autoload only the first instance. llama-server would then autoload the next instance whenever a request comes in for the alias and all of the already-loaded instances are busy.
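
To make the autoload and aggregation behavior concrete, here is a small Python sketch under some assumptions: "busy" is taken to mean that every slot on an instance is in use, and the `Instance` fields (`loaded`, `busy_slots`, `total_slots`, `metrics`) are hypothetical names, not anything llama-server exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    """One configured instance behind a shared alias (hypothetical structure)."""
    name: str
    loaded: bool = False
    busy_slots: int = 0
    total_slots: int = 1
    metrics: Dict[str, float] = field(default_factory=dict)


def next_to_load(instances: List[Instance]) -> Optional[Instance]:
    """Return the instance to autoload for an incoming request, or None.

    Nothing loaded: load the first configured instance.
    Some loaded instance has a free slot: load nothing.
    All loaded instances are full: load the next unloaded one, if any.
    """
    loaded = [inst for inst in instances if inst.loaded]
    if any(inst.busy_slots < inst.total_slots for inst in loaded):
        return None
    for inst in instances:
        if not inst.loaded:
            return inst
    return None  # everything is already loaded and busy


def aggregate_metrics(instances: List[Instance]) -> List[dict]:
    """Build the array returned for a per-alias metrics request."""
    return [
        {"instance": inst.name, "metrics": inst.metrics}
        for inst in instances
        if inst.loaded
    ]
```

When `next_to_load` returns None because every instance is loaded and full, the request would simply have to queue on one of them.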

My two-bit technobrain tells me this shouldn't be extremely difficult to implement, but there are probably a bazillion edge cases that I'm not thinking of where it would break.

And sure, the "correct answer" is just to "get a bigger GPU". If anyone wants to donate a 5090 to my cause I won't say no :P but I still see a lot of value in having this kind of load balancing built into llama-server:

- Multiple instances of the same model (for increased parallel workflows like multi-agent) are abstracted from client settings, which makes client setup easier
- You could potentially alias two different models if you wanted to
- You could intelligently prioritize instances to optimize available resources (e.g. use instance1 on GPU first, use instance2 on CPU or instance3 on RPC only when needed, use instance4 with a different, smaller model as a last resort)

Prioritizing aliased instances might need a second new parameter, such as --alias-priority, to specify a load balancing weight, but honestly I'd be happy even without that.
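
One way the prioritization could work, sketched in Python. Note that --alias-priority does not exist today; it is only the parameter proposed above, and this sketch assumes it would be an integer where a lower number means "use this instance first". The `Instance` fields are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """One instance behind a shared alias (hypothetical structure)."""
    name: str
    priority: int     # hypothetical --alias-priority value; lower is preferred
    busy_slots: int
    total_slots: int


def pick_by_priority(instances: List[Instance]) -> Instance:
    """Prefer the highest-priority instance that still has a free slot.

    If every instance is full, fall back to the one with the lowest
    fraction of busy slots so the request queues where it is least crowded.
    """
    if not instances:
        raise ValueError("no instances configured for this alias")
    free = [inst for inst in instances if inst.busy_slots < inst.total_slots]
    if free:
        return min(free, key=lambda inst: inst.priority)
    return min(instances, key=lambda inst: inst.busy_slots / max(inst.total_slots, 1))
```

With this ordering, the GPU instance is used until its slots fill up, and only then does traffic spill over to the CPU, RPC, or smaller-model instances.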

yazon · 2026-06-12T18:25:54Z


FlexLLama is exactly what you're looking for: https://github.com/yazon/flexllama
