llama-server Shared Alias for multiple instances of the same model in Router Mode #22823
jaycenornin
started this conversation in
Ideas
Replies: 1 comment
FlexLLama is exactly what you're looking for: https://github.com/yazon/flexllama
I have two 16GB GPUs, and a Q4 of Gemma 4 just barely fits on one with a tiny context. Any usable context overflows, and, for reasons I don't fully understand, it performs far better if I `--override-tensor` to my CPU than if I offload the exact same tensors to the second GPU. What's more, if I let llama.cpp split it between GPUs automatically, my token performance tanks with just one parallel slot going. (I'm still fairly naive to all of this, only been at it a few weeks, so I'm sure I'm ignorant of a lot. But I'm not here to troubleshoot performance on this post.)
To get the best performance, with parallelism, it stands to reason that I should run two instances of the model, one on each GPU. This keeps compute from crossing the PCIe bus between GPUs, so `-np` on one model performs better, while increasing my total compute budget to two GPUs.
Just one problem: Right now llama-server will treat each instance as its own separate model with its own separate id and alias.
This means I have to configure my clients (e.g. Pi) with two distinct model ids, and any given client or agent will always use the same model instance even if that instance is overloaded and the other instance is sitting idle.
It would be awesome if two instances of the same model could share an alias and llama-server could intelligently route between them.
In my head, the implementation of this feature would look like this:
model_presets.ini
By specifying the same alias for both instances, I can configure my clients with a single model id "gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m".
When llama-server receives a request for model/alias "gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m", it:
Requests for `/metrics?model=gemma-4-26B-A4B-it-Claude-Opus-Distill.q4_k_m` would return a JSON array of all metrics from each running instance. Likewise for `/props` etc. Requests for an alias when no model with that alias is loaded would `autoload` the first instance only. llama-server would `autoload` the next instance whenever a request comes in for the alias and the already loaded instances are busy.
My two-bit technobrain tells me this shouldn't be extremely difficult to implement, but there are probably a bazillion edge cases that I'm not thinking of where it would break.
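As a rough illustration of the routing described above, here is a minimal Python sketch. It is not llama-server code: `Instance`, `route`, and `busy_slots` are hypothetical names, "loading" is just a flag flip, and two choices are assumptions that the post does not specify (preferring the instance with the most free slots, and queueing on the least-loaded instance once everything is loaded and busy).

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One model instance behind a shared alias (hypothetical structure)."""
    model_id: str
    slots: int             # parallel slots, i.e. the -np value for this instance
    loaded: bool = False
    busy_slots: int = 0

    def has_free_slot(self) -> bool:
        return self.loaded and self.busy_slots < self.slots


def route(instances: list[Instance]) -> Instance:
    """Pick an instance for one request addressed to the shared alias."""
    free = [i for i in instances if i.has_free_slot()]
    if free:
        # Assumption: prefer the loaded instance with the most free slots.
        chosen = max(free, key=lambda i: i.slots - i.busy_slots)
    else:
        unloaded = [i for i in instances if not i.loaded]
        if unloaded:
            # Nothing loaded yet, or all loaded instances are busy:
            # autoload the next instance in configuration order.
            chosen = unloaded[0]
            chosen.loaded = True  # stand-in for actually starting the instance
        else:
            # Assumption: everything is loaded and busy, so queue on the
            # instance with the fewest requests in flight.
            chosen = min(instances, key=lambda i: i.busy_slots)
    chosen.busy_slots += 1
    return chosen


# Two single-slot instances, one per GPU, sharing one alias.
pool = [Instance("gpu0", slots=1), Instance("gpu1", slots=1)]
first = route(pool)    # nothing loaded yet, so the first instance is autoloaded
second = route(pool)   # first instance is busy, so the second is autoloaded
```

When a request finishes, the caller would decrement `busy_slots` on the instance that served it, so that later requests can land there again.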
And sure, the "correct answer" is just to "get a bigger GPU". If anyone wants to donate a 5090 to my cause I won't say no :P but I still see a lot of value in this kind of load balancing being built into llama-server:
Prioritizing aliased instances might need a second new parameter, such as `--alias-priority`, to specify a load balancing weight, but honestly I'd be happy even without that.
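One possible reading of such a weight is a weighted least-busy choice, sketched below in Python. This is only an assumption about how a proposed `--alias-priority` value might be used; `pick_weighted` and its input shape are hypothetical and nothing here exists in llama-server.

```python
def pick_weighted(candidates: dict[str, tuple[int, float]]) -> str:
    """Return the instance id with the lowest busy_slots / weight ratio.

    candidates maps instance id -> (busy_slots, weight), with weight > 0.
    Ties go to the instance with the higher weight.
    """
    return min(
        candidates,
        key=lambda k: (candidates[k][0] / candidates[k][1], -candidates[k][1]),
    )


# gpu0 carries twice the weight, so with equal load it is still preferred.
choice = pick_weighted({"gpu0": (1, 2.0), "gpu1": (1, 1.0)})
```

With equal weights this reduces to picking the instance with the fewest requests in flight.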