Sleep/Wake on llama.cpp #524
Replies: 2 comments 2 replies
Hi, llama-swap manages starting and stopping things at the process level. This helps with swap reliability, since all memory is freed and reset when a server is swapped in and out. If you have a lot of RAM, I've found that once the weights are in the kernel's block cache they load into VRAM at about the speed of my PCIe bus, roughly 9 GB/s on my DDR4/2333 system. Do you have a link to how llama.cpp is doing it? I'm curious how they manage it so quickly, since just loading the weights back into VRAM takes longer than a second (depending on the model size).
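To make the process-level approach concrete, here is a minimal sketch of the swap-on-demand idea described above. This is not llama-swap's actual code; the class and method names are hypothetical, and real servers like `llama-server` would be launched instead of the placeholder commands. The point is that swapping terminates the old process entirely, so all of its VRAM and RAM is released before the next model starts:

```python
import subprocess

class ProcessSwapper:
    """Toy sketch of process-level model swapping (llama-swap style).

    Only one server process is alive at a time. Swapping sends SIGTERM
    to the old process and waits for it to exit, so ALL of its memory
    (VRAM included) is freed before the new server starts.
    """

    def __init__(self, commands):
        # commands: model name -> argv list (hypothetical entries)
        self.commands = commands
        self.active = None
        self.proc = None

    def swap_to(self, name):
        if self.active == name:
            return self.proc          # already running, nothing to do
        if self.proc is not None:
            self.proc.terminate()     # SIGTERM: full teardown
            self.proc.wait()          # block until memory is released
        self.proc = subprocess.Popen(self.commands[name])
        self.active = name
        return self.proc
```

The trade-off the thread is discussing follows directly from this design: the cold restart guarantees a clean slate, but it also throws away compiled graphs and caches, which is what a sleep mode would preserve.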
They keep part of it persistent; I'm not sure exactly how, but I know there's a small static VRAM allocation for each active model, and then it loads the weights back, fully compiled and ready to serve. I don't know the backend well enough, but from my usage, when it "sleeps" it can be back to a functional state in under a second. If you use their older method, which is basically a SIGTERM, it takes minutes versus what I get with the current llama-swap method. I'm sure you could tie into their method, but from what I can tell it relies on the new router mode to put a model to sleep with the quick return. Sorry, I'm not much help, but your user base could get a huge boost in time to first token when a model is swapped out if you hooked into their work in this area. Check out their docs on sleep and router mode; it's still in beta, but it's slick for my setup, where I have an A100 and an A30 but frequently switch models.
Hi All,
Investigating llama.cpp: they added router mode recently, but it's still quite buggy. They also, however, introduced a feature that I'm not sure is in llama-swap, and I'd like some insight. There's a sleep-mode trigger now. Instead of killing the entire engine, it puts a model in the router pool to sleep (offloading it to RAM, including the graphs, caches, etc.). So it does drop out of VRAM, yet when it's reloaded its time to first token is under a second. This is huge for me, since I run many workflows that require specific models and don't always want them in VRAM, but I also have a ton of RAM on my cluster.
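As a conceptual sketch only (this is not llama.cpp's or vLLM's actual API; all names here are made up for illustration), the difference between sleep/wake and a full restart is which startup work survives: weights move from VRAM to host RAM, while compiled graphs and other one-time setup stay in place, so waking is just a copy back over PCIe:

```python
class SleepableModel:
    """Toy illustration of sleep/wake vs. a full cold start.

    'vram' and 'ram' are plain dicts standing in for device and host
    memory. A cold start must reload weights AND redo graph compilation;
    sleep/wake only moves the weights, keeping the compiled state.
    """

    def __init__(self, weights: bytes):
        self._disk = weights          # the on-disk model file
        self.vram = {}
        self.ram = {}
        self.graph_compiled = False

    def cold_start(self):
        # Slow path: read weights from disk and compile graphs.
        self.vram["weights"] = self._disk
        self.graph_compiled = True

    def sleep(self):
        # Weights move to host RAM; VRAM is freed for another model,
        # but the compiled-graph state is kept.
        self.ram["weights"] = self.vram.pop("weights")

    def wake(self):
        # Fast path: copy weights back over PCIe; no recompilation.
        self.vram["weights"] = self.ram.pop("weights")
```

Under this model, wake time is roughly (weights size) / (PCIe or RAM bandwidth), which matches the sub-second reload described above, versus minutes for a cold start that redoes everything.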
So, questions:
As an aside, there's a sleep mode in vLLM as well; is that implemented? I believe it operates in a similar manner, saving the computationally expensive parts of startup off to RAM (or perhaps to disk as well, as this is also supposed to land in llama.cpp but doesn't quite work yet).
Thanks in advance, I appreciate any feedback; let me know if this needs a feature request and I'll be happy to file one.