Sleep/Wake on llama.cpp #524
Replies: 2 comments 2 replies
Hi, llama-swap manages starting and stopping things at the process level. This helps with swap reliability, since all memory is freed and reset when a server is swapped in and out. If you have a lot of RAM, I've found that once the weights are in the kernel's block cache they load into VRAM at about the speed of my PCIe bus, roughly 9 GB/s on my DDR4/2333 system. Do you have a link to how llama.cpp is doing it? I'm curious how they manage it so quickly, since just loading the weights back into VRAM takes longer than a second (depending on the model size).
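To make the process-level approach concrete, here is a minimal sketch of the swap-on-demand idea described above. This is not llama-swap's actual code; the class and method names are hypothetical, and real servers like `llama-server` would be launched instead of the placeholder commands. The point is that swapping terminates the old process entirely, so all of its VRAM and RAM is released before the next model starts:

```python
import subprocess

class ProcessSwapper:
    """Toy sketch of process-level model swapping (llama-swap style).

    Only one server process is alive at a time. Swapping sends SIGTERM
    to the old process and waits for it to exit, so ALL of its memory
    (VRAM included) is freed before the new server starts.
    """

    def __init__(self, commands):
        # commands: model name -> argv list (hypothetical entries)
        self.commands = commands
        self.active = None
        self.proc = None

    def swap_to(self, name):
        if self.active == name:
            return self.proc          # already running, nothing to do
        if self.proc is not None:
            self.proc.terminate()     # SIGTERM: full teardown
            self.proc.wait()          # block until memory is released
        self.proc = subprocess.Popen(self.commands[name])
        self.active = name
        return self.proc
```

The trade-off the thread is discussing follows directly from this design: the cold restart guarantees a clean slate, but it also throws away compiled graphs and caches, which is what a sleep mode would preserve.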
They keep part of it persistent; I'm not sure exactly how, but I know there's a small static VRAM allocation for each active model, and then it loads the weights back, fully compiled and ready to serve. I don't know the backend well enough, but from my usage, when it "sleeps" it can be back to a functional state in under a second. If you use their older method, which is basically a SIGTERM, it takes minutes versus what I get with the current llama-swap method. I'm sure you could tie into their method, but from what I can tell it relies on the new router mode to put a model to sleep with the quick return. Sorry, I'm not much help, but your user base could get a huge boost in time to first token when a model is swapped out if you hooked into their work in this area. Check out their docs on sleep and router mode; it's still in beta, but it's slick for my setup, where I have an A100 and an A30 but frequently switch models.
Hi All,
Investigating llama.cpp: they added router mode recently, but it's still quite buggy. They also, however, introduced a feature that I'm not sure is in llama-swap, and I'd like some insight. There's a sleep-mode trigger now. Instead of killing the entire engine, it puts a model in the router pool to sleep (offloading it to RAM, including the graphs, caches, etc.). So it does drop out of VRAM, yet when it's reloaded its time to first token is under a second. This is huge for me, since I run many workflows that require specific models and don't always want them in VRAM, but I also have a ton of RAM on my cluster.
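As a conceptual sketch only (this is not llama.cpp's or vLLM's actual API; all names here are made up for illustration), the difference between sleep/wake and a full restart is which startup work survives: weights move from VRAM to host RAM, while compiled graphs and other one-time setup stay in place, so waking is just a copy back over PCIe:

```python
class SleepableModel:
    """Toy illustration of sleep/wake vs. a full cold start.

    'vram' and 'ram' are plain dicts standing in for device and host
    memory. A cold start must reload weights AND redo graph compilation;
    sleep/wake only moves the weights, keeping the compiled state.
    """

    def __init__(self, weights: bytes):
        self._disk = weights          # the on-disk model file
        self.vram = {}
        self.ram = {}
        self.graph_compiled = False

    def cold_start(self):
        # Slow path: read weights from disk and compile graphs.
        self.vram["weights"] = self._disk
        self.graph_compiled = True

    def sleep(self):
        # Weights move to host RAM; VRAM is freed for another model,
        # but the compiled-graph state is kept.
        self.ram["weights"] = self.vram.pop("weights")

    def wake(self):
        # Fast path: copy weights back over PCIe; no recompilation.
        self.vram["weights"] = self.ram.pop("weights")
```

Under this model, wake time is roughly (weights size) / (PCIe or RAM bandwidth), which matches the sub-second reload described above, versus minutes for a cold start that redoes everything.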
So, questions:
As an aside, there's a sleep mode in vLLM as well; is that implemented? I believe it operates in a similar manner, saving the computationally expensive parts of startup off to RAM (or perhaps to disk as well, as this is also supposed to land in llama.cpp but doesn't quite work yet).
Thanks in advance, I appreciate any feedback; let me know if this needs a feature request and I'll be happy to file one.