feat: smart federation routing (load + VRAM + model + prefix affinity + cross-server sync) and unified Cluster UI by localai-bot · Pull Request #10124 · mudler/LocalAI

localai-bot · 2026-06-01T08:08:22Z

Adds smart, prefix-cache-aware routing to the p2p federation server, sharing the routing policy with the NATS distributed mode, plus a unified Cluster UI. Builds on #10123 (the pkg/clusterrouting extraction).

What

The p2p federation server changes from a blind round-robin TCP proxy into a model-aware, prefix-cache-aware, load- and VRAM-aware L7 router. It uses the same pkg/clusterrouting policy as the NATS side and reuses the existing core/services/nodes/prefixcache engine (extractor, radix index, and Sync) instead of reimplementing any of it. The two distributed UIs are folded into a single Cluster page.

How

Stage 1 - load + VRAM-aware (L4): each node gossips its free GPU VRAM (schema.NodeData.AvailableVRAM, from xsysinfo), and the server selects a peer via clusterrouting.PickBestReplica (fewest in-flight requests, then most free VRAM). Adds a clock-injectable NodeData.IsOnlineAt.

Stage 2 - L7 + model + prefix affinity: convert the proxy into an L7, HTTP-terminating proxy. It buffers the request body up to a configurable cap (--upload-limit; larger requests are rejected with 413), chooses a peer, and streams the response so SSE keeps flowing. Websocket upgrades bypass this path and use a duplex copy, and non-duplex forwards force Connection: close to avoid a keep-alive leak. Model-aware: peers gossip their model set (NodeData.Models), and candidates are filtered to peers serving the requested model; an empty set counts as eligible, so mixed-version swarms aren't starved. Prefix affinity: reuse prefixcache.ExtractChain/Index to find the warm peer for a prompt prefix, then make the final pick with a new load-guarded clusterrouting.PickWithAffinity (which keeps the VRAM tier). Each served request is then observed into the index.

Stage 3 - cross-server affinity sync (opt-in): with --affinity-sync, federation servers share affinity by reusing prefixcache.Sync over a new genericChannelPublisher on edgevpn's generic channel (node.EnableGenericHub + a generic-channel handler; only sync-enabled servers join, so workers/members see no sync traffic). It uses the same observe/invalidate event structs as the NATS path. This stage also adds the eviction ticker the federation index was missing, which prevents unbounded tree growth whether or not sync is enabled.

Stage 4 - unified Cluster UI: fold /app/p2p ("swarm") and /app/nodes ("nodes") into one /app/cluster page with a shared summary and two collapsible, capability-gated sections (Distributed / Swarm). The two large pages are reused verbatim via an embedded prop (no rewrite), and a new useP2PMode hook mirrors useDistributedMode. The old routes redirect, the sidebar has a single entry, and i18n covers 5 languages. No backend, capabilities.js, or Go-embed changes.

Notes / scope

Federation was an L4 proxy, which cannot see which model a request targets, so model-awareness genuinely required the L7 conversion (stage 2, not stage 1).
No duplication: prefixcache, pkg/radixtree, the messaging events, and both UI pages already existed and are reused; the only new selection primitive is PickWithAffinity (needed to retain the VRAM tier, which prefixcache.Select lacks).
Backward-compatible gossip: NodeData is JSON-marshaled, so old peers simply omit the new fields, which decode as zero/nil: an empty model set is treated as eligible, and zero VRAM places the peer in the lowest VRAM tier.
--affinity-sync is off by default (a single federation server is the common case); enable it on every server that should share affinity.