Skip to content

[SDKs] Self-heal catalog cache on BYOM cache-miss#760

Open
bmehta001 wants to merge 3 commits into
mainfrom
bhamehta/byom-sdk-self-heal
Open

[SDKs] Self-heal catalog cache on BYOM cache-miss#760
bmehta001 wants to merge 3 commits into
mainfrom
bhamehta/byom-sdk-self-heal

Conversation

@bmehta001
Copy link
Copy Markdown
Contributor

The SDKs cache the catalog in memory with a 6-hour TTL. When a user manually drops a model directory (BYOM) into the cache while the SDK is running, the SDK's in-memory map is stale: Core returns the BYOM id from get_cached_models IPC and get_model_list, but the SDK filters it out because its modelIdToModelVariant map does not yet know the id.

Add a force-refresh path on cache-miss in the four user-facing entry points:

  • get_cached_models / get_loaded_models: when any id in the fresh list returned by Core does not resolve against the local map, force a catalog refresh and re-resolve only the unresolved ids.
  • get_model(alias) / get_model_variant(id): on miss, force a refresh and retry the lookup once.

Added calls to await UpdateModels(ct).ConfigureAwait(false); in the C# SDK for consistency with the other SDKs, though it is not completely necessary for the BYOM feature

Validated against a real BYOM directory dropped into the cache at runtime

The v1 SDKs (js, python, cs, rust) cache the catalog in memory with a
6-hour TTL. When a user manually drops a model directory (BYOM) into the
cache while the SDK is running, the SDK's in-memory map is stale: Core
returns the BYOM id from `get_cached_models` IPC and `get_model_list`,
but the SDK filters it out because its `modelIdToModelVariant` map does
not yet know the id.

Add a force-refresh path on cache-miss in the four user-facing entry
points:
- get_cached_models / get_loaded_models: when any id in the fresh list
  returned by Core does not resolve against the local map, force a
  catalog refresh and re-resolve only the unresolved ids.
- get_model(alias) / get_model_variant(id): on miss, force a refresh
  and retry the lookup once.

No new public APIs and no new IPC commands. Relies on the FLCore
`GetModelsAsync` override that surfaces BYOM stubs.

Validated end-to-end via the Python SDK against a real BYOM directory
dropped into the cache at runtime: get_cached_models now returns the
BYOM, get_model resolves the alias, and Core's view is consistent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 3, 2026 00:35
@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
foundry-local Ready Ready Preview, Comment Jun 3, 2026 2:27am

Request Review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds “self-healing” catalog refresh behavior across the Rust, Python, JS, and C# SDKs so that when the on-disk BYOM cache changes while the SDK is running (and Core returns new IDs), the SDK refreshes its in-memory catalog on cache-miss and retries resolution.

Changes:

  • Add a forced refresh + retry path to get_model(alias) / get_model_variant(id) when the initial lookup misses.
  • Add forced refresh-on-miss when resolving IDs returned by Core for get_cached_models / get_loaded_models, via new helper resolvers.
  • Add force/force: true parameters to the various “update models” caching methods to bypass the 6-hour TTL when self-healing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
sdk/rust/src/catalog.rs Adds forced-refresh retry on alias/id miss and a helper to resolve cached/loaded model IDs.
sdk/python/src/catalog.py Adds force refresh support and self-heal retry logic for lookups and cached/loaded model resolution.
sdk/js/src/catalog.ts Adds force refresh support and self-heal retry logic plus ID resolution helper for cached/loaded APIs.
sdk/cs/src/Catalog.cs Adds forced-refresh retry logic and shared ID-resolution helper; updates UpdateModels signature to support forced refresh.

Comment thread sdk/rust/src/catalog.rs Outdated
Comment thread sdk/python/src/catalog.py
Comment thread sdk/js/src/catalog.ts Outdated
Comment thread sdk/cs/src/Catalog.cs Outdated
PR #760 review feedback: the previous resolve_model_ids implementation
in all four SDKs (Rust, Python, JS, C#) split the input model_ids into
`resolved` and `unresolved` lists in a single pass, then on cache-miss
forced a refresh and appended the newly-resolvable IDs after the first
batch. This reordered the output relative to the input model_ids list,
which is a behavioral regression vs. pre-PR get_cached_models /
get_loaded_models that returned variants in Core's ID order (minus
unknowns).

Switch all four SDKs to the same shape:

  1. First pass: check whether any id is missing (short-circuit on first
     miss); this needs no scratch storage.
  2. If anything is missing, force-refresh once.
  3. Final pass: iterate model_ids in input order and collect variants
     present in the catalog (filter-map style).

The single final pass preserves input ordering on both the warm path
(no refresh) and the self-heal path (after refresh). The JS dedupe
behavior via Set<IModel> is preserved.

Side benefit: -32 net lines across the four files; the implementation
now matches the structure the reviewer suggested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment thread sdk/python/src/catalog.py
Comment thread sdk/python/src/catalog.py
Comment thread sdk/cs/src/Catalog.cs
Comment thread sdk/cs/src/Catalog.cs
Comment thread sdk/js/src/catalog.ts
Comment thread sdk/python/src/catalog.py
Addresses PR #760 review comments:

- Python get_model/get_model_variant and C# GetModelImplAsync/GetModelVariantImplAsync now short-circuit on null/empty/whitespace input, mirroring the validation already present in Rust and JS. Without this, those entry points triggered the wasteful self-heal forced refresh just to return None on bad input.

- Adds Python unit tests for self-heal on get_model and get_cached_models (mock _MockCoreInterop returns no models on warm fetch, then BYOM on forced refresh) and a no-refresh-on-empty-input test that warms the catalog and asserts the forced refresh is not triggered by '', '   ', or None.

- Adds matching JS self-heal unit tests for getModel/getModelVariant and getCachedModels using the same mock pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment thread sdk/rust/src/catalog.rs
Comment on lines +145 to +148
// Self-heal: the alias may belong to a BYOM model added to the cache
// directory after our last catalog refresh.
self.invalidator.invalidate();
self.update_models().await?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.

Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.

Comment thread sdk/rust/src/catalog.rs
Comment on lines +174 to +177
// Self-heal: the id may belong to a BYOM model added to the cache
// directory after our last catalog refresh.
self.invalidator.invalidate();
self.update_models().await?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.

Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.

Comment thread sdk/rust/src/catalog.rs
Comment on lines +217 to +220
if needs_refresh {
self.invalidator.invalidate();
self.update_models().await?;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real but narrow race: window is ~one HTTP call to the local Core, and only manifests if two concurrent callers both miss on the same just-added BYOM at nearly the same instant. Worst case is a one-shot Unknown alias error that a caller-side retry resolves immediately.

Self-heal-on-cache-miss is scoped here as best-effort recovery for the single-caller BYOM scenario, not as a strict contract for concurrent BYOM discovery. Promoting it to a coalesced refresh_now (per-request timestamp under refresh_gate) is a clean follow-up but out of scope for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants