Skip to content

Use CLOCK (second-chance) eviction in the UFFD page cache#274

Open
sjmiller609 wants to merge 5 commits into
mainfrom
hypeship/uffd-cache-eviction-exp
Open

Use CLOCK (second-chance) eviction in the UFFD page cache#274
sjmiller609 wants to merge 5 commits into
mainfrom
hypeship/uffd-cache-eviction-exp

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

@sjmiller609 sjmiller609 commented Jun 6, 2026

Summary

Switches the pager's shared page cache (lib/uffdpager/cache.go) from its current insertion-order (FIFO-like) eviction to CLOCK / second-chance, which approximates LRU without taking a write lock on the read path.

  • Each cache entry gets an atomic reference bit. Borrow (the fault hot path) and Get set it under the read lock — no list reordering — so concurrent faults on the same hot pages don't serialize behind a write lock.
  • Eviction sweeps from the back of the ring, sparing a referenced entry once (clearing its bit) and reclaiming the first unreferenced one.

Why

Continuously-faulted snapshot pages (the always-hot snapshot set) keep getting their reference bit re-set, so they stay resident, while idle/one-off snapshots age out — the retention behavior we want as the cache oversubscribes across many backing snapshots. The current FIFO behavior ignores reuse and can evict a hot page that's actively being faulted; true LRU would fix that but only by reordering a list under an exclusive lock, which would serialize simultaneous restores. CLOCK gets the recency benefit at the read-lock cost.

Bumps lib/uffdpager/VERSION since a pager runtime file changed.

Test plan

  • go test ./lib/uffdpager/ passes
  • -race clean on the cache tests (the reference bit is set by concurrent readers under RLock)
  • eviction test updated to assert second-chance semantics (referenced page survives, unreferenced is evicted)

Note

Medium Risk
Changes concurrent cache behavior on the UFFD fault hot path (Borrow); incorrect CLOCK logic could evict hot pages or hurt restore latency under memory pressure.

Overview
Replaces LRU list reordering in the UFFD page cache with CLOCK (second-chance) eviction, so hot snapshot pages are retained without taking a write lock on the fault path.

Each entry gets an atomic reference bit set on Get/Borrow hits under the shard read lock; eviction walks the ring from the back, gives referenced entries one pass (clears the bit and moves them forward), and drops the first unreferenced page until the shard is under budget. Add no longer moves entries to the front on update—it only sets the ref bit. lib/uffdpager/VERSION is bumped to 0.1.3; the eviction test now asserts referenced pages survive over untouched ones.

Reviewed by Cursor Bugbot for commit ef74e39. Bugbot is set up for automated code reviews on this repo. Configure here.

sjmiller609 and others added 5 commits June 6, 2026 21:27
Self-contained benchmark package comparing eviction policies for the
pager's shared page cache under a workload that models many VMs restoring
from a few shared backing snapshots (skewed popularity, a shared hot core
per snapshot, and a one-shot scan polluter).

Policies: prod (wraps the shipping cache), fifo (current behavior), lru,
clock (second-chance), tinylfu (W-TinyLFU-lite), and ristretto. Read-path
locking reflects each policy's real requirement so concurrency is measured
faithfully. Includes a hit-rate comparison, a hot-shared-key read benchmark
across shard counts and parallelism, and a p99-latency-under-concurrency
test.

Not wired into the pager; experiment only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Models the observed production shape: large per-snapshot working set
(cache holds only ~5 snapshots) with many distinct snapshots, so the
cache is heavily oversubscribed. Adds a Bursty config to compare
sequential per-snapshot fan-out against concurrent fan-out across
snapshots — the temporal pattern that flips the recency-vs-frequency
tradeoff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Models a stable always-hot snapshot set reused across the whole timeline
alongside customer-specific snapshots that fan out in waves then go idle.
Reports hit rate split by hot vs tail so we can see which policy keeps the
hot set resident while customers churn. Adds explicit per-snapshot fork
counts and a hot/tail schedule to the generator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Borrow now marks an atomic reference bit under the read lock instead of
leaving recency untracked, and eviction sweeps from the back giving
referenced entries one reprieve before reclaiming the first unreferenced
one. This keeps continuously-faulted snapshot pages resident while idle
snapshots fall out, without taking a write lock on the read path, so
concurrent faults on hot pages don't serialize.

Bumps the pager VERSION since a runtime file changed.
Drops the throwaway lib/uffdpager/cachebench benchmark package and the
ristretto dependency it pulled in, now that CLOCK is implemented directly.
@sjmiller609 sjmiller609 changed the title Experiment: UFFD page-cache eviction policy comparison Use CLOCK (second-chance) eviction in the UFFD page cache Jun 6, 2026
@sjmiller609 sjmiller609 marked this pull request as ready for review June 6, 2026 22:19
@firetiger-agent
Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Replaces the UFFD page cache eviction policy in hypeman from LRU to CLOCK (second-chance), reducing write-lock contention on the cache hot path under concurrent page faults.

Intended effect:

  • Instance spawn success rate: baseline 100% success (zero failures in 24h pre-deploy); confirmed if it remains at 100% post-deploy
  • "failed to create instance" errors: baseline ~2,400–2,500/hr (pre-existing infra noise); confirmed if no sustained elevation above that floor
  • Hypeman node uptime: baseline kernel_hypeman_uptime_missing_total = 0 (all nodes healthy); confirmed if metric stays at zero

Risks:

  • CLOCK sweep spin under full-cache pressure — "failed to create instance" errors in API logs, alert if sustained >5,000/hr for 2+ consecutive hours
  • Hypeman node panic/restartkernel_hypeman_uptime_missing_total going non-zero, alert on any non-zero value
  • Wrong eviction choices (cache churn) — elevated "failed to create instance" rate or drop in spawn total >30% from active-hour baseline (~800K–1.1M/hr)

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@sjmiller609 sjmiller609 requested a review from rgarcia June 6, 2026 22:24
@firetiger-agent
Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Improves instance spawn throughput under concurrent workloads by switching hypeman's internal page cache from LRU to CLOCK (second-chance) eviction, reducing lock contention on the cache hit path.

Intended effect:

  • Instance spawn success rate (kernel_invocation_spawn_total): baseline 100% success=true, no failures observed in 24h pre-deploy; confirmed if success rate remains 100% and spawn volume stays within 57K–1.12M/hr.
  • "failed to create instance" errors (API logs): baseline ~2,400–2,500/hr (pre-existing background noise); confirmed if rate stays at or below this level post-deploy.

Risks:

  • CLOCK sweep spin under full cache pressure — "failed to create instance" errors in API logs; alert if sustained above 5,000/hr for 2+ consecutive hours.
  • Wrong-page eviction / cache churn — spawn success rate drop or elevated API spawn errors; alert if any success=false spawns appear or error rate doubles.
  • Hypeman node panic — because hypeman emits no uptime telemetry, panics surface only through a sustained spike in "failed to create instance" API errors combined with a drop in total spawn volume; alert on the same 5,000/hr threshold.
  • Degraded deploy — deploy duration baseline is 15–38s; alert if not completing or exceeds 120s.

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

Copy link
Copy Markdown
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed the CLOCK switch end to end — read-path change is clean and the version bump is correct. one thing i'd like confirmed (a saturation edge case) plus a couple of test-coverage gaps. nothing here blocks; leaning comment-not-changes since the main item is "is this intended" rather than a definite defect.

question — does a just-faulted page get evicted on insert under a saturated hot shard?

Add pushes new entries at the front with ref=false (lib/uffdpager/cache.go:128), but the eviction sweep also moves spared (referenced) entries to the front (lib/uffdpager/cache.go:172) — ahead of the entry just pushed. if every resident page in the shard is referenced, the sweep relocates all of them in front of the new page, the new page ends up at the back, and (bit still false) gets evicted on that same Add. trace, 2-page shard, p1/p2 both hot:

Add(p3): ring [p3(0), p1(1), p2(1)]  bytes 3 > 2
  back=p2(1) → clear, MoveToFront → [p2(0), p3(0), p1(1)]
  back=p1(1) → clear, MoveToFront → [p1(0), p2(0), p3(0)]
  back=p3(0) → EVICT p3

so the page just read from backing (lib/uffdpager/server_faults_linux.go:197) is dropped, and the next fault for it misses → re-reads from disk → re-adds → evicted again. it only triggers when the shard is fully referenced (if any resident entry is unreferenced, that one is the victim and the new page survives) — but the "continuously-faulted hot set that saturates a shard" workload this PR targets is exactly when that can happen, and a newly-hot page can't get a foothold.

conventional CLOCK fix is to insert with the bit set (PushFront then ref.Store(true), mirroring the update path) so a new page gets one grace sweep. costs one extra sweep before a genuine one-off ages out. your call — but if the current behavior is intentional it's worth a comment, because it reads as accidental.

question (design) — sampled LRU considered?

agree the list-LRU read-side write lock is genuinely bad here — shared-snapshot restores fault identical keys concurrently, so they'd all serialize on one shard lock — so CLOCK is a reasonable pick. did you weigh redis-style sampled LRU (atomic last-access counter set under the read lock, sample-K-and-evict-oldest)? closer to true-LRU semantics and also avoids the read-side write lock. not advocating a rewrite, just curious whether it was on the table.

nit

  • lib/uffdpager/cache.go:123 vs :128 — the ref-bit asymmetry (update path sets true, insert leaves false) is the root of the question above and isn't documented anywhere. one line on the PushFront would save the next reader.

test coverage

  • the -race claim in the PR body rests on concurrent readers, but all three cache tests are single-goroutine, so -race never exercises a concurrent ref.Store. a small goroutine fan-out hammering Borrow on one key would make that claim real — the atomic on ref exists specifically for that path.
  • TestPageCacheSecondChanceEvictsUnreferenced covers "one spared + one unreferenced victim" but not the fully-referenced sweep (all bits set → full clear pass → eviction), which is both the degenerate CLOCK path and the one in the question above. worth a test pinning whatever behavior you settle on.

otherwise lgtm — lookup collapsing to a single RLock path is a nice simplification, and the VERSION bump is required (it's go:embed'd and gates the versioned systemd unit), not cosmetic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants