fix(cef): shared-host burst-render + full robustness hardening (audit + re-audit) #6

Merged
wenkaifan0720 merged 3 commits into main from fix/cef-robustness-hardening
Jun 23, 2026

@wenkaifan0720

Collaborator

Makes the multi-browser-per-host (shared-profile) path production-robust. Three
stacked commits — the shared-host burst-render fix it builds on, the hardening, and
the re-audit follow-ups. Supersedes #5 (its commit is the base of this branch).

1. Shared-host burst-render (foundation — was #5)

Pace per-host browser creates + retry targetId resolution, so a burst of tiles on one
shared host renders instead of leaving later tiles blank.

2. Robustness hardening — all 3 CRITICAL + 9 HIGH from a 6-agent audit

  • C1 never-painted recovery: first-present watchdog → kOpInvalidate re-kick →
    paintStalled to Dart; cef_host Invalidate() on load-end.
  • C2 ad-hoc named-profile refusal re-homes every session onto ephemeral hosts
    (was: shut the shared host down, stranding siblings).
  • C3 SendFrame snapshots the fd under the write mutex; exchange(-1)-before-close.
  • H1 reader join gated on readerStarted alone; leak-not-close fd on join timeout.
  • H2 each CdpRelay actively issues Target.attachToTarget, learns its session
    from its own demuxed response, synthesizes one attachedToTarget, stops forwarding
    browser-level setAutoAttach (fixes "No page found" + a cross-tile control leak).
  • H3 async CreateBrowser bound in OnAfterCreated → kOpCreated → completion-driven
    pacer (with timeout backstop), replacing the 0.18s magic constant.
  • H4 atomic geometry snapshot in create + resize publish.
  • H5 single-owner pid reaping.
  • H6 clear pacer state on death/shutdown.
  • H7 kOpCreateFailed → drop just that session (processGone).
  • H8 browserId-wrap is a hard precondition (was a debug assert).
  • H9 malformed IPC frame logged before host teardown (native + Swift).
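
To illustrate H8, here is a minimal sketch of id allocation where the wrap check runs in every build. `SlotTable`, `kReservedId`, the 8-bit id width, and the choice of 0 as the reserved id are all assumptions made for the example, not details taken from cef_host.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical slot table keyed by a wrapping 8-bit browser id.
// Assumption: id 0 is reserved; the real cef_host layout may differ.
class SlotTable {
 public:
  static constexpr std::uint8_t kReservedId = 0;

  // Returns a free, non-reserved id, or nullopt if every slot is live.
  // The check is ordinary control flow, so it also holds in release builds.
  std::optional<std::uint8_t> Allocate() {
    for (int tries = 0; tries < 256; ++tries) {
      const std::uint8_t id = next_++;  // wraps 255 -> 0
      if (id != kReservedId && !live_[id]) {
        live_[id] = true;
        return id;
      }
    }
    return std::nullopt;  // table full: the caller must fail the create
  }

  void Release(std::uint8_t id) {
    assert(live_[id]);  // debug-only sanity check; allocation does not rely on it
    live_[id] = false;
  }

 private:
  std::array<bool, 256> live_{};
  std::uint8_t next_ = 1;
};
```

With this shape, a wrapped counter skips a live sibling's slot instead of overwriting it, and a full table turns into an explicit create failure.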

3. Re-audit (6 lenses, every finding refuted from 3 angles → 9 confirmed, 6 rejected)

The first two items below are HIGH regressions that the H3 async-create change itself
introduced; the remaining items are MEDIUM and LOW follow-ups:

  • Orphan-browser leak on dispose-during-async-create: Slot.close_requested,
    recorded by DoDisposeBrowser and honored in OnAfterCreated.
  • CDP relay cross-client confusion: reset in-flight self-attach state on client
    detach + a per-connect client generation gating the response write + queued
    concurrent-setAutoAttach acks.
  • C2 respawn spawn-failure now processGone+disposes (was: stranded the session).
  • C1 watchdog is now visibility-aware (no spurious paintStalled for hidden/off-screen
    tiles — the normal lazy-spawn pattern).
  • Debug onCdpMessage restore no longer clobbers a chained relay fan-out.
  • Per-frame presentLock folded into the browsersLock the reader already holds.
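
The orphan-browser fix can be sketched as follows. `FakeBrowser`, `DisposeBrowser`, and `OnCreated` are hypothetical stand-ins for the CEF browser handle, DoDisposeBrowser, and OnAfterCreated, and the sketch assumes both paths run on the same thread.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a CEF browser handle; it only records the close.
struct FakeBrowser {
  bool closed = false;
  void Close() { closed = true; }
};

// A slot whose browser is bound asynchronously, after the create completes.
struct Slot {
  std::shared_ptr<FakeBrowser> browser;  // null until the create completes
  bool close_requested = false;          // a dispose arrived before the bind
};

// Dispose path: close now if bound, otherwise record the intent.
void DisposeBrowser(Slot& slot) {
  if (slot.browser) {
    slot.browser->Close();
    slot.browser.reset();
  } else {
    slot.close_requested = true;
  }
}

// Completion path: honor a dispose that raced the create, else bind.
void OnCreated(Slot& slot, std::shared_ptr<FakeBrowser> browser) {
  assert(browser != nullptr);
  if (slot.close_requested) {
    browser->Close();  // never bind a browser that nothing will close
    slot.close_requested = false;
    return;
  }
  slot.browser = std::move(browser);
}
```

Calling `DisposeBrowser` before `OnCreated` closes the browser as soon as it appears; the opposite order behaves like an ordinary dispose.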

Verification

Compiles clean: cef_host (cmake/ninja), the example macOS app (Swift+Dart), the full
firebase-extended Campus build (bundling the rebuilt cef_host), and flutter analyze.
The single-tile example renders end-to-end through the async-create path. Multi-tile
burst + concurrent agent-driving (H2) are compile-verified; full behavioral coverage
needs a live multi-tile Campus session.

🤖 Generated with Claude Code

wenkaifan0720 and others added 3 commits June 22, 2026 17:44
…red-host burst)

A burst of opCreateBrowser on one shared cef_host handed its single CEF UI
thread a pile of blocking CreateBrowserSync calls that serialized + contended
the one shared GPU/Viz accelerated-surface handshake — later browsers got no
surface and never painted (blank tile), and their targetId resolve wedged
behind the create backlog. Two Swift-side fixes (CefProfileHost):

- Per-host create pacing: createSendQueue + pumpCreateQueue send opCreateBrowser
  one at a time, spaced 0.18s, instead of all at once — each browser's create +
  surface handshake completes before the next contends the GPU process. Verified
  live: a 7-cefWebview burst on one shared host now renders all 7 (was 1/7).

- resolveTargetId retry: the fire-once 5s probe missed pages that committed late
  under burst (empty `webview snapshot`). Re-probe every 0.5s up to ~4.5s while
  still pending — a sibling of the 33858fb fix in the same targetId path.
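
A bounded re-probe of this kind might look like the sketch below. It is written in C++ as a blocking loop for brevity, whereas the real resolveTargetId is Swift and presumably asynchronous; `ResolveWithRetry` and the `probe` callback are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Polls `probe` until it yields a target id or the attempt budget runs out.
// Assumption: `probe` returns nullopt while the page has not committed yet.
std::optional<std::string> ResolveWithRetry(
    const std::function<std::optional<std::string>()>& probe,
    int max_attempts,
    std::chrono::milliseconds interval) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (auto id = probe()) return id;
    // Sleep only between attempts, not after the last one.
    if (attempt + 1 < max_attempts) std::this_thread::sleep_for(interval);
  }
  return std::nullopt;
}
```

Under these assumptions, nine attempts at a 500 ms interval would roughly correspond to the "every 0.5s up to ~4.5s" budget described above.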

Render is fixed; concurrent multi-tile agent-DRIVING still has a residual relay
issue (2nd+ tile's CDP page target not found) tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses every CRITICAL + HIGH finding from the 2026-06-22 robustness
audit of the shared-host (N browsers per profile) path. The shipped
0.18s create-pacing was a band-aid that lowered race probability without
fixing the structural causes; this replaces it with completion-driven
pacing and closes the failure-to-paint, shared-state, and teardown-race
gaps.

CRITICAL
- C1 never-painted recovery: a first-present watchdog per browser (Swift)
  re-kicks a repaint via a new kOpInvalidate op if no frame arrives in
  ~3s, then surfaces `paintStalled` to Dart (CefWebController.onPaintStalled)
  if still blank — a blank tile now self-heals or signals instead of
  staying blank forever with no breadcrumb. cef_host also Invalidate()s on
  main-frame OnLoadEnd. (Runtime software-paint *recreate* fallback left as
  a follow-up — the watchdog+signal addresses the unsignalled-blank gap and
  async create removes the GPU-starvation cause.)
- C2 per-session profile refusal: an ad-hoc host refusing its named profile
  now re-homes EVERY session on the shared host onto its own ephemeral host
  (was: a single clobbered handler recovered one tile and shut the shared
  host down, stranding all siblings), and blocks the profile so later
  creates skip the doomed host.
- C3 SendFrame write-after-close: snapshot the fd under the write mutex and
  exchange(-1)-before-close in teardown, so a paint thread can't write into
  a closed/recycled fd.
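
The C3 pattern can be sketched like this, assuming a POSIX fd and one mutex shared by the paint and teardown paths. `FrameWriter` is a hypothetical class, and EINTR and partial-failure handling are simplified.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unistd.h>

// A frame writer whose fd can be torn down from another thread.
class FrameWriter {
 public:
  explicit FrameWriter(int fd) : fd_(fd) { assert(fd >= 0); }

  // Paint thread: take the write mutex, then snapshot the fd exactly once.
  // Returns false if the writer is already closed or a write fails.
  bool SendFrame(const void* data, std::size_t len) {
    std::lock_guard<std::mutex> lock(write_mu_);
    const int fd = fd_.load();
    if (fd < 0) return false;
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
      const ssize_t n = ::write(fd, p, len);
      if (n <= 0) return false;  // simplified: no EINTR retry
      p += n;
      len -= static_cast<std::size_t>(n);
    }
    return true;
  }

  // Teardown: publish -1 first, then close the old fd under the same mutex,
  // so no in-flight SendFrame can still be writing to it.
  void Close() {
    std::lock_guard<std::mutex> lock(write_mu_);
    const int old = fd_.exchange(-1);
    if (old >= 0) ::close(old);
  }

 private:
  std::mutex write_mu_;
  std::atomic<int> fd_;
};
```

Because `exchange(-1)` returns the old value only once, a second `Close()` is a no-op, and any later `SendFrame` sees -1 rather than a recycled descriptor.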

HIGH
- H1 reader join/close UAF: gate the join on readerStarted alone (not the
  racy wasRunning) and never close the fd on a join timeout (leak it).
- H2 "No page found" + cross-tile setAutoAttach leak: each CdpRelay now
  actively issues its own Target.attachToTarget and learns its page session
  from its own demuxed response (order-independent, idempotent), synthesizes
  a single attachedToTarget to the client, and no longer forwards the
  browser-wide setAutoAttach to the shared session.
- H3 async create: CreateBrowserSync (blocked the single CEF UI thread,
  serialized + GPU-contended a burst, wedged siblings on a hung create) ->
  async CefBrowserHost::CreateBrowser bound in OnAfterCreated, which acks
  kOpCreated so the host's pacer advances by COMPLETION (with a timeout
  backstop), retiring the 0.18s magic constant.
- H4 torn geometry read: sendCreate reads (w,h,dpr,surfaceId) as one atomic
  CefWebSession.createSnapshot(); resize publishes surface + dims atomically.
- H5 dual-waitpid: single-owner pid reaping — handleHostDeath takes the pid,
  hands it back only if it can't reap it; terminateProcess takes both
  handles under the write lock.
- H6 pacer state on death: clear createSendQueue / createPacerRunning /
  createInFlight in shutdown() and handleHostDeath(); gate the pump on a
  live host.
- H7 create-failure to Dart: kOpCreateFailed -> onBrowserFailed -> the
  plugin drops just that session and emits processGone(reason:createFailed).
- H8 browserId wrap: the debug-only assert is now a hard precondition (a
  free, non-reserved slot) so a wrap can't silently overwrite a live
  sibling's slot and misroute cross-tile paint/CDP.
- H9 malformed frame: log the rejected length/opcode (native + Swift) before
  tearing the host down, so it isn't a silent all-tiles exit.
- M1 pacer recursion: trampoline the disposed-skip instead of recursing.
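
As an illustration of H3 and H6 together, here is a minimal single-threaded sketch of a completion-driven pacer with a timeout backstop. `CreatePacer` and its method names are hypothetical, time is supplied by the caller in milliseconds, and the real pacer lives in Swift (CefProfileHost).

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// Sends queued creates one at a time. The next create goes out when the
// previous one is acknowledged, or when the timeout backstop expires.
class CreatePacer {
 public:
  CreatePacer(std::function<void(int)> send, std::int64_t timeout_ms)
      : send_(std::move(send)), timeout_ms_(timeout_ms) {
    assert(timeout_ms_ > 0);
  }

  void Enqueue(int browser_id, std::int64_t now_ms) {
    queue_.push_back(browser_id);
    Pump(now_ms);
  }

  // Called when the host acks a create, whether it succeeded or failed.
  void OnCompleted(int browser_id, std::int64_t now_ms) {
    if (in_flight_ && in_flight_->first == browser_id) {
      in_flight_.reset();
      Pump(now_ms);
    }
  }

  // Called periodically: stop waiting on a create that never acks.
  void Tick(std::int64_t now_ms) {
    if (in_flight_ && now_ms - in_flight_->second >= timeout_ms_) {
      in_flight_.reset();
      Pump(now_ms);
    }
  }

  // Host death or shutdown: drop all state so nothing targets a dead host.
  void Reset() {
    queue_.clear();
    in_flight_.reset();
  }

 private:
  void Pump(std::int64_t now_ms) {
    if (in_flight_ || queue_.empty()) return;
    const int id = queue_.front();
    queue_.pop_front();
    in_flight_ = std::make_pair(id, now_ms);
    send_(id);
  }

  std::function<void(int)> send_;
  std::int64_t timeout_ms_;
  std::deque<int> queue_;
  std::optional<std::pair<int, std::int64_t>> in_flight_;
};
```

A late ack for a create that already timed out does not match the in-flight id and is ignored, so it cannot advance the queue twice.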

Compiles clean: cef_host (cmake/ninja), the example macOS app (Swift+Dart),
and `flutter analyze lib` (no issues).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oss-client + watchdog)

Adversarial re-audit of 224829d (6 lenses, every finding refuted from 3 angles)
confirmed 9 gaps; this fixes all of them. The two HIGH are regressions the H3
async-create change itself introduced.

HIGH
- Orphan-browser leak on dispose-during-async-create: with creation now async
  (CreateBrowser instead of CreateBrowserSync), a dispose landing between
  DoCreateBrowser and OnAfterCreated found
  slot->browser==null and silently no-op'd, so OnAfterCreated then bound a live
  browser nothing ever closed (leaked renderer + IOSurface until host shutdown).
  Slot now carries `close_requested`; DoDisposeBrowser records intent when the
  browser isn't bound yet and OnAfterCreated honors it (CloseBrowser immediately).
- CDP relay cross-client confusion (H2): the relay outlives a ws client, but the
  in-flight self-attach state (selfAttachPipeId / pending acks) wasn't reset on
  client detach, so a late attachToTarget response delivered a fabricated
  attachedToTarget + a stale setAutoAttach ack to the NEXT client. Reset the attach
  state on detach, stamp each client connection with a generation bumped at connect,
  and gate handleSelfAttachResponse's client writes on that generation. Also queue
  concurrent browser-level setAutoAttach acks (pendingAutoAttachClientIds list) so a
  second one before the first resolves no longer orphans an ack / leaks a pipeId.
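
The generation gate can be sketched as below. `Relay` and its methods are hypothetical simplifications of CdpRelay, and the pending-ack queue is left out so the sketch shows only the stale-response check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// A relay that outlives its clients. Each connect bumps a generation; an
// async response carries the generation it was issued under and is dropped
// if the client has changed since.
class Relay {
 public:
  using Writer = std::function<void(const std::string&)>;

  // Returns the generation assigned to the new client.
  std::uint64_t OnClientConnect(Writer writer) {
    writer_ = std::move(writer);
    attach_pending_ = false;
    return ++generation_;
  }

  void OnClientDetach() {
    writer_ = nullptr;
    attach_pending_ = false;  // forget in-flight self-attach state
  }

  // Starts a self-attach; returns the generation to stamp on the request.
  std::uint64_t BeginSelfAttach() {
    assert(writer_ && "no client connected");
    attach_pending_ = true;
    return generation_;
  }

  // Delivers the attach response only to the client that asked for it.
  bool OnSelfAttachResponse(std::uint64_t issued_gen, const std::string& msg) {
    if (!writer_ || !attach_pending_ || issued_gen != generation_) return false;
    attach_pending_ = false;
    writer_(msg);
    return true;
  }

 private:
  Writer writer_;
  bool attach_pending_ = false;
  std::uint64_t generation_ = 0;
};
```

A response issued for a client that has since detached fails the generation check, so the next client never receives an attachedToTarget it did not request.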

MEDIUM
- C2 respawnHostEphemeral: a per-session ephemeral spawn failure `continue`d,
  stranding that session on the already-shut-down host (blank tile, no signal,
  leaked session+texture). Now emits processGone(reason:respawnFailed) + disposes.
- C1 watchdog false-positive on off-screen tiles: a hidden (WasHidden) browser
  produces no frames by design, so the watchdog flagged work_canvas's normal
  lazy-spawn/viewport-culled tiles as paintStalled after ~7s. The host now peeks
  opSetVisible, suspends the watchdog while hidden, and re-arms on unhide.
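
A visibility-aware first-paint watchdog could be modeled as in the sketch below. `FirstPaintWatchdog` is a hypothetical class with caller-supplied time in milliseconds; the real watchdog is Swift-side, and its timeout values are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Flags a browser as stalled only if it has been visible for the whole
// timeout without presenting a first frame.
class FirstPaintWatchdog {
 public:
  explicit FirstPaintWatchdog(std::int64_t timeout_ms)
      : timeout_ms_(timeout_ms) {
    assert(timeout_ms_ > 0);
  }

  // Visible and unpainted: (re)start the countdown. Hidden: suspend it.
  void SetVisible(bool visible, std::int64_t now_ms) {
    if (visible && !painted_) {
      deadline_ms_ = now_ms + timeout_ms_;
    } else {
      deadline_ms_.reset();
    }
  }

  // First frame arrived: the watchdog is done for this browser.
  void OnFirstPresent() {
    painted_ = true;
    deadline_ms_.reset();
  }

  bool Stalled(std::int64_t now_ms) const {
    return deadline_ms_.has_value() && now_ms >= *deadline_ms_;
  }

 private:
  std::int64_t timeout_ms_;
  bool painted_ = false;
  std::optional<std::int64_t> deadline_ms_;
};
```

A tile that is never shown has no deadline at all, so lazy-spawned or viewport-culled tiles cannot trip the watchdog until they become visible.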

LOW
- Debug CDP-validation handler restored onCdpMessage to its captured `prior`,
  clobbering a relay fan-out chained on top (relays then received no pipe messages).
  Stop restoring — the handler short-circuits after `logged` and forwards via prior.
- First-present watchdog took presentLock on every (up to 60fps) present frame; now
  a per-session firstPresentSeen flag detects first paint under the browsersLock the
  reader already holds, so the cancel fires once per browser.

Compiles clean: cef_host (cmake/ninja), example macOS app (Swift), flutter analyze.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wenkaifan0720 wenkaifan0720 merged commit 5518f2b into main Jun 23, 2026
1 check passed