fix(cef): shared-host burst-render + full robustness hardening (audit + re-audit) #6
Merged
…red-host burst)

A burst of opCreateBrowser on one shared cef_host handed its single CEF UI thread a pile of blocking CreateBrowserSync calls that serialized and contended the one shared GPU/Viz accelerated-surface handshake. Later browsers got no surface and never painted (blank tile), and their targetId resolve wedged behind the create backlog.

Two Swift-side fixes (CefProfileHost):
- Per-host create pacing: createSendQueue + pumpCreateQueue send opCreateBrowser one at a time, spaced 0.18s, instead of all at once, so each browser's create + surface handshake completes before the next contends the GPU process. Verified live: a 7-cefWebview burst on one shared host now renders all 7 (was 1/7).
- resolveTargetId retry: the fire-once 5s probe missed pages that committed late under burst (empty `webview snapshot`). Re-probe every 0.5s up to ~4.5s while still pending; a sibling of the 33858fb fix in the same targetId path.

Render is fixed; concurrent multi-tile agent-DRIVING still has a residual relay issue (2nd+ tile's CDP page target not found) tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
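The bounded re-probe described in this commit can be illustrated with a small sketch. The real logic is Swift-side in CefProfileHost; the C++ below is only an illustration under assumptions: `ProbeResult`, `ResolveWithRetry`, and the injected `probe` / `wait` callbacks are hypothetical names, and the caller supplies the interval and attempt count (the commit message cites 0.5s spacing over roughly 4.5s).

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical result type: not taken from the real code base.
struct ProbeResult {
    std::optional<std::string> target_id;  // empty if every probe missed
    int attempts = 0;                      // number of probes actually issued
};

// Calls `probe` up to `max_attempts` times and stops at the first hit.
// `wait` is invoked between attempts (never after the last one), so the
// caller decides how the spacing is implemented (timer, sleep, run loop).
ProbeResult ResolveWithRetry(
    const std::function<std::optional<std::string>()>& probe,
    const std::function<void()>& wait,
    int max_attempts) {
    assert(max_attempts > 0);
    ProbeResult result;
    for (int i = 0; i < max_attempts; ++i) {
        ++result.attempts;
        result.target_id = probe();
        if (result.target_id) break;        // page committed: stop probing
        if (i + 1 < max_attempts) wait();   // still pending: space the next probe
    }
    return result;
}
```

Injecting `wait` keeps the retry policy testable without real timers; a production version would more likely reschedule itself on a dispatch queue than block.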
Addresses every CRITICAL + HIGH finding from the 2026-06-22 robustness audit of the shared-host (N browsers per profile) path. The shipped 0.18s create-pacing was a band-aid that lowered race probability without fixing the structural causes; this replaces it with completion-driven pacing and closes the failure-to-paint, shared-state, and teardown-race gaps.

CRITICAL
- C1 never-painted recovery: a first-present watchdog per browser (Swift) re-kicks a repaint via a new kOpInvalidate op if no frame arrives in ~3s, then surfaces `paintStalled` to Dart (CefWebController.onPaintStalled) if still blank. A blank tile now self-heals or signals instead of staying blank forever with no breadcrumb. cef_host also Invalidate()s on main-frame OnLoadEnd. (Runtime software-paint *recreate* fallback left as a follow-up: the watchdog+signal addresses the unsignalled-blank gap and async create removes the GPU-starvation cause.)
- C2 per-session profile refusal: an ad-hoc host refusing its named profile now re-homes EVERY session on the shared host onto its own ephemeral host (was: a single clobbered handler recovered one tile and shut the shared host down, stranding all siblings), and blocks the profile so later creates skip the doomed host.
- C3 SendFrame write-after-close: snapshot the fd under the write mutex and exchange(-1)-before-close in teardown, so a paint thread can't write into a closed/recycled fd.

HIGH
- H1 reader join/close UAF: gate the join on readerStarted alone (not the racy wasRunning) and never close the fd on a join timeout (leak it).
- H2 "No page found" + cross-tile setAutoAttach leak: each CdpRelay now actively issues its own Target.attachToTarget and learns its page session from its own demuxed response (order-independent, idempotent), synthesizes a single attachedToTarget to the client, and no longer forwards the browser-wide setAutoAttach to the shared session.
- H3 async create: CreateBrowserSync (blocked the single CEF UI thread, serialized + GPU-contended a burst, wedged siblings on a hung create) -> async CefBrowserHost::CreateBrowser bound in OnAfterCreated, which acks kOpCreated so the host's pacer advances by COMPLETION (with a timeout backstop), retiring the 0.18s magic constant.
- H4 torn geometry read: sendCreate reads (w,h,dpr,surfaceId) as one atomic CefWebSession.createSnapshot(); resize publishes surface + dims atomically.
- H5 dual-waitpid: single-owner pid reaping. handleHostDeath takes the pid and hands it back only if it can't reap it; terminateProcess takes both handles under the write lock.
- H6 pacer state on death: clear createSendQueue / createPacerRunning / createInFlight in shutdown() and handleHostDeath(); gate the pump on a live host.
- H7 create-failure to Dart: kOpCreateFailed -> onBrowserFailed -> the plugin drops just that session and emits processGone(reason:createFailed).
- H8 browserId wrap: the debug-only assert is now a hard precondition (a free, non-reserved slot) so a wrap can't silently overwrite a live sibling's slot and misroute cross-tile paint/CDP.
- H9 malformed frame: log the rejected length/opcode (native + Swift) before tearing the host down, so it isn't a silent all-tiles exit.
- M1 pacer recursion: trampoline the disposed-skip instead of recursing.

Compiles clean: cef_host (cmake/ninja), the example macOS app (Swift+Dart), and `flutter analyze lib` (no issues).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
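The C3 fix (snapshot the fd under the write mutex, exchange(-1) before close) is a general pattern worth seeing in isolation. The following is a minimal sketch, not the actual cef_host code: the `FrameChannel` class and its method shapes are hypothetical, and it assumes teardown takes the same write mutex as the paint path.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unistd.h>

// Hypothetical stand-in for the host's frame pipe; not the real cef_host type.
class FrameChannel {
public:
    explicit FrameChannel(int fd) : fd_(fd) { assert(fd >= 0); }

    // Paint-thread side: snapshot the fd while holding the write mutex and
    // write only under that mutex, so teardown cannot close the fd mid-write.
    bool SendFrame(const void* data, std::size_t len) {
        std::lock_guard<std::mutex> lock(write_mu_);
        const int fd = fd_.load();
        if (fd < 0) return false;  // already torn down: drop the frame
        const char* p = static_cast<const char*>(data);
        while (len > 0) {
            const ssize_t n = ::write(fd, p, len);
            if (n <= 0) return false;  // error or nothing written: give up
            p += n;
            len -= static_cast<std::size_t>(n);
        }
        return true;
    }

    // Teardown side: publish -1 first, then close the old value. A writer
    // arriving afterwards sees -1 and never touches a closed or recycled fd.
    void Close() {
        std::lock_guard<std::mutex> lock(write_mu_);
        const int old = fd_.exchange(-1);
        if (old >= 0) ::close(old);
    }

    bool IsOpen() const { return fd_.load() >= 0; }

private:
    std::mutex write_mu_;
    std::atomic<int> fd_;
};
```

Because `exchange(-1)` returns the previous value, a second `Close()` sees -1 and does nothing, so the teardown is idempotent. One tradeoff of this sketch: `Close()` waits for any write already in progress, so a writer blocked on a full pipe would delay teardown.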
…oss-client + watchdog)

Adversarial re-audit of 224829d (6 lenses, every finding refuted from 3 angles) confirmed 9 gaps; this fixes all of them. The two HIGH are regressions the H3 async-create change itself introduced.

HIGH
- Orphan-browser leak on dispose-during-async-create: with CreateBrowserSync now async, a dispose landing between DoCreateBrowser and OnAfterCreated found slot->browser==null and silently no-op'd, so OnAfterCreated then bound a live browser nothing ever closed (leaked renderer + IOSurface until host shutdown). Slot now carries `close_requested`; DoDisposeBrowser records intent when the browser isn't bound yet and OnAfterCreated honors it (CloseBrowser immediately).
- CDP relay cross-client confusion (H2): the relay outlives a ws client, but the in-flight self-attach state (selfAttachPipeId / pending acks) wasn't reset on client detach, so a late attachToTarget response delivered a fabricated attachedToTarget + a stale setAutoAttach ack to the NEXT client. Reset the attach state on detach, stamp each client connection with a generation bumped at connect, and gate handleSelfAttachResponse's client writes on that generation. Also queue concurrent browser-level setAutoAttach acks (pendingAutoAttachClientIds list) so a second one before the first resolves no longer orphans an ack / leaks a pipeId.

MEDIUM
- C2 respawnHostEphemeral: a per-session ephemeral spawn failure `continue`d, stranding that session on the already-shut-down host (blank tile, no signal, leaked session+texture). Now emits processGone(reason:respawnFailed) + disposes.
- C1 watchdog false-positive on off-screen tiles: a hidden (WasHidden) browser produces no frames by design, so the watchdog flagged work_canvas's normal lazy-spawn/viewport-culled tiles as paintStalled after ~7s. The host now peeks opSetVisible, suspends the watchdog while hidden, and re-arms on unhide.

LOW
- Debug CDP-validation handler restored onCdpMessage to its captured `prior`, clobbering a relay fan-out chained on top (relays then received no pipe messages). Stop restoring: the handler short-circuits after `logged` and forwards via prior.
- First-present watchdog took presentLock on every (up to 60fps) present frame; now a per-session firstPresentSeen flag detects first paint under the browsersLock the reader already holds, so the cancel fires once per browser.

Compiles clean: cef_host (cmake/ninja), example macOS app (Swift), flutter analyze.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
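The orphan-browser fix above boils down to recording dispose intent while the async create is still pending. A minimal sketch of that idea follows; only `close_requested` is named in the commit message, while `Browser`, `Slot`'s other field, `DisposeBrowser`, and `OnCreated` are hypothetical stand-ins, and the sketch assumes both paths run on the same thread so no locking is shown.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical model of a browser handle; the real type is a CEF browser.
struct Browser {
    bool closed = false;
    void Close() { closed = true; }
};

// Hypothetical per-browser slot. `browser` stays null between the create
// request and the create-completion callback.
struct Slot {
    std::shared_ptr<Browser> browser;
    bool close_requested = false;
};

// Dispose path: close immediately if the browser is bound; otherwise record
// the intent so the completion callback can act on it.
void DisposeBrowser(Slot& slot) {
    if (slot.browser) {
        slot.browser->Close();
        slot.browser.reset();
    } else {
        slot.close_requested = true;
    }
}

// Completion path: if a dispose arrived first, close the new browser right
// away instead of binding it. Returns true only when the browser was bound.
bool OnCreated(Slot& slot, std::shared_ptr<Browser> browser) {
    assert(browser != nullptr);
    if (slot.close_requested) {
        browser->Close();
        return false;
    }
    slot.browser = std::move(browser);
    return true;
}
```

Without the flag, the early dispose would be a silent no-op and the browser bound later would have no owner left to close it, which is the leak the commit describes.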
Makes the multi-browser-per-host (shared-profile) path production-robust. Three
stacked commits — the shared-host burst-render fix it builds on, the hardening, and
the re-audit follow-ups. Supersedes #5 (its commit is the base of this branch).
1. Shared-host burst-render (foundation — was #5)
Pace per-host browser creates + retry targetId resolution, so a burst of tiles on one
shared host renders instead of leaving later tiles blank.
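As an illustration of per-host create pacing, here is a minimal single-threaded sketch of a one-at-a-time send queue. It advances on a completion signal (the approach section 2 adopts in place of a fixed delay) and is not the Swift implementation: `CreatePacer` and its methods are hypothetical names, and the timeout backstop is assumed to be an external timer that simply calls `OnCompleted` for the in-flight id.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical pacer: at most one create is in flight per host.
class CreatePacer {
public:
    explicit CreatePacer(std::function<void(int)> send)
        : send_(std::move(send)) {
        assert(send_ != nullptr);
    }

    // Queue a create and send it right away if nothing is in flight.
    void Enqueue(int browser_id) {
        queue_.push_back(browser_id);
        Pump();
    }

    // Called on a create ack, or by the timeout backstop for a hung create.
    // Ids that are not the in-flight one are ignored as stale.
    void OnCompleted(int browser_id) {
        if (in_flight_ && *in_flight_ == browser_id) {
            in_flight_.reset();
            Pump();
        }
    }

    // Host death or shutdown: drop all pacer state so nothing is sent to a
    // dead host and a late ack cannot restart the pump.
    void Reset() {
        queue_.clear();
        in_flight_.reset();
    }

    bool Idle() const { return !in_flight_ && queue_.empty(); }

private:
    // Send the next queued create only when nothing is in flight.
    void Pump() {
        if (in_flight_ || queue_.empty()) return;
        in_flight_ = queue_.front();
        queue_.pop_front();
        send_(*in_flight_);
    }

    std::function<void(int)> send_;
    std::deque<int> queue_;
    std::optional<int> in_flight_;
};
```

Enqueuing three ids sends only the first; each `OnCompleted` for the in-flight id releases the next, so a burst is serialized by actual completion rather than by a guessed interval.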
2. Robustness hardening — all 3 CRITICAL + 9 HIGH from a 6-agent audit
- C1: first-present watchdog → `kOpInvalidate` re-kick → `paintStalled` to Dart; cef_host `Invalidate()` on load-end.
- C2: a profile refusal re-homes every session on the shared host onto its own ephemeral host (was: shut the shared host down, stranding siblings).
- C3: `SendFrame` snapshots the fd under the write mutex; `exchange(-1)`-before-close.
- H1: reader join gated on `readerStarted` alone; leak-not-close fd on join timeout.
- H2: each `CdpRelay` actively issues `Target.attachToTarget`, learns its session from its own demuxed response, synthesizes one `attachedToTarget`, and stops forwarding browser-level `setAutoAttach` (fixes "No page found" + a cross-tile control leak).
- H3: async `CreateBrowser` bound in `OnAfterCreated` → `kOpCreated` → completion-driven pacer (with timeout backstop), replacing the 0.18s magic constant.
- H7: `kOpCreateFailed` → drop just that session (`processGone`).
- H8: `browserId`-wrap is a hard precondition (was a debug assert).
3. Re-audit (6 lenses, every finding refuted from 3 angles → 9 confirmed, 6 rejected)
Two HIGH were regressions the H3 async-create change itself introduced:
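- Orphan-browser leak on dispose-during-async-create: fixed with `Slot.close_requested`, recorded by `DoDisposeBrowser` and honored in `OnAfterCreated`.
- CDP relay cross-client confusion: attach state reset on detach + a per-connect client generation gating the response write + queued concurrent-`setAutoAttach` acks.
Also fixed:
- `respawnHostEphemeral` failure now emits `processGone` + disposes (was: stranded the session).
- First-present watchdog is suspended while a tile is hidden (was: false `paintStalled` for hidden/off-screen tiles, the normal lazy-spawn pattern).
- `onCdpMessage` restore no longer clobbers a chained relay fan-out.
- `presentLock` folded into the `browsersLock` the reader already holds.
Verification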
Compiles clean: cef_host (cmake/ninja), the example macOS app (Swift+Dart), the full
firebase-extended Campus build (bundling the rebuilt cef_host), and
`flutter analyze`.
The single-tile example renders end-to-end through the async-create path. Multi-tile
burst + concurrent agent-driving (H2) are compile-verified; full behavioral coverage
needs a live multi-tile Campus session.
🤖 Generated with Claude Code