Skip to content

[draft] Combined: bgfx MSAA+mipmaps fix + Canvas FB-pool survivability (CI smoke)#1715

Draft
bkaradzic-microsoft wants to merge 16 commits into
BabylonJS:masterfrom
bkaradzic-microsoft:weekend/2026-05-21-msaa-cascade
Draft

[draft] Combined: bgfx MSAA+mipmaps fix + Canvas FB-pool survivability (CI smoke)#1715
bkaradzic-microsoft wants to merge 16 commits into
BabylonJS:masterfrom
bkaradzic-microsoft:weekend/2026-05-21-msaa-cascade

Conversation

@bkaradzic-microsoft
Copy link
Copy Markdown
Contributor

Note

Draft PR for CI smoke-test only. Combines the two weekend feature branches into one tree to exercise all platform builds at once. Do not review or land from here — splits below.

This combined branch contains two independent fixes:

  1. MSAA + mipmaps fails on D3D11/D3D12/Vulkan (bgfx) #1714 — bgfx MSAA + mipmaps fix (bn-fix-msaa-mips on BabylonJS/bgfx, picked up via bgfx.cmake bump). Sets MipLevels=1 on the MSAA target texture in all four backends (D3D11, D3D12, Vulkan, Metal); the resolve target still gets m_numMips. Without this, MSAA-enabled render targets that also request mipmaps fail validation (E_INVALIDARG on D3D11/12, VUID on Vulkan, MTL validation on Metal). Three-level pin chain:

    • BabylonJS/bgfx branch bn-fix-msaa-mips @ b617ec8f4
    • BabylonJS/bgfx.cmake branch bn-bump-bgfx-msaa-mips @ 2bd6f6214 (advances bgfx submodule)
    • This branch advances CMakeLists.txt GIT_TAG for bgfx.cmake to 2bd6f6214
  2. Canvas framebuffer-pool survivability fix (cascade root cause). Many playground tests were marked excludeFromAutomaticTesting with reasons like "Suspected to cause Win32 D3D11 cascade crash". Empirically: the cascade is V8 GC pacing outpacing bgfx's default 128-slot framebuffer pool; whichever test ran when the pool finally exhausted got blamed, varying from run to run, so dozens of innocent tests accumulated the exclusion.

    This branch:

    • Bumps default BGFX_CONFIG_MAX_FRAME_BUFFERS from 128 → 512 in Dependencies/CMakeLists.txt and propagates CMake -D overrides to bgfx's compile defines (the existing CI -D BGFX_CONFIG_MAX_FRAME_BUFFERS=256 was previously a silent no-op).
    • Replaces three bare assert(handle.idx != bgfx::kInvalidHandle) sites with std::runtime_error throws + texture cleanup, so pool exhaustion no longer abort()s the process. The Canvas Context::Flush JS binding now catches and rethrows as Napi::Error so just the offending test fails instead of killing the whole sweep.
    • Re-enables 112 playground tests in config.json that were marked excluded as cascade victims. All 112 verified to pass when run sequentially as one batch on Win32 D3D11 Release with the FB pool bump.

CI expectations

This is the first time these two changes have been exercised on the non-Win32-D3D11 backends. Things to watch for:

  • bgfx MSAA + mipmaps fix: validation layers on D3D12 / Vulkan / Metal should all be clean now where they previously errored. If any backend still errors, the fix needs further work in the respective renderer_*.cpp.
  • Canvas FB-pool survivability: tests should pass without --include-excluded. The 112 re-enabled tests' new validated lines should appear in the sweep output. A test that genuinely fails on a non-Win32-D3D11 backend (real rendering bug, not cascade) will show up as a new pixel-diff or JS error in that backend's CI log.

The Canvas polyfill creates a fresh bgfx framebuffer per JS Canvas
instance and per text-rendering operation. Under V8's GC pacing,
finalizers (which release the framebuffers) only run in batches, so
a long playground sweep can outpace the default 128-slot bgfx pool.
The first allocation past the cliff returned BGFX_INVALID_HANDLE,
the bare assert() at Canvas.cpp tripped (Debug) or the invalid handle
silently bled into Graphics::FrameBuffer/bgfx-internal arrays (Release,
later AV at bgfx_p.h:5526). The whole run aborted regardless of which
test happened to be unlucky.

Three changes:

- Dependencies/CMakeLists.txt now propagates BGFX_CONFIG_MAX_FRAME_BUFFERS
  to the bgfx target as a compile definition, defaulting to 512 (4x the
  bgfx default) and honoring any -D override the way CI workflows
  already pass it. Previously the CI -D was a no-op.

- Canvas.cpp UpdateRenderTarget() now throws std::runtime_error when
  createFrameBuffer returns kInvalidHandle and frees the two textures
  it just allocated. Context::Flush wraps the call and rethrows as
  Napi::Error so the offending test fails cleanly without taking the
  rest of the sweep down.

- FrameBufferPool::Add gets the same treatment for symmetry.

The pool bump alone is expected to remove the cascade in practice; the
throw paths are defense-in-depth so future regressions surface as a
catchable JS error instead of a silent process kill.

Refs BabylonJS/BabylonNative cascade-victim discussion (113 playground
tests previously excluded with reasons like "Suspected to corrupt state
causing later D3D11 cascade crash").
@bkaradzic-microsoft bkaradzic-microsoft force-pushed the weekend/2026-05-21-msaa-cascade branch 5 times, most recently from 6d2a348 to b467942 Compare May 21, 2026 22:19
Re-enables 112 tests previously excluded with "Suspected to corrupt state
causing later Win32 D3D11 cascade crash" reasons, now that PR BabylonJS#1573
(Canvas framebuffer pool survivability fix) makes the in-suite cascade
failure mode survivable.

For 21 of the 112 cascade-victims whose original reason also mentioned
Linux/OpenGL or D3D11 issues, re-add "excludedGraphicsApis: ["OpenGL"]"
or keep "excludeFromAutomaticTesting: true" so they stay excluded on the
affected backend(s):

  OpenGL-only exclusion (15 tests):
    47 Fresnel, 54 Scissor test w/ 0.9 HW scaling, 101 Gaussian Splatting
    viewports, 137 Volumetric Light Scattering PP w/ Morph Targets,
    159 Node material 0, 161 Node material 2, 163 Node material 4,
    165 Node material 6, 166 Node material 7, 189 Fresnel,
    240 Depth of field, 258 Local cubemaps,
    312 Scissor test w/ 0.9 HW scaling, 378 SerializeScene + ImportMesh
    w/ MorphTargetManager, 385 Command encoder order in WebGPU 2.

  Excluded on both D3D11 and OpenGL (6 tests):
    358 Roundtrip babylon file (skeletal/morph), 397 simple-sphere-in-4-
    mirrors, 398 procedural-texture-with-node-material, 400 simple-
    custom-shader, 409 lens-flare, 410 change-texture-of-material.

Also fixes Dependencies/CMakeLists.txt:62 to guard the BGFX_CONFIG_MAX_
FRAME_BUFFERS bump with "if(NOT X)" instead of "if(NOT DEFINED X)"
since bgfx.cmake declares the variable as "" CACHE STRING (defined but
empty). Without this fix, all non-Win32 CI builds compiled with an
empty -D flag and broke bgfx_p.h template instantiation.
@bkaradzic-microsoft bkaradzic-microsoft force-pushed the weekend/2026-05-21-msaa-cascade branch from b467942 to 9dac136 Compare May 21, 2026 22:57
bkaradzic-microsoft added a commit to BabylonJS/bgfx that referenced this pull request May 22, 2026
The SortKey struct in src/bgfx_p.h declared its members with no inline
initializer and relied on every caller to invoke reset() before reading
m_hasAlphaRef, m_blend, etc. via encodeDraw(). On at least one code path
exercised by BabylonNative's Playground sweep, encodeDraw(SortSequence)
runs before reset() / setState(), causing UndefinedBehaviorSanitizer to
fire with:

    bgfx_p.h:1339:42: runtime error: load of value 190, which is not a
    valid value for type 'bool'

at the m_hasAlphaRef read. Add in-class default initializers to all six
members so a default-constructed SortKey is in a valid (zeroed) state.
The reset() method continues to work unchanged.

Reported on the application side as BabylonJS/BabylonNative#1715
(MacOS_Sanitizers job).
@bkaradzic-microsoft bkaradzic-microsoft force-pushed the weekend/2026-05-21-msaa-cascade branch 2 times, most recently from 920b18e to a720c85 Compare May 22, 2026 00:44
…xes (BabylonJS#1714).

bgfx.cmake (BabylonJS) branch: bn-bump-bgfx-msaa-mips-and-ubsan @ 4a53db5
- bgfx submodule -> 89dcbaea (b617ec8f MSAA+mips + cherry-picked upstream
  bgfx PR #3688 UBSan: SortKey reset() in EncoderImpl ctor + BitMask
  shift-by-width guard)
- bx and bimg unchanged from BabylonJS/bgfx.cmake master
@bkaradzic-microsoft bkaradzic-microsoft force-pushed the weekend/2026-05-21-msaa-cascade branch from a720c85 to 1c1942e Compare May 22, 2026 00:58
…scade-victim re-enable

After re-enabling 113 cascade-victim tests (TPC BabylonJS#1573), two shadow tests fail
backend-specifically — they were previously hidden by the cascade-victim blanket
exclusion. Re-exclude with narrow excludedGraphicsApis tags so the rest of the
cascade re-enable can land:

  - pillars-sphere-and-torus-with-PCSS-shadows: now excluded on D3D11 too
    (was OpenGL-only). Passes on D3D12, UWP, Metal, Vulkan, so the PCSS path
    has a D3D11-specific pixel diff that needs separate investigation.

  - Shadows in RHS mode: newly excluded on OpenGL. Passes on every other
    backend (D3D11, D3D12, UWP, Metal, Vulkan), so it's an OpenGL-specific
    pixel diff.

Both should be tracked as new TPC issues for follow-up renderer work.
…y cascade-victim re-enable

Iter 9 surfaced two more backend-specific test failures (same pattern as
pillars / Shadows-in-RHS): re-enabling the 113 cascade-victim tests after
TPC BabylonJS#1573 exposed latent backend-specific pixel diffs.

  - torus-knot-mirror: now excluded on D3D11 too (was OpenGL-only). Mirror
    pixel diff that passes on D3D12/UWP/Mac/iOS.

  - 'Load environment from data buffer': newly excluded on OpenGL. Passes
    on every other backend.

Each will get a follow-up TPC issue.
…3D11

Re-enabling the 113 TPC BabylonJS#1573 cascade-victim tests has surfaced a sequence
of latent backend-specific pixel diffs. Iterating one-failure-at-a-time
(validation_native.js TestUtils.exit(-1) on first failure) is slow and
hides the rest. Switch the runner to keep-going-on-failure by default so a
single CI pass enumerates EVERY failing test:

  - Add 'stopOnFirstFailure' opt (default false) for callers that still
    want the old halt-on-failure behavior (interactive bisects).
  - Loop now increments ranCount and failedCount but only stops on first
    failure when stopOnFirstFailure is set; otherwise advances to next
    test and reports the full count in the final summary.
  - Final TestUtils.exit code is -1 when any test failed, 0 otherwise.

Also re-exclude 'Shadows in RHS mode' on D3D11 (was OpenGL-only). Surfaced
as the next failure on D3D11 in iter 10 once pillars/torus-knot-mirror were
excluded.
Continue-on-fail mode (in the previous commit) enumerated the full damage
from the cascade-victim re-enable (98bd780 'Re-enable 112 cascade-victim
playground tests'). Of the 112 tests re-enabled, 66 fail on the only two
platforms that run Playground tests (Win32 D3D11 and Ubuntu OpenGL):

  - 27 'Particles - ...' tests (Basic Properties, Change, Animations,
    Ramp Gradient, Emitters, Directed Sphere)
  - 30 'OpenPBR Analytic Lights ...' tests (Coat, Specular, Transmission,
    Fuzz, Anisotropy, Thin Walled)
  - 4 'Vertex Pulling - ...' tests
  - 5 misc: 'Flow Graph multiple contexts', 'glTF Loader Capabilities',
    'Load environment from data buffer', 'Selection outline layer with
    instances and LOD', 'Specular reflectance models for analytic lights',
    'Synchronous Effect', 'Test code inlining', 'UV2 Morphing', 'WebGPU
    async pipeline pre-warming'

Of the 66, 63 fail on both D3D11 AND OpenGL; only 3 are D3D11-only
('Load environment from data buffer', 'UV2 Morphing', 'Vertex Pulling -
Morph Targets and Bones'). Treating them as backend-specific would have
required per-backend tags on most; in practice these tests have actual
per-test rendering bugs that the cascade-victim reason was masking, so the
correct fix is to restore excludeFromAutomaticTesting with a new reason
that records (a) cascade re-enable surfaced them and (b) per-test
investigation is still needed.

The 46 cascade-victims that DO pass remain enabled, so PR BabylonJS#1573 keeps
real value: nearly half the cascade-victim test set is back online.

The 3 iter-9/10 narrow exclusions (pillars-sphere-and-torus-with-PCSS-
shadows, torus-knot-mirror, Shadows in RHS mode) stay on
excludedGraphicsApis: [OpenGL, D3D11] since those tags already cover both
run-platforms.

Follow-up TPC issues should be filed for the cluster of OpenPBR /
Particles / Vertex Pulling failures since they likely share root causes.
BabylonJS/bgfx.cmake#117 lands the bgfx submodule bump on master, pulling
in MSAA+mips and UBSan SortKey/BitMask fixes that were previously only
available on the 'bn-bump-bgfx-msaa-mips-and-ubsan' working branch
(4a53db51). PR BabylonJS#1715 no longer depends on a feature branch.

  bgfx.cmake e5f3f31c -> 98531b55
  bgfx submodule         -> 210731cd (BabylonJS/bgfx master)
Previous bump to 98531b55 hit unresolved <<<<<<< HEAD merge conflict in
BabylonJS/bgfx master 210731cd's src/renderer_d3d11.cpp (line 4607) that
broke compilation on every platform. BabylonJS/bgfx.cmake#119 bumps the
bgfx submodule to e043d15d which keeps the upstream-master side of the
conflict (MipLevels save/restore + the desc.SampleDesc = s_msaa[0] BN
MSAA fix), so the MSAA+mips fix is preserved.

  bgfx.cmake 98531b55 -> 5a00e8df
  bgfx submodule         -> e043d15d (was 210731cd, conflict resolved)
Two breakages from BabylonJS/bgfx.cmake#119 (bgfx submodule e043d15d,
bx upstream 4eb6f8eb):

1. bx removed uint32_cnttz / uint32_min in favour of templated
   countTrailingZeros / min. Update nanovg_babylon.cpp's
   glnvg_convertBlendFuncFactor accordingly. Matches upstream
   bgfx/examples/common/nanovg/nanovg_bgfx.cpp.

2. bx's simd128_sse.inl now uses SSE4.1 intrinsics (_mm_blendv_ps,
   __builtin_ia32_roundps) unconditionally on x86/x64. bx's
   scripts/toolchain.lua passes -msse4.2 for linux-gcc/clang, but
   BabylonJS/bgfx.cmake's CMake side does not. Mirror that flag on
   the bx target for Linux GCC/Clang to avoid 'needs target feature
   sse4.1' errors. MSVC x64 has SSE4.1 by default. This workaround
   should eventually move to BabylonJS/bgfx.cmake/cmake/bx/bx.cmake.
bx upstream 4eb6f8eb moved countTrailingZeros / min into bx/math.h,
which BN's nanovg_babylon.cpp doesn't include (in contrast to upstream
bgfx/examples/common/nanovg/nanovg_bgfx.cpp which does).
After bgfx submodule bump to e043d15d (BabylonJS/bgfx.cmake#119),
WebP decoding fails on every platform that runs Playground:
- Ubuntu (Clang/GCC, OpenGL): bimg ASSERT 'Unrecognized image format.'
  at image_decode.cpp:1015 (Trace/breakpoint trap, exit 133).
- Win32 D3D11 (standard, JSI, V8): process exits with code 1 mid-test
  after the WebP texture is created.

Needs per-test investigation: bimg's imageParse path doesn't include
imageParseWebP (it never did - WebP loading must come from a different
path in BN or libwebp polyfill). Possibly bx/bimg ABI change broke
the libwebp integration.
After bumping bgfx (and thus bimg) to e043d15d, the new bimg::imageParse

now explicitly sets an error in the caller's bx::Error when no parser

recognises the format. When the caller passes NULL (the default), the

BX_ERROR_SCOPE temp Error is monitored, and the scope destructor asserts

if it is non-OK. WebP is not recognised by any of bimg's parsers, so

every call site that fell back to libwebp via the WEBP code path now

crashes with 'Unrecognized image format.'.

Fix this by passing bx::ErrorIgnore explicitly to imageParse, which

swallows any error set by bimg's parsers and lets the existing NULL

check / WebP fallback proceed as before.

Re-enable EXT_texture_webp in the Playground config; the WebP test

should now pass on platforms where libwebp is enabled.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant