Optionally store source maps as VLQ encoded (2/2): Transformer output, unstable_compactSourceMaps (#1743)
Open
robhogan wants to merge 4 commits into
@robhogan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109216060.
Summary:
Scripts and findings for profiling Metro's memory and CPU during bundling, and an end-to-end benchmark of the compact VLQ source-map work stacked on top.

**Methodology:**
- Start Metro with `NODE_ARGS="--expose-gc --inspect=9230" DEV=1 js1 run --prefetch=false`
- WildeBundle URL: `GET http://localhost:8081/xplat/js/RKJSModules/EntryPoints/WildeBundle.bundle?platform=ios&dev=true&app=com.facebook.Wilde`
- RSS profiling via /proc, heap snapshots via Chrome DevTools Protocol
- Graph freed via DELETE to the bundle URL (same as fill-http-cache)

**Scripts added:**
- `fb-metro-cli/memory-investigation/heap-profile.js`: Automated CDP-based profiler; captures 3 heap snapshots (baseline, post-build, post-delete) and compares them
- `fb-metro-cli/memory-investigation/heap-compare.js`: Standalone snapshot comparator with streaming parser for multi-GB .heapsnapshot files
- `fb-metro-cli/memory-investigation/heap-injector.js`: Optional in-process module exposing /memory, /gc, /snapshot HTTP endpoints
- `metro/scripts/profile-memory.sh`: Quick RSS-only profiling via /proc
- `fb-metro-cli/memory-investigation/compact-bench-measure.js`: One measurement cycle; builds WildeBundle, then requests WildeBundle.map, recording memory (RSS/heap) + build CPU + .map serialize CPU via CDP
- `fb-metro-cli/memory-investigation/run-compact-bench.sh`: Orchestrator; fresh Metro per repeat across three configs (base / compact_flat / compact_indexed), cold or warm cache
- `fb-metro-cli/memory-investigation/compact-bench-stats.js`: Welch t-test analysis between any two configs
- `fb-metro-cli/memory-investigation/README.md`, `compact-sourcemaps-benchmark-results.md`: Full writeup of methodology and results

**Baseline results (WildeBundle, June 2025):**
- Startup: 819 MB RSS / 426 MB heap used
- Post-build: 2,338 MB RSS / 1,549 MB heap used (+1,122 MB heap)
- Post-delete: 507 MB heap used (DELETE frees 93% of build growth)
- Arrays dominate: 10M Array objects + backing stores = 858 MB (77% of growth)
- Source maps stored as decoded number-tuple arrays are the primary consumer: ~678 MB, 60% of build growth (9,866,476 tuples across 16,562 modules)

**Compact source maps: end-to-end benchmark (n=3, WildeBundle):**
Three configs: `base` (decoded tuples), `compact_flat` (VLQ storage, flat .map), `compact_indexed` (VLQ storage, indexed passthrough .map).
- Memory (both compact configs): heap −51% cold / −53% warm; RSS −48% (1654→810 MB heap cold; all Welch p < 1e-5).
- Build CPU: unchanged cold; ~20% faster warm with compact storage.
- Serialize CPU (`.map` request): `compact_flat` +18% vs base (decode + re-encode), `compact_indexed` −49% vs base (passthrough).

Flat .map is byte-identical to base; indexed .map is +3.4% larger. Bundle output byte-identical across all configs. Full tables in `compact-sourcemaps-benchmark-results.md`.

Differential Revision: D107879392
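For readers unfamiliar with the Welch comparison mentioned for `compact-bench-stats.js`, the following is a minimal sketch of the statistic itself. It is not the script from this stack: the function names are hypothetical, and it stops at the t statistic and the Welch-Satterthwaite degrees of freedom (turning those into a p-value needs a t-distribution CDF, which is left out).

```javascript
// Sample mean and unbiased sample variance (n - 1 denominator).
function meanAndVariance(xs) {
  const n = xs.length;
  const mean = xs.reduce((sum, x) => sum + x, 0) / n;
  const variance = xs.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return {n, mean, variance};
}

// Welch's unequal-variance t statistic and its approximate degrees of freedom.
// Hypothetical helper for illustration; not taken from compact-bench-stats.js.
function welch(a, b) {
  const A = meanAndVariance(a);
  const B = meanAndVariance(b);
  const va = A.variance / A.n; // squared standard error of mean(a)
  const vb = B.variance / B.n; // squared standard error of mean(b)
  const t = (A.mean - B.mean) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (A.n - 1) + vb ** 2 / (B.n - 1));
  return {t, df};
}
```

With n=3 per config, as in the benchmark above, the degrees of freedom are small, so only large and consistent differences reach significance.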
Summary:
The transform worker built its source-map tuples via
`result.rawMappings.map(toSegmentTuple)`. Accessing `result.rawMappings` forces
`babel/generator` to run a second decode (`allMappings`) that allocates a flat
array of ~4-5 objects per segment — even though Babel *already* computed an
equivalent decoded map (`result.decodedMap`, the jridgewell/gen-mapping decoded
format) eagerly during generation and Metro was discarding it.
This swaps the source to `result.decodedMap` via a new
`tuplesFromBabelDecodedMap` (decoded source lines are 0-based -> +1, name indices
resolved against `decodedMap.names`). Output is byte-identical to
`result.rawMappings.map(toSegmentTuple)`, and it eliminates the redundant
`allMappings` decode for *every* build (not just compact source maps).
This is a standalone, unconditional improvement, so it sits first in the stack
ahead of the compact-source-map work, which builds on it.
- `metro-source-map`: add `BabelDecodedMap` type + `tuplesFromBabelDecodedMap`.
- `metro-transform-worker`: source tuples from `result.decodedMap`.
- `babel_v7.x.x` libdef: add `decodedMap` to `GeneratorResult`.
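The conversion described above can be sketched roughly as follows. This is an illustration and not the code in this diff: the function name is hypothetical, and it assumes the gen-mapping decoded layout (one array of segments per generated line, each segment `[generatedColumn]`, `[generatedColumn, sourceIndex, sourceLine, sourceColumn]`, or the same plus a `nameIndex`) and 1-based generated lines in Metro's tuples.

```javascript
// Hypothetical stand-in for `tuplesFromBabelDecodedMap`; not Metro's implementation.
function tuplesFromDecodedMapSketch(decodedMap) {
  const tuples = [];
  decodedMap.mappings.forEach((segments, lineIndex) => {
    // Assumption: the array index is the 0-based generated line; tuples use 1-based.
    const generatedLine = lineIndex + 1;
    for (const segment of segments) {
      const generatedColumn = segment[0];
      if (segment.length === 1) {
        // Segment with no original position.
        tuples.push([generatedLine, generatedColumn]);
      } else if (segment.length === 4) {
        // Decoded source lines are 0-based, so add 1.
        tuples.push([generatedLine, generatedColumn, segment[2] + 1, segment[3]]);
      } else {
        // Resolve the name index against `decodedMap.names`.
        const name = decodedMap.names[segment[4]];
        tuples.push([generatedLine, generatedColumn, segment[2] + 1, segment[3], name]);
      }
    }
  });
  return tuples;
}
```

The source index (`segment[1]`) is dropped in this sketch because each per-module map is assumed to describe a single source file.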
Microbenchmark (real `babel/generator` 7.29.1, 133 modules / ~30.6K segments,
`--expose-gc`, median of 11): `generate()` alone 20.2 ms; `generate()` + access
`decodedMap` 19.2 ms (~0 delta — it's a sunk, eager cost); `generate()` + access
`rawMappings` 28.8 ms (+8.6 ms) with ~40% more heap (19.5 vs 13.9 MB). So
consuming `decodedMap` drops the `rawMappings`/`allMappings` decode entirely.
(`decodedMap` is eager in 7.29.1; even if a future Babel makes it lazy, it
allocates arrays of numbers versus `rawMappings`' nested objects, so its cost
should stay at or below that of `rawMappings`.)
## E2E benchmark — cold WildeBundle (this diff vs baseline = parent)
Interleaved, paired A/B: each of 12 rounds runs one cold build per cell —
{baseline, this diff} x {child-process workers, worker threads} — so slow
machine drift is shared within each round and cancels in the per-round delta.
Fresh Metro per build, transform cache wiped (cold), `maxWorkers=16`, default
path (no compact source maps). "Transform CPU" = total user+sys CPU across the
whole worker process tree; "tree RSS" = whole-tree resident set (captures
workers in both modes); "graph heap" = main-isolate heapUsed post-build (the
retained module graph). base/this-diff columns are medians; Δ is the paired
mean with a 95% CI (Student-t, 11 df); "n.s." = CI includes 0.
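As a rough illustration of the paired statistic described above (not the `worker-bench-stats.js` script itself; the function name and return shape are hypothetical), the per-round deltas are averaged and the interval is the mean plus or minus the critical t value times the standard error. For 12 rounds (11 df) the two-sided 95% critical value is about 2.201; the caller passes it in, since this sketch does not compute t quantiles.

```javascript
// Paired mean delta with a Student-t confidence interval.
// `tCritical` is supplied by the caller, e.g. about 2.201 for 11 df at 95%.
function pairedDeltaCI(baseline, candidate, tCritical) {
  const deltas = candidate.map((value, i) => value - baseline[i]);
  const n = deltas.length;
  const mean = deltas.reduce((sum, d) => sum + d, 0) / n;
  const sd = Math.sqrt(
    deltas.reduce((sum, d) => sum + (d - mean) ** 2, 0) / (n - 1),
  );
  const halfWidth = (tCritical * sd) / Math.sqrt(n);
  const low = mean - halfWidth;
  const high = mean + halfWidth;
  // "n.s." in the tables corresponds to an interval that includes 0.
  return {mean, low, high, significant: low > 0 || high < 0};
}
```

As a sanity check on the reported numbers: with the child-process SD of about 13 s quoted in the takeaways, 2.201 × 13 / √12 ≈ 8.3 s, which is close to the roughly 8.1 s half-width of the transform-CPU interval in the child-process table.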
Child-process workers (Metro default; 12 paired rounds):
| metric | baseline | this diff | Δ (95% CI) |
|---|---|---|---|
| transform CPU (s) | 625 | 612 | **-16.6 (-2.6%) [-24.7, -8.5]** |
| build wall (s) | 65.9 | 65.6 | -0.5 (-0.7%) n.s. |
| transient tree RSS (GB) | 15.8 | 16.0 | +0.06, n.s. |
| post-build tree RSS (GB) | 15.1 | 15.1 | +0.08, n.s. |
| graph heap, main isolate (GB) | 1.59 | 1.59 | ~0, n.s. |
Worker threads (`unstable_workerThreads`; 12 paired rounds):
| metric | baseline | this diff | Δ (95% CI) |
|---|---|---|---|
| transform CPU (s) | 664 | 653 | -18.6 (-2.8%) [-37.5, +0.3] |
| build wall (s) | 59.8 | 59.5 | -1.2 (-1.9%) n.s. |
| transient RSS (GB) | 13.2 | 12.7 | -0.46 (-3.5%) [-0.81, -0.11] |
| post-build RSS (GB) | 12.3 | 11.9 | -0.45 (-3.7%) [-0.80, -0.10] |
| graph heap, main isolate (GB) | 1.60 | 1.60 | ~0, n.s. |
Takeaways:
- **Transform CPU drops ~2.6-2.8%, equally in both worker modes** — the point
estimates (-16.6 s child-process, -18.6 s threads) agree to within 2 s and
their CIs overlap almost entirely, so there is no real asymmetry. This is
exactly what the mechanism predicts: the optimization runs *inside* the worker
(consume `decodedMap` instead of forcing the `rawMappings`/`allMappings`
decode), so the saving is identical whether the worker is a child process or a
thread. (An earlier small-n pass suggested a child-process-only win; that was
sampling noise — threads-mode CPU is just noisier, SD 30 s vs 13 s, which only
widens its CI without moving the point estimate.)
- Build wall time is ~1-2% lower in both modes but within noise — the CPU saving
is spread across 16 workers, so it moves the critical path little.
- Main-isolate post-build heap (the retained graph of stored tuples) is
unchanged in every config — no memory regression, byte-identical output.
- Transient/post tree RSS shows a ~0.5 GB (~3.5%) reduction that is resolvable
only in the lower-variance threads configuration; the noisier child-process
configuration (RSS ~16 GB, CI half-width ~0.3 GB) cannot corroborate it, so
treat it as suggestive, not established.
Harness: `memory-investigation/run-worker-bench-ab.sh` (interleaved A/B) +
`worker-bench-measure.js` + `worker-bench-stats.js` (paired CIs), in the base
diff of this stack. Worker-threads mode under `js1 run` is GK-gated
(`metro_worker_threads`); benched via a local `FORCE_WORKER_THREADS` override
(not committed).
Reviewed By: huntie, GijsWeterings
Differential Revision: D108506323
…sumer support (#1742)

Summary:

## This stack
Decoded tuple arrays are the single largest contributor to Metro's dev-server heap on large bundles (~10 million retained small arrays on the FBiOS entry bundle, for example). Storing the same data as a compact VLQ string instead removes most of that footprint.

In benchmarks, this reduces heap used by ~51% and RSS by ~48% for that ~16K module bundle.

The emitted whole-bundle source map is unchanged. When a module's map is stored as VLQ, `fromRawMappings` decodes it back to tuples just-in-time, with request-scoped caching. The trade-off is therefore decode + re-encode CPU when a `.map` is actually requested or a `/symbolicate` request is made.

A plain `string` is used for `mappings` for now, since VLQ is ASCII by design. A `Uint8Array` would be marginally more efficient and potentially transferable to/from worker threads, but would require more invasive changes to cache (de)serialisation. I did some benchmarking with this and it doesn't justify the complexity right now.

## This diff
Adds a `VlqMap` type (`{mappings: string, names: ReadonlyArray<string>}`) as an alternative to the current `Array<MetroSourceMapSegmentTuple>` for storing per-module source maps in `Module` graph nodes (and transform results, and cache artifacts).

Adds the ability to store, thread, decode and (flat-)emit VLQ maps - **nothing actually produces them yet**, so these code paths are unused except by tests. The opt-in producer flag lands in the next diff.

## Follow up
After this mini-stack, we'll add an opt-in for emitting index source maps, directly re-using per-module VLQ and eliminating the trade-off mentioned above.

Reviewed By: huntie, javache

Differential Revision: D107973884
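To make the storage format concrete, here is a minimal sketch of encoding sorted tuples into a `{mappings, names}` pair. It is not Metro's encoder: the function names are hypothetical, and it assumes the standard source map v3 `mappings` layout (Base64 VLQ, delta-encoded fields, `,` between segments, `;` between lines), a single source at index 0, 1-based lines in the input tuples, and input already sorted by generated position.

```javascript
const BASE64 =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Base64 VLQ: sign goes in the lowest bit, then 5 bits per character,
// with bit 32 set on every character except the last.
function encodeVlq(value) {
  let vlq = value < 0 ? (-value << 1) | 1 : value << 1;
  let out = '';
  do {
    let digit = vlq & 31;
    vlq >>>= 5;
    if (vlq > 0) digit |= 32; // continuation bit
    out += BASE64[digit];
  } while (vlq > 0);
  return out;
}

// Hypothetical sketch of a tuples -> VLQ map encoder; not Metro's code.
function vlqMapFromTuplesSketch(tuples) {
  const names = [];
  const nameIndexes = new Map();
  let mappings = '';
  let prevLine = 1, prevCol = 0, prevSrcLine = 0, prevSrcCol = 0, prevName = 0;
  let firstInLine = true;
  for (const [line, col, srcLine, srcCol, name] of tuples) {
    while (prevLine < line) { // each ';' starts a new generated line
      mappings += ';';
      prevLine++;
      prevCol = 0; // generated column deltas reset per line
      firstInLine = true;
    }
    if (!firstInLine) mappings += ',';
    firstInLine = false;
    mappings += encodeVlq(col - prevCol);
    prevCol = col;
    if (srcLine != null) {
      mappings += encodeVlq(0); // source index delta (single source assumed)
      mappings += encodeVlq(srcLine - 1 - prevSrcLine); // stored 0-based
      prevSrcLine = srcLine - 1;
      mappings += encodeVlq(srcCol - prevSrcCol);
      prevSrcCol = srcCol;
      if (name != null) {
        let index = nameIndexes.get(name);
        if (index === undefined) {
          index = names.length;
          names.push(name);
          nameIndexes.set(name, index);
        }
        mappings += encodeVlq(index - prevName);
        prevName = index;
      }
    }
  }
  return {mappings, names};
}
```

For example, the three tuples `[1, 0, 1, 0]`, `[1, 4, 1, 4, 'foo']`, `[2, 0, 2, 0]` encode to the 15-character string `AAAA,IAAIA;AACJ` plus `names: ['foo']`, whereas the decoded form holds three arrays with 13 elements in total.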
…, `unstable_compactSourceMaps` (#1743)

Summary:

## This stack
Decoded tuple arrays are the single largest contributor to Metro's dev-server heap on large bundles (~10 million retained small arrays on the FBiOS entry bundle, for example). Storing the same data as a compact VLQ string instead removes most of that footprint.

In benchmarks, this reduces heap used by ~51% and RSS by ~48% for that ~16K module bundle.

The emitted whole-bundle source map is unchanged. When a module's map is stored as VLQ, `fromRawMappings` decodes it back to tuples just-in-time, with request-scoped caching. The trade-off is therefore decode + re-encode CPU when a `.map` is actually requested or a `/symbolicate` request is made.

A plain `string` is used for `mappings` for now, since VLQ is ASCII by design. A `Uint8Array` would be marginally more efficient and potentially transferable to/from worker threads, but would require more invasive changes to cache (de)serialisation. I did some benchmarking with this and it doesn't justify the complexity right now.

## This diff
Adds `unstable_compactSourceMaps` (default `false`). When enabled, the transform worker stores each module's source map as a compact VLQ string (`VlqMap`) instead of a decoded `Array<MetroSourceMapSegmentTuple>`.

Each module's map originates from one of three sources, so we encode the VLQ the cheapest way available in each case (all byte-identical to the decoded-tuple output):
- transformJS, not minifying (the dominant path, since Hermes targets don't minify): encode the `VlqMap` straight from `result.decodedMap`, which `babel/generator` computes eagerly while generating, via `vlqMapFromBabelDecodedMap`, never materialising tuples.
- transformJS, minifying: the minifier returns its own map (not Babel's), so we re-encode the resulting tuples with `vlqMapFromTuples`.
- transformJSON: builds tuples directly (no Babel generate), so it likewise re-encodes with `vlqMapFromTuples`.

`countLines` is split out of `countLinesAndTerminateMap` so the decoded-map fast path can compute the terminating mapping without building and terminating a tuple array first.

## Benchmarks
*Cold cache (n=3, means)*

| Metric | base | compact |
|---|---|---|
| **Heap used** | 1653.7 MB | **809.7 MB (−51.0%)** |
| **RSS** | 1854.2 MB | 955.2 MB (−48.5%) |
| Heap growth (build) | 1606.5 MB | 761.2 MB (−52.6%) |
| Build CPU (`.bundle`) | 23.05 s | 22.42 s (n.s.) |
| **Serialize CPU (`.map`)** | 11.99 s | **14.19 s (+18.4%)** |

*Warm cache (n=3, means)*

| Metric | base | compact |
|---|---|---|
| **Heap used** | 1552 MB | **731 MB (−52.9%)** |
| **RSS** | 1775 MB | 923 MB (−48.0%) |
| Build CPU (`.bundle`) | 10.92 s | 8.86 s (−18.9%) |
| **Serialize CPU (`.map`)** | 11.87 s | **13.89 s (+17.0%)** |

## Why behind a flag?
1) The `map` structure is exposed to custom serialisers, so changing it is semver-breaking. Landing this as experimental opt-in in a non-breaking release allows integrators to experiment with it.
2) This is a trade-off of retained memory vs CPU required to emit a flat source map or symbolicate errors. The trade-off largely goes away with indexed maps (coming next) - but that is a semver-breaking change to output.

Changelog:
```
- **[Experimental]**: Add `unstable_compactSourceMaps` to use a more memory-efficient source map format.
```

Differential Revision: D109216060
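The CPU side of the trade-off comes from decoding the stored string back into tuples when a `.map` or `/symbolicate` request arrives. The sketch below illustrates that step with a per-request cache. It is not Metro's `fromRawMappings`: the function names are hypothetical, the `WeakMap` keyed by the map object is an assumed way to scope the cache to one request, and the layout assumed is standard source map v3 `mappings` with a single source and 1-based lines in the resulting tuples.

```javascript
const BASE64_CHARS =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical decoder: {mappings, names} -> Metro-style tuples.
function decodeVlqMapSketch({mappings, names}) {
  const tuples = [];
  let line = 1, col = 0, srcLine = 0, srcCol = 0, nameIndex = 0;
  let fields = []; // decoded deltas of the segment being read
  let value = 0, shift = 0;
  const flush = () => {
    if (fields.length === 0) return; // empty line or trailing separator
    col += fields[0];
    if (fields.length === 1) {
      tuples.push([line, col]);
    } else {
      // fields[1] is the source index delta, ignored for a single source.
      srcLine += fields[2];
      srcCol += fields[3];
      if (fields.length >= 5) {
        nameIndex += fields[4];
        tuples.push([line, col, srcLine + 1, srcCol, names[nameIndex]]);
      } else {
        tuples.push([line, col, srcLine + 1, srcCol]);
      }
    }
    fields = [];
  };
  for (const ch of mappings) {
    if (ch === ',' || ch === ';') {
      flush();
      if (ch === ';') { line++; col = 0; } // column deltas reset per line
      continue;
    }
    const digit = BASE64_CHARS.indexOf(ch);
    value += (digit & 31) << shift;
    shift += 5;
    if ((digit & 32) === 0) { // no continuation bit: the number is complete
      fields.push(value & 1 ? -(value >>> 1) : value >>> 1);
      value = 0;
      shift = 0;
    }
  }
  flush();
  return tuples;
}

// Create one decoder per request so each module's map is decoded at most once
// while that request is being served; the cache is dropped with the decoder.
function createRequestScopedDecoder() {
  const cache = new WeakMap();
  return vlqMap => {
    let tuples = cache.get(vlqMap);
    if (tuples === undefined) {
      tuples = decodeVlqMapSketch(vlqMap);
      cache.set(vlqMap, tuples);
    }
    return tuples;
  };
}
```

Because the cache lives only as long as the decoder closure, decoded tuples do not accumulate in the retained module graph between requests, which is the memory property the flag is meant to preserve.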
d51004e to
b3f9840
Compare