Skip to content

perf: specialize Builtin1 and Builtin3 apply paths#807

Merged
stephenamar-db merged 1 commit into
databricks:masterfrom
He-Pin:perf/builtin-apply-overrides
Apr 30, 2026
Merged

perf: specialize Builtin1 and Builtin3 apply paths#807
stephenamar-db merged 1 commit into
databricks:masterfrom
He-Pin:perf/builtin-apply-overrides

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented Apr 30, 2026

Split out from #763.

Motivation:

Reduce allocation and dispatch overhead when one- and three-argument builtins are called through the dynamic Val.Func.apply1 / apply3 path.

Key Design Decision:

Keep the optimization local and semantics-preserving. Builtin2 already has an exact-arity apply2 override; this adds matching Builtin1.apply1 and Builtin3.apply3 overrides. Exact positional calls directly invoke the structured evalRhs overload and skip constructing an intermediate Array. Non-exact paths still fall back to the generic parent application path.

Correctness:

  • The direct path matches the existing Builtin1.apply / Builtin3.apply exact positional behavior: force the supplied Eval values, then call the typed evalRhs.
  • Named arguments, missing defaults, too many arguments, and other non-exact calls still use the generic function application logic.
  • Static Expr.ApplyBuiltin1 / Expr.ApplyBuiltin3 paths are unchanged; this only helps dynamic builtin calls such as a builtin stored in a local or returned from another function.

Modification:

  • Add Builtin1.apply1.
  • Add Builtin3.apply3.

Validation:

  • ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].test' (141/141, SUCCESS)
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
  • ./mill --no-server '_.jvm[_].__.test' (1104/1104, SUCCESS)
  • Dynamic builtin smoke checks:
    • local f = std.length; f([1, 2, 3]) -> 3
    • local f = std.substr; f("abcdef", 1, 3) -> "bcd"
    • local f = std.substr; f(str="abcdef", from=1, len=3) -> "bcd"

Hyperfine:

Toolchain:

  • hyperfine 1.20.0
  • --warmup 3 --min-runs 25 for targeted dynamic builtin benchmarks
  • --warmup 3 --min-runs 20 for realistic2
  • JVM assemblies built with ./mill --no-server show 'sjsonnet.jvm[3.3.7].assembly'
  • Base: upstream/master at c04fc804
  • Branch: 2067d8b5

Targeted Builtin1 dynamic call benchmark:

local identity(x) = x;
local len = identity(std.length);
std.foldl(
  function(acc, i) acc + len("abcdef"),
  std.range(1, 5000000),
  0
)
Command Mean [ms] Min [ms] Max [ms] Relative
master builtin1_dynamic 649.8 +/- 48.3 557.4 726.8 1.00
branch builtin1_dynamic 661.7 +/- 41.0 606.1 747.3 1.02 +/- 0.10

Result: statistically neutral in this hyperfine run.

Targeted Builtin3 dynamic call benchmark:

local identity(x) = x;
local substr = identity(std.substr);
std.foldl(
  function(acc, i) acc + std.length(substr("abcdef", 1, 3)),
  std.range(1, 3000000),
  0
)
Command Mean [ms] Min [ms] Max [ms] Relative
master builtin3_dynamic 742.5 +/- 156.1 594.1 1254.7 1.12 +/- 0.30
branch builtin3_dynamic 660.4 +/- 110.9 534.0 962.6 1.00

Result: branch was faster in this run, but variance is high.

End-to-end realistic2:

Command Mean [ms] Min [ms] Max [ms] Relative
master realistic2 544.9 +/- 95.3 414.1 706.5 1.27 +/- 0.27
branch realistic2 428.4 +/- 54.1 378.3 565.8 1.00

Result: branch was faster in this run; due JVM-startup and system noise, treat this as a non-regression signal rather than a guaranteed 1.27x speedup.

Motivation:
Reduce allocation and dispatch overhead when one- and three-argument builtins are called through the dynamic function apply path.

Modification:
Add Builtin1.apply1 and Builtin3.apply3 overrides that directly call their structured evalRhs methods for exact positional arity, matching the existing Builtin2.apply2 specialization and falling back to the generic parent path otherwise.

Result:
Dynamic builtin calls avoid constructing temporary argument arrays on the exact-arity path. JVM tests and targeted hyperfine comparisons pass.
@stephenamar-db stephenamar-db merged commit 192bb33 into databricks:master Apr 30, 2026
5 checks passed
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 30, 2026
Motivation:
Reduce allocation overhead in common numeric rendering paths.

Modification:
1. RenderUtils.renderDouble reuses pre-cached string representations for exact integer doubles in the range 0-255.
2. Materializer.stringify delegates number stringification to RenderUtils.renderDouble, removing its duplicate integer fast path.

Result:
Numeric materialization uses the shared renderDouble fast path. The Builtin1.apply1 / Builtin3.apply3 specialization from the original PR is already present in current master via databricks#807, so it is no longer part of this PR diff.
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request May 8, 2026
Motivation:
Reduce allocation overhead in common numeric rendering paths.

Modification:
1. RenderUtils.renderDouble reuses pre-cached string representations for exact integer doubles in the range 0-255.
2. Materializer.stringify delegates number stringification to RenderUtils.renderDouble, removing its duplicate integer fast path.

Result:
Numeric materialization uses the shared renderDouble fast path. The Builtin1.apply1 / Builtin3.apply3 specialization from the original PR is already present in current master via databricks#807, so it is no longer part of this PR diff.
stephenamar-db pushed a commit that referenced this pull request May 9, 2026
Motivation:

Reduce allocation overhead in common numeric rendering paths.

Key Design Decision:

Keep this PR focused on the `renderDouble` optimization. The original
`Builtin1.apply1` / `Builtin3.apply3` specialization has already landed
in current `master` via #807, so it is intentionally no longer part of
this PR diff after the rebase.

Modification:

- `Materializer` delegates number stringification to
`RenderUtils.renderDouble`.
- `RenderUtils.renderDouble` reuses a small integer string cache for
exact integer doubles in the range `0` to `255`.

Benchmark Results:

Rebased onto latest `upstream/master` at `8b67cb1e`.

JMH, JVM harness, lower is better. Normal/noise-only rows are
intentionally omitted.

| Benchmark | master ms/op | PR ms/op | Delta |
| --- | ---: | ---: | ---: |
| `realistic2` | 47.484 | 41.848 | -11.9% |

Scala Native hyperfine against latest source-built jrsonnet master
`5b43fa8` (`jrsonnet 0.5.0-pre98`), lower is better:

| Benchmark | master-native | PR-native | jrsonnet-source | Result |
| --- | ---: | ---: | ---: | --- |
| `realistic2` | 82.2 +/- 1.5 ms | 85.4 +/- 6.1 ms | 143.7 +/- 4.0 ms |
Native neutral/noisy vs master; PR-native is 1.68x faster than latest
jrsonnet |

Analysis:

The current-master refresh leaves one clear JVM signal: `realistic2`
improves by about 12% through the shared numeric rendering path. Native
CLI is not a standout improvement over master on this PR, but latest
jrsonnet is substantially slower on the same `realistic2` case.

Verification:

- `./mill -i __.checkFormat`
- `./mill -i 'sjsonnet.jvm[3.3.7].test'`
- `./mill -i bench.runRegressions
bench/resources/cpp_suite/large_string_join.jsonnet
bench/resources/cpp_suite/large_string_template.jsonnet
bench/resources/cpp_suite/realistic2.jsonnet
bench/resources/go_suite/parseInt.jsonnet`
- `./mill -i 'sjsonnet.native[3.3.7].nativeLink'`
- `hyperfine --warmup 10 --min-runs 50`

References:

- PR branch: `perf/cached-render-double`
- Base: `upstream/master` at `8b67cb1eeffa764f2a7298658e5473d8402a8da1`
- Head: `b21352970b5f19e8f929a259739d88e69df31627`
- Source-built jrsonnet: `5b43fa88b8c43856dd5a2daa9c5c251153c5e14d`

Result:

Ready. The PR now contains only the cached `renderDouble` work; the
Builtin1/3 apply override work is already covered by #807 on `master`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants