perf: specialize Builtin1 and Builtin3 apply paths by He-Pin · Pull Request #807 · databricks/sjsonnet

He-Pin · 2026-04-30T04:02:44Z

Split out from #763.

Motivation:

Reduce allocation and dispatch overhead when one- and three-argument builtins are called through the dynamic Val.Func.apply1 / apply3 path.

Key Design Decision:

Keep the optimization local and semantics-preserving. Builtin2 already has an exact-arity apply2 override; this adds matching Builtin1.apply1 and Builtin3.apply3 overrides. Exact positional calls directly invoke the structured evalRhs overload and skip constructing an intermediate Array. Non-exact paths still fall back to the generic parent application path.

Correctness:

The direct path matches the existing Builtin1.apply / Builtin3.apply exact positional behavior: force the supplied Eval values, then call the typed evalRhs.
Named arguments, missing defaults, too many arguments, and other non-exact calls still use the generic function application logic.
Static Expr.ApplyBuiltin1 / Expr.ApplyBuiltin3 paths are unchanged; this only helps dynamic builtin calls such as a builtin stored in a local or returned from another function.
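
The exact-arity override described above can be sketched with a simplified model. This is an illustration only, not the code in this PR: `Eval`, `Lazy`, `Func`, the `Int`-valued `evalRhs`, and the `arraysAllocated` counter are hypothetical stand-ins, and the real sjsonnet signatures carry extra parameters (positions, evaluator context) that are omitted here.

```scala
object ApplySketch {
  // Hypothetical stand-in for a lazily forced argument.
  trait Eval { def value: Int }
  final class Lazy(compute: () => Int) extends Eval {
    lazy val value: Int = compute()
  }

  // Demonstration-only counter for intermediate argument arrays.
  var arraysAllocated = 0

  abstract class Func(val paramNames: Array[String]) {
    // Generic path: arguments arrive packed in an array, optionally named.
    def apply(args: Array[Eval], namedNames: Array[String]): Int

    // Default one-argument entry point: wraps the argument and delegates.
    def apply1(a: Eval): Int = {
      arraysAllocated += 1
      apply(Array(a), null)
    }
  }

  abstract class Builtin1(paramName: String) extends Func(Array(paramName)) {
    // Typed implementation of the builtin.
    def evalRhs(a: Int): Int

    def apply(args: Array[Eval], namedNames: Array[String]): Int = {
      if (args.length != 1)
        throw new IllegalArgumentException(s"expected 1 argument, got ${args.length}")
      if (namedNames != null && !(namedNames.length == 1 && namedNames(0) == paramNames(0)))
        throw new IllegalArgumentException("unknown named argument")
      evalRhs(args(0).value)
    }

    // Exact positional call: force the argument and call evalRhs directly,
    // skipping the array; anything else goes back to the parent path.
    override def apply1(a: Eval): Int =
      if (paramNames.length == 1) evalRhs(a.value) else super.apply1(a)
  }

  def main(args: Array[String]): Unit = {
    val twice = new Builtin1("x") { def evalRhs(a: Int): Int = a * 2 }
    println(twice.apply1(new Lazy(() => 21))) // prints 42
    println(arraysAllocated)                  // prints 0
  }
}
```

In this model the named-argument call still goes through `apply`, which mirrors the statement that non-exact calls keep the generic application logic.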

Modification:

Add Builtin1.apply1.
Add Builtin3.apply3.

Validation:

./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
./mill --no-server 'sjsonnet.jvm[3.3.7].test' (141/141, SUCCESS)
./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
./mill --no-server '_.jvm[_].__.test' (1104/1104, SUCCESS)
Dynamic builtin smoke checks:
- local f = std.length; f([1, 2, 3]) -> 3
- local f = std.substr; f("abcdef", 1, 3) -> "bcd"
- local f = std.substr; f(str="abcdef", from=1, len=3) -> "bcd"

Hyperfine:

Toolchain:

hyperfine 1.20.0
--warmup 3 --min-runs 25 for targeted dynamic builtin benchmarks
--warmup 3 --min-runs 20 for realistic2
JVM assemblies built with ./mill --no-server show 'sjsonnet.jvm[3.3.7].assembly'
Base: upstream/master at c04fc804
Branch: 2067d8b5

Targeted Builtin1 dynamic call benchmark:

local identity(x) = x;
local len = identity(std.length);
std.foldl(
  function(acc, i) acc + len("abcdef"),
  std.range(1, 5000000),
  0
)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`master builtin1_dynamic`	649.8 +/- 48.3	557.4	726.8	1.00
`branch builtin1_dynamic`	661.7 +/- 41.0	606.1	747.3	1.02 +/- 0.10

Result: statistically neutral in this hyperfine run.

Targeted Builtin3 dynamic call benchmark:

local identity(x) = x;
local substr = identity(std.substr);
std.foldl(
  function(acc, i) acc + std.length(substr("abcdef", 1, 3)),
  std.range(1, 3000000),
  0
)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`master builtin3_dynamic`	742.5 +/- 156.1	594.1	1254.7	1.12 +/- 0.30
`branch builtin3_dynamic`	660.4 +/- 110.9	534.0	962.6	1.00

Result: branch was faster in this run, but variance is high.

End-to-end realistic2:

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`master realistic2`	544.9 +/- 95.3	414.1	706.5	1.27 +/- 0.27
`branch realistic2`	428.4 +/- 54.1	378.3	565.8	1.00

Result: branch was faster in this run; due to JVM startup and system noise, treat this as a non-regression signal rather than a guaranteed 1.27x speedup.
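
To make the width of the uncertainty concrete: the Relative column in the tables above is consistent with first-order error propagation on the ratio of the two means. The sketch below reproduces the realistic2 row under that assumption; `Timing` and `relative` are illustrative names, not part of hyperfine or sjsonnet.

```scala
object RelativeSpeed {
  final case class Timing(mean: Double, stddev: Double)

  // Ratio of the slower mean to the faster mean, with the uncertainty
  // propagated from both standard deviations (first-order approximation).
  def relative(slower: Timing, faster: Timing): (Double, Double) = {
    val ratio = slower.mean / faster.mean
    val relErr = math.sqrt(
      math.pow(slower.stddev / slower.mean, 2) +
        math.pow(faster.stddev / faster.mean, 2)
    )
    (ratio, ratio * relErr)
  }

  def main(args: Array[String]): Unit = {
    val master = Timing(544.9, 95.3) // realistic2 on master, ms
    val branch = Timing(428.4, 54.1) // realistic2 on branch, ms
    val (ratio, err) = relative(master, branch)
    // prints "1.27 +/- 0.27"
    println("%.2f +/- %.2f".formatLocal(java.util.Locale.ROOT, ratio, err))
  }
}
```

An interval of 1.27 +/- 0.27 reaches down to 1.00, which is why the result is read as a non-regression signal rather than a measured speedup.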

Motivation: Reduce allocation and dispatch overhead when one- and three-argument builtins are called through the dynamic function apply path.
Modification: Add Builtin1.apply1 and Builtin3.apply3 overrides that directly call their structured evalRhs methods for exact positional arity, matching the existing Builtin2.apply2 specialization and falling back to the generic parent path otherwise.
Result: Dynamic builtin calls avoid constructing temporary argument arrays on the exact-arity path. JVM tests and targeted hyperfine comparisons pass.

Motivation: Reduce allocation overhead in common numeric rendering paths.
Modification:
1. RenderUtils.renderDouble reuses pre-cached string representations for exact integer doubles in the range 0-255.
2. Materializer.stringify delegates number stringification to RenderUtils.renderDouble, removing its duplicate integer fast path.
Result: Numeric materialization uses the shared renderDouble fast path. The Builtin1.apply1 / Builtin3.apply3 specialization from the original PR is already present in current master via databricks#807, so it is no longer part of this PR diff.

Motivation:

Reduce allocation overhead in common numeric rendering paths.

Key Design Decision:

Keep this PR focused on the `renderDouble` optimization. The original `Builtin1.apply1` / `Builtin3.apply3` specialization has already landed in current `master` via #807, so it is intentionally no longer part of this PR diff after the rebase.

Modification:

- `Materializer` delegates number stringification to `RenderUtils.renderDouble`.
- `RenderUtils.renderDouble` reuses a small integer string cache for exact integer doubles in the range `0` to `255`.

Benchmark Results:

Rebased onto latest `upstream/master` at `8b67cb1e`.

JMH, JVM harness, lower is better. Normal/noise-only rows are intentionally omitted.

| Benchmark | master ms/op | PR ms/op | Delta |
| --- | ---: | ---: | ---: |
| `realistic2` | 47.484 | 41.848 | -11.9% |

Scala Native hyperfine against latest source-built jrsonnet master `5b43fa8` (`jrsonnet 0.5.0-pre98`), lower is better:

| Benchmark | master-native | PR-native | jrsonnet-source | Result |
| --- | ---: | ---: | ---: | --- |
| `realistic2` | 82.2 +/- 1.5 ms | 85.4 +/- 6.1 ms | 143.7 +/- 4.0 ms | Native neutral/noisy vs master; PR-native is 1.68x faster than latest jrsonnet |

Analysis:

The current-master refresh leaves one clear JVM signal: `realistic2` improves by about 12% through the shared numeric rendering path. Native CLI is not a standout improvement over master on this PR, but latest jrsonnet is substantially slower on the same `realistic2` case.

Verification:

- `./mill -i __.checkFormat`
- `./mill -i 'sjsonnet.jvm[3.3.7].test'`
- `./mill -i bench.runRegressions bench/resources/cpp_suite/large_string_join.jsonnet bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/realistic2.jsonnet bench/resources/go_suite/parseInt.jsonnet`
- `./mill -i 'sjsonnet.native[3.3.7].nativeLink'`
- `hyperfine --warmup 10 --min-runs 50`

References:

- PR branch: `perf/cached-render-double`
- Base: `upstream/master` at `8b67cb1eeffa764f2a7298658e5473d8402a8da1`
- Head: `b21352970b5f19e8f929a259739d88e69df31627`
- Source-built jrsonnet: `5b43fa88b8c43856dd5a2daa9c5c251153c5e14d`

Result:

Ready. The PR now contains only the cached `renderDouble` work; the Builtin1/3 apply override work is already covered by #807 on `master`.
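
The small-integer cache mentioned for #763 can be illustrated roughly as follows. This is a hedged sketch, not the `RenderUtils` code: the object name, the negative-zero exclusion, and the `toString` fallback are assumptions made for the example, and the real fallback formatting in sjsonnet may differ.

```scala
object RenderSketch {
  // Strings for 0..255, rendered once and shared by every call.
  private val smallInts: Array[String] = Array.tabulate(256)(_.toString)

  def renderDouble(d: Double): String = {
    val i = d.toInt
    // Fast path: d is an exact integer in [0, 255]. Double.compare treats
    // -0.0 as less than 0.0, so negative zero is excluded (an assumption
    // of this sketch), and NaN fails the i == d check.
    if (i == d && i < 256 && java.lang.Double.compare(d, 0.0) >= 0) smallInts(i)
    else d.toString // placeholder for the general formatting path
  }

  def main(args: Array[String]): Unit = {
    println(renderDouble(42.0))                       // prints 42
    println(renderDouble(7.0) eq renderDouble(7.0))   // prints true (same cached instance)
    println(renderDouble(1.5))                        // prints 1.5
  }
}
```

The point of the cache is that repeated small integers return an existing `String` instead of allocating a new one on every render.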

stephenamar-db merged commit 192bb33 into databricks:master Apr 30, 2026
5 checks passed

He-Pin mentioned this pull request Apr 30, 2026

perf: cache renderDouble for small integers #763

Merged

stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/builtin-apply-overrides
