Cut compile time: de-inline setup helpers + add a small precompile workload#45
Open
lkdvos wants to merge 7 commits into
`permutedims`, `sreshape`, the SliceIndex `getindex`/`sview` view constructors, and the `_computeviewsize`/`_computeviewstrides`/`_computeviewoffset` helpers are all "once-per-operation" setup steps, not hot inner-loop code. Forcing `@inline` on them duplicated their per-N size/stride/offset/permute computation into every downstream specialization and re-inferred it per shape, bloating compile times. Dropping `@inline` lets each compile once per signature and dedup across callers.

The hot indexing path is deliberately left inlined: scalar `getindex`/`setindex!` and `_computeind` keep `@inline`, as does the trivial `_normalizeparent` accessor.

Measured (Julia 1.12.6):
- Downstream TensorOperations dynamic-ncon grid: TTFX 42.1s -> 31.6s (-25%) from de-inlining `permutedims`/`sreshape`, with no runtime regression (StridedBLAS vs BaseCopy results agree to 3e-16).
- StridedViews-local A/B vs origin/main: view construction 4.35ns -> 4.35ns and the scalar `getindex` hot loop 20.22us -> 20.24us, i.e. steady-state runtime unchanged for the additionally de-inlined view helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
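To illustrate the split between setup helpers and the hot indexing path, here is a minimal sketch. `ToyView`, `toy_permutedims`, and `toy_computeind` are hypothetical names invented for this example; they are not the StridedViews implementation, only a toy with the same shape of problem.

```julia
# Hypothetical toy view type; NOT the StridedViews implementation.
struct ToyView{T,N}
    parent::Vector{T}
    size::NTuple{N,Int}
    strides::NTuple{N,Int}
    offset::Int
end

# Setup helper: runs once per operation, so it carries no @inline annotation
# and the compiler decides for itself whether inlining is worthwhile.
function toy_permutedims(v::ToyView{T,N}, p::NTuple{N,Int}) where {T,N}
    newsize = ntuple(i -> v.size[p[i]], Val(N))
    newstrides = ntuple(i -> v.strides[p[i]], Val(N))
    return ToyView{T,N}(v.parent, newsize, newstrides, v.offset)
end

# Hot path: the linear-index computation runs once per element access,
# so this sketch keeps @inline here.
@inline function toy_computeind(I::NTuple{N,Int}, strides::NTuple{N,Int}) where {N}
    ind = 0
    for i in 1:N
        ind += (I[i] - 1) * strides[i]
    end
    return ind
end

@inline function Base.getindex(v::ToyView{T,N}, I::Vararg{Int,N}) where {T,N}
    return v.parent[v.offset + 1 + toy_computeind(I, v.strides)]
end

# Usage: a column-major 2x3 view and its transpose-like permutation.
data = collect(1.0:6.0)
v = ToyView{Float64,2}(data, (2, 3), (1, 2), 0)
vt = toy_permutedims(v, (2, 1))
```

In this sketch the permutation only rearranges the size and stride tuples, so `vt[j, i]` reads the same element as `v[i, j]` without copying `data`.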
Warm the core `StridedView` specializations for the BLAS element types (`Float32`, `Float64`, `ComplexF32`, `ComplexF64`) over ndims 1:4 plus the 2D transpose/adjoint cases: construction, `permutedims`, `sreshape`, `sview`/slice `getindex`, `conj`, `transpose`/`adjoint`, and `size`/`strides`/`offset`. These are exactly the specializations downstream packages hit on their first call, so caching them removes that first-call latency.

The workload is intentionally kept small (BLAS floats, ndims 1:4, identity/conj plus the 2D wrappers) to keep StridedViews' own precompile bounded.

Measured (Julia 1.12.6, cold compiled-cache depot):
- StridedViews cold precompile: ~0.53s -> ~2.29s (Pkg build line), i.e. ~+1.76s one-time, bounded.
- First-call latency of the exercised core ops in a fresh process: ~1.78s -> ~0.027s (~66x), the inference cost being moved into the cached precompile.

Bumps version to 0.5.2 and adds PrecompileTools to [deps]/[compat].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
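The idea of warming one specialization per (element type, ndims) pair can be sketched with Base alone. Note that this sketch swaps in Base's `precompile` function rather than the PrecompileTools `@compile_workload` block that the commit adds, and `toy_setup` is a hypothetical stand-in for the view operations being warmed.

```julia
# Hypothetical setup helper standing in for the view operations the PR warms:
# returns the permuted sizes and strides of a dense array.
function toy_setup(a::Array{T,N}, p::NTuple{N,Int}) where {T,N}
    return (map(i -> size(a, i), p), map(i -> stride(a, i), p))
end

# The four BLAS element types named in the commit message.
const BLAS_ELTYPES = (Float32, Float64, ComplexF32, ComplexF64)

# Request compilation of one specialization per (eltype, ndims) pair.
# `precompile` returns a Bool for each request; `results` is a 4x4 matrix.
results = [precompile(toy_setup, (Array{T,N}, NTuple{N,Int}))
           for T in BLAS_ELTYPES, N in 1:4]

# Usage: the helper itself, called on a 2x3 array with permutation (2, 1).
sizes, strides_ = toy_setup(zeros(2, 3), (2, 1))
```

Keeping the grid to four element types and four ranks bounds the work at 16 signatures, which mirrors the commit's reasoning for keeping the workload small.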
Jutho
reviewed
Jun 17, 2026
      return getindex(StridedView(a), I...)
  end
- @inline function sview(a::AbstractArray, I::SliceIndex)
+ function sview(a::AbstractArray, I::SliceIndex)
Member
I indeed have no idea why any of the above `@inline`s were here; this didn't make any sense.
Member
I'll approve after you've finished fighting JET.
This PR removes some of the forced `@inline` annotations on `sreshape` and `permutedims`, which, combined with the `@inline` calls on the recursive `_compute*` functions for the strides and sizes, meant that these methods had to be compiled in quite a lot of the TensorOperations kernels, for each different combination of `T`, `N`, ....

This PR just removes the annotation, allowing the compiler to decide when to inline, which seems to have quite a large impact on the actual compilation time in TensorOperations calls.

On top of that, since these functions are not inlined, it now makes sense to add a precompilation workload as well, in an attempt to remove some of the TTFX as well as the precompile-time burden in TensorOperations.

From what I can measure, it seems to reduce TTFX by about 25% on a workload in TensorOperations where I exhaustively perform all binary contractions up to `N1,N2,N3 <= 3` (open, contracted, open legs).

On my machine, precompilation time is on the order of ~2 seconds, so this hurts very little.
The runtime cost at the TensorOperations level is negligible, so I'd say to just merge and release this.
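For readers who want a feel for the size of that contraction grid and for how first-call latency shows up, here is a rough sketch. It assumes each leg count ranges over 0:3 (the description only gives the upper bound) and that the two inputs and the output have `N1+N2`, `N2+N3`, and `N1+N3` dimensions respectively. `contraction_ndims` and `first_and_second_call` are hypothetical helpers, not part of TensorOperations or of the benchmark used in this PR.

```julia
# Assumed grid: every (N1, N2, N3) with each count in 0:3
# (open legs of A, contracted legs, open legs of B). The lower bound is a guess.
grid = [(N1, N2, N3) for N1 in 0:3, N2 in 0:3, N3 in 0:3]

# Hypothetical helper: ndims of the two inputs and the output for one grid point.
contraction_ndims(t::NTuple{3,Int}) = (t[1] + t[2], t[2] + t[3], t[1] + t[3])

# Hypothetical helper: time the first call (which may include compilation)
# and the second call (steady state) of f(args...).
function first_and_second_call(f, args...)
    t_first = @elapsed f(args...)
    t_second = @elapsed f(args...)
    return (t_first, t_second)
end

# Usage: count the grid points and time a fresh anonymous function.
npoints = length(grid)
t1, t2 = first_and_second_call(x -> permutedims(x, (2, 1)), rand(3, 3))
```

Each grid point can trigger its own set of specializations, which is why per-signature compile cost in the view helpers multiplies across a workload like this.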
Precompile-time comparison: TensorOperations suite, `main` vs this PR

Cold-precompiling the TensorOperations precompile workload (enabled, fixed grid `precompile_contract_ndims=[3,2]`, `precompile_add_ndims=3`, `precompile_trace_ndims=[3,2]`, eltypes `[Float64, ComplexF64]`), back-to-back on Julia 1.12.6:

(Table comparing `main` with this PR including the added `@compile_workload`; the values were not preserved.)

The de-inlining stops TensorOperations' contraction specializations from re-inferring the StridedView stride/permute helpers, so its precompile suite is ~25% cheaper. The added cost lives in StridedViews' own precompile (+3.0 s, one-time per build, shared by all downstream), for a net ~19% faster cold precompile of the whole environment.