Cut compile time: de-inline setup helpers + add a small precompile workload by lkdvos · Pull Request #45 · QuantumKitHub/StridedViews.jl

lkdvos · 2026-06-17T15:12:42Z

This PR removes some of the forced `@inline` annotations on `sreshape` and `permutedims`. Combined with the `@inline` annotations on the recursive `_compute*` helpers for the strides and sizes, these meant that the methods had to be compiled into quite a lot of the TensorOperations kernels, once for each different combination of T,N,....
Removing the annotation lets the compiler decide when to inline, which seems to have quite a large impact on the actual compilation time in TensorOperations calls.
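
As a rough illustration of the difference (a minimal sketch with hypothetical names, not the StridedViews source), the snippet below defines the same setup helper twice: once with a forced `@inline` and once without any annotation. Both return the same result; the only difference is whether the compiler is required to copy the body into every caller specialization. The short recursive tuple helper keeps `@inline` here, on the assumption that unrolling such recursions is still worthwhile.

```julia
# Hypothetical stand-ins for a per-N setup helper; not the StridedViews code.

# Recursive tuple helper: permute the entries of `t` according to `p`.
# Kept @inline so the recursion can unroll for each tuple length.
@inline _permute(t::Tuple, p::Tuple{}) = ()
@inline _permute(t::Tuple, p::Tuple) = (t[first(p)], _permute(t, Base.tail(p))...)

# Forced inlining: the body is copied into every caller specialization.
@inline function permute_strides_inlined(strides::NTuple{N,Int}, p::NTuple{N,Int}) where {N}
    return _permute(strides, p)
end

# No annotation: the compiler decides, and a standalone specialization
# per signature can be shared between callers.
function permute_strides(strides::NTuple{N,Int}, p::NTuple{N,Int}) where {N}
    return _permute(strides, p)
end

println(permute_strides((1, 2, 6), (3, 1, 2)))
```

Since this is a once-per-operation setup step, any extra call overhead of the non-inlined version is paid once per operation rather than once per element.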

On top of that, since these functions are no longer inlined, it now makes sense to add a precompilation workload as well, in an attempt to reduce both the TTFX and the precompile-time burden in TensorOperations.

From what I can measure, this reduces the TTFX by about 25% on a TensorOperations workload where I exhaustively perform all binary contractions with N1, N2, N3 <= 3 (open, contracted, open legs).
On my machine, the added precompilation time is on the order of 2 seconds, so this hurts very little.
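
For reference, a crude way to see this kind of first-call cost within a single session is to time the first and second call of the same function. This is only a sketch: `workload` is a hypothetical stand-in for the contraction grid, and a real TTFX measurement should be run in a fresh Julia process.

```julia
# Hypothetical stand-in for a TensorOperations contraction; uses Base only.
workload(x) = sum(abs2, permutedims(x, (2, 1)))

x = rand(ComplexF64, 4, 4)

# The first call pays for inference and code generation of `workload`
# for this argument type; the second call reuses the compiled code.
t_first = @elapsed workload(x)
t_second = @elapsed workload(x)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```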

The runtime cost at the TensorOperations level is negligible, so I'd say to just merge and release this.

Precompile-time comparison: TensorOperations suite, `main` vs this PR

Cold-precompiling the TensorOperations precompile workload (enabled, fixed grid
precompile_contract_ndims=[3,2], precompile_add_ndims=3, precompile_trace_ndims=[3,2],
eltypes [Float64, ComplexF64]), back-to-back on Julia 1.12.6:

| precompile of… | StridedViews `main` | StridedViews (this PR) | Δ |
|---|---|---|---|
| TensorOperations (the workload suite) | 90.9 s | 68.0 s | −22.9 s (−25%) |
| Strided | 1.72 s | 1.90 s | +0.2 s (noise) |
| StridedViews itself | 0.59 s | 3.59 s | +3.0 s (this PR's `@compile_workload`) |
| whole environment (cold) | 93.3 s | 75.4 s | −17.9 s (−19%) |

The de-inlining stops TensorOperations' contraction specializations from re-inferring the
StridedView stride/permute helpers, so its precompile suite is ~25% cheaper. The added cost
lives in StridedViews' own precompile (+3.0 s, one-time per build, shared by all downstream),
for a net ~19% faster cold precompile of the whole environment.

`permutedims`, `sreshape`, the SliceIndex `getindex`/`sview` view constructors, and the `_computeviewsize`/`_computeviewstrides`/`_computeviewoffset` helpers are all "once-per-operation" setup steps, not hot inner-loop code. Forcing `@inline` on them duplicated their per-N size/stride/offset/permute computation into every downstream specialization and re-inferred it per shape, bloating compile times. Dropping `@inline` lets each compile once per signature and dedup across callers.

The hot indexing path is deliberately left inlined: scalar `getindex`/`setindex!` and `_computeind` keep `@inline`, as does the trivial `_normalizeparent` accessor.

Measured (Julia 1.12.6):
- Downstream TensorOperations dynamic-ncon grid: TTFX 42.1s -> 31.6s (-25%) from de-inlining permutedims/sreshape, with no runtime regression (StridedBLAS vs BaseCopy results agree to 3e-16).
- StridedViews-local A/B vs origin/main: view construction 4.35ns -> 4.35ns and the scalar getindex hot loop 20.22us -> 20.24us, i.e. steady-state runtime unchanged for the additionally de-inlined view helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Warm the core `StridedView` specializations for the BLAS element types (`Float32`, `Float64`, `ComplexF32`, `ComplexF64`) over ndims 1:4 plus the 2D transpose/adjoint cases: construction, `permutedims`, `sreshape`, `sview`/slice `getindex`, `conj`, `transpose`/`adjoint`, and `size`/`strides`/`offset`. These are exactly the specializations downstream packages hit on their first call, so caching them removes that first-call latency.

The workload is intentionally kept small (BLAS floats, ndims 1:4, identity/conj plus the 2D wrappers) to keep StridedViews' own precompile bounded.

Measured (Julia 1.12.6, cold compiled-cache depot):
- StridedViews cold precompile: ~0.53s -> ~2.29s (Pkg build line), i.e. ~+1.76s one-time, bounded.
- First-call latency of the exercised core ops in a fresh process: ~1.78s -> ~0.027s (~66x), the inference cost being moved into the cached precompile.

Bumps version to 0.5.2 and adds PrecompileTools to [deps]/[compat].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
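
A workload of the kind described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of `src/precompile.jl`: the helper name `exercise_core_ops`, the 2-per-dimension array size, and the chosen permutation are all made up for illustration. It assumes PrecompileTools and StridedViews are installed, and the workload only has an effect when the code lives in a package that is being precompiled; run as a plain script, the `@setup_workload` block is skipped.

```julia
using PrecompileTools: @setup_workload, @compile_workload
using StridedViews

# Hypothetical helper (not from the PR): run the core operations once for
# element type T and dimensionality N so their specializations get compiled.
function exercise_core_ops(::Type{T}, N::Int) where {T}
    a = zeros(T, ntuple(_ -> 2, N))
    sv = StridedView(a)
    permutedims(sv, ntuple(i -> N + 1 - i, N))  # reverse the dimensions
    sreshape(sv, (length(a),))                   # flatten the contiguous view
    sview(a, ntuple(_ -> :, N)...)               # full-slice view
    conj(sv)
    if N == 2
        transpose(sv)                            # 2D wrappers only
        adjoint(sv)
    end
    return size(sv), strides(sv), offset(sv)
end

@setup_workload begin
    # Setup code runs during precompilation but is not itself cached.
    eltypes = (Float32, Float64, ComplexF32, ComplexF64)
    @compile_workload begin
        # Everything compiled while running this block is stored in the
        # package's precompile cache.
        for T in eltypes, N in 1:4
            exercise_core_ops(T, N)
        end
    end
end
```

Keeping the grid to four element types and four dimensionalities is what bounds the one-time precompile cost; widening either axis multiplies the number of cached specializations.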

Jutho · 2026-06-17T18:07:04Z

    return getindex(StridedView(a), I...)
 end
-@inline function sview(a::AbstractArray, I::SliceIndex)
+function sview(a::AbstractArray, I::SliceIndex)


I indeed have no idea why any of the above `@inline`s were here; they didn't make any sense.

Jutho · 2026-06-17T18:08:37Z

I'll approve after you've finished fighting JET.

codecov · 2026-06-17T18:45:03Z

Codecov Report

❌ Patch coverage is 93.75000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/stridedview.jl | 60.00% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/StridedViews.jl | `100.00% <ø> (ø)` | |
| src/precompile.jl | `100.00% <100.00%> (ø)` | |
| src/stridedview.jl | `28.20% <60.00%> (+14.74%)` | ⬆️ |

... and 1 file with indirect coverage changes


lkdvos and others added 6 commits June 17, 2026 11:11

restore inline for recursive functions

565c833

increase precompile workload

6937561

remove slop

c798328

bump precompiletools version

4c04cf9

lkdvos force-pushed the ld-compile-time branch from 34adf20 to 4c04cf9 Compare June 17, 2026 15:47

attempt to fix JET

879f83c

Jutho reviewed Jun 17, 2026

lkdvos wants to merge 7 commits into main from ld-compile-time
