Switch to CompilerCaching.jl #794
That's unfortunate and will likely mean that we will have to maintain the prior version of GPUCompiler until there is a new LTS.
I'm not convinced that's needed. As long as there are no breaking releases in the back-ends, users can simply use different versions of, say, CUDA.jl 6.x depending on which Julia version they're using. And for critical features I'd rather they request backports over there than have us maintain multiple compiler stacks.
The issue is that for Enzyme we won't be able to drop 1.10 support (due to DiffEq and so forth) and there the situation is less stable than for the GPU backends.
Grmbl. I'll try to come up with something that keeps 1.10 working then.
93feac5 to a61b57b
Drop the hand-rolled CodeCache, on-disk kernel cache, and various pre-1.11 compatibility shims; route inference and CI lookup through CompilerCaching.CacheView with consumer-defined results structs attached to each CodeInstance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The owner token was previously partitioned only by `typeof(target)`, which collided across target instances that produce different IR (e.g. different macOS versions, SM architectures, or CPU features). Folding the full target and params into the token matches the spirit of `runtime_slug`, while staying within the inference-determinant scope of the docstring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
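The collision described in this commit can be illustrated with a minimal sketch. `DemoTarget`, `token_by_type`, and `token_by_value` are hypothetical names made up for this example; they are not GPUCompiler's types or its actual token construction.

```julia
# Hypothetical stand-in for a back-end target; not GPUCompiler's actual type.
struct DemoTarget
    arch::String   # e.g. an SM architecture or a CPU feature string
end

# Old scheme, as described above: partition only by the target's type.
token_by_type(target) = typeof(target)

# New scheme: fold the full target and params into the token.
token_by_value(target, params) = (target, params)

a = DemoTarget("sm_70")
b = DemoTarget("sm_90")

# Two targets that would produce different IR share one key under the old scheme...
collides = token_by_type(a) == token_by_type(b)

# ...but stay distinct once the whole value is part of the key.
distinct = token_by_value(a, nothing) != token_by_value(b, nothing)

# With value-based tokens, each target gets its own cache entry.
cache = Dict{Any,String}()
cache[token_by_value(a, nothing)] = "IR for sm_70"
cache[token_by_value(b, nothing)] = "IR for sm_90"
```

Keying on the value only works if the target compares and hashes by content, which holds here because `DemoTarget` is an immutable struct with a `String` field.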
Adds optional `bitcode`/`bitcode!` hooks on the consumer's `results_type`. When opted in, `emit_function!` reads renamed per-function bitcode from the cached CI on a hit, and writes it on a miss. Cross-session persistence rides on package precompilation; a small session-local assembled-module cache (keyed by `(cache_owner, opaque_pointers)`) keeps the within-session fast path. Drops the `runtime_slug` interface; `cache_owner` now subsumes its role of identifying compatible IR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emit_function!` now memoizes each `gpu_*` runtime function's renamed, post-irgen LLVM bitcode on its own `CodeInstance`'s `analysis_results` when the back-end opts in via the new `bitcode`/`bitcode!` trait pair. Cross-session persistence rides on package precompilation; the session-local `_runtime_libs` assembled-module cache keeps repeated within-session linking cheap. Gated on `HAS_INTEGRATED_CACHE` so 1.10 falls through to plain compile + link (still serviced by `_runtime_libs`).
With Metal opted in, a second kernel compile in the same session (even after `reset_runtime()` invalidates `_runtime_libs`) completes ~25× faster than a cold rebuild on 1.12, because each runtime function is now a parse-and-link instead of a full Julia → LLVM run.
Restores the optimization originally landed in aa4e64d (reverted in 566811d for 1.10 compat); the new version sits behind the same infrastructure as `cached_compilation`, so the 1.10 path no longer depends on the trait being callable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On Julia 1.13+, `jl_emit_native_impl` itself sets every `jl_sysimg_gvar` to a null initializer before returning (aotcompile.cpp:865), leaving relocation to the caller. On 1.12, `jl_emit_native_impl` instead bakes session-local pointer values into the initializer via `literal_static_pointer_val`, so without intervention the bitcode we hand to `bitcode!` for caching carries live pointers from the current session and isn't safe to reload in a future one.
After collecting `gv_to_value` from `jl_get_llvm_gvs` / `jl_get_llvm_gvs_globals`, immediately reset each tracked GV's initializer to null. `relocate_gvs!` at the toplevel link step then re-applies the session-current values regardless of which Julia we're on, so optimization still sees the resolved constants.
On 1.13+ the null-out is a no-op (Julia already nulled them); on 1.12 this is what makes per-CI runtime bitcode caching genuinely cross-session-safe for back-ends that pull in `julia.constgv`-touching runtime functions (CUDA's `gc_pool_alloc`, `box_*`/`unbox_*`, …). Metal's stubs don't trip this either way.
Verified: 27 `julia.constgv` GVs in Metal's cached runtime-fn bitcode on 1.12, all with null initializers post-change (was 27/27 non-null).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompilerCaching becomes a strong dep (it loads as an empty shell on 1.10,
so there's no overhead). The `GPUCompilerCompilerCachingExt` extension —
which existed just to wire the parametric `CC.finish!` override that
attaches a `CachedResult{V}` to every inferred CodeInstance — moves into
`jlgen.jl` directly, gated on `HAS_INTEGRATED_CACHE`. Same code, one less
file.
With CompilerCaching available unconditionally we can also drop the inline
copies of its inference machinery from `jlgen.jl`:
- `drive_inference!` on 1.11+ is now a two-line delegation to
  `CompilerCaching.typeinf!` + `get(cache, mi, nothing)`. The 1.10
  implementation (which talks to the per-interpreter `CodeCache`) moves to
  `deprecated.jl`.
- `collect_codeinfos` / `_ci_codeinfo` go away; the single call site in
  `compile_method_instance` calls `CompilerCaching.get_codeinfos` directly.
- `StackedMethodTable` is re-exported from CompilerCaching on 1.11+; the
  1.10 variant (with the older `MethodMatchResult`-shape `findall`) moves
  to `deprecated.jl`.
Net result: ~200 lines deleted from `jlgen.jl`, no behavior change. All
1.10-only code is now in `deprecated.jl`, ready to disappear in one diff
when 1.10 support drops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled `CC.finish!` override with `@setup_results GPUInterpreter` plus a one-line `CompilerCaching.results_type` trait that reads V from the interpreter's type parameter. Same generated code, less boilerplate.
`drive_inference!` collapses to a one-liner calling `CompilerCaching.typeinf!(interp, mi)`, which now constructs the CacheView internally and returns the root CI directly, saving the lookup the old cache-taking form required. The `cache_view(interp)` helper inside `jlgen.jl` goes away with it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rotocol.
The old `bitcode(results)` / `bitcode!(results, bytes)` pair was a single-purpose
hook bolted onto the consumer's results struct. Extending it to memoize more
phases (LLVM IR, intermediate AIR, whatever) meant adding a parallel pair per
phase. Worse, it required the consumer's results struct to be in the loop on
every cache touch — which forced `rtlib.jl` to know about `CompilerCaching` to
do the per-CI lookup that fetched the right results instance.
Replace with a back-end-managed key→bytes protocol on `CompilerJob`:
    cache_get(job::CompilerJob, key::Symbol) -> Union{Nothing, Vector{UInt8}}
    cache_put!(job::CompilerJob, key::Symbol, ::Vector{UInt8})
GPUCompiler hands the back-end a job + a key (`:llvm_ir` currently — the
post-irgen LLVM bitcode for runtime library functions). The back-end stores it
wherever it likes — typically on a CI's `analysis_results` via CompilerCaching,
but it could equally be an in-memory `Dict`, on-disk storage, or nothing. The
default no-op pair means no caching.
`rtlib.jl` no longer imports `CompilerCaching` — it just calls the hooks. New
phase keys can be added without growing the API surface; back-ends opt in
selectively by matching on keys they care about.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
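A minimal sketch of how such a key→bytes protocol could look, under stated assumptions: `AbstractDemoJob`, `DictBackedJob`, `UncachedJob`, and `get_or_compile` are hypothetical names invented for this example and stand in for `CompilerJob` and the caller in `rtlib.jl`. Only the `cache_get`/`cache_put!` names and the `:llvm_ir` key come from the commit message.

```julia
# Hypothetical stand-in for GPUCompiler's CompilerJob, for illustration only.
abstract type AbstractDemoJob end

# Default no-op pair: a back-end that defines nothing gets no caching.
cache_get(job::AbstractDemoJob, key::Symbol) = nothing
cache_put!(job::AbstractDemoJob, key::Symbol, bytes::Vector{UInt8}) = nothing

# A back-end that opts in, here backed by an in-memory Dict.
struct DictBackedJob <: AbstractDemoJob
    store::Dict{Symbol,Vector{UInt8}}
end
DictBackedJob() = DictBackedJob(Dict{Symbol,Vector{UInt8}}())

cache_get(job::DictBackedJob, key::Symbol) = get(job.store, key, nothing)
function cache_put!(job::DictBackedJob, key::Symbol, bytes::Vector{UInt8})
    # Opt in selectively: only keep the keys this back-end cares about.
    key === :llvm_ir && (job.store[key] = bytes)
    return nothing
end

# A back-end that relies on the default no-op pair.
struct UncachedJob <: AbstractDemoJob end

# Caller side: try the cache first, otherwise "compile" and store the bytes.
function get_or_compile(job::AbstractDemoJob, key::Symbol, compile)
    bytes = cache_get(job, key)
    bytes === nothing || return bytes
    bytes = compile()
    cache_put!(job, key, bytes)
    return bytes
end
```

The caller never learns where the bytes live, which is the property the commit relies on to keep `rtlib.jl` free of any `CompilerCaching` import.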
Replaces the previous V-threaded design (GPUInterpreter{V}, results_type(job),
@setup_results, cache_get/cache_put!, cache_view) with a single back-end-facing
entry point:
    cached_results(::Type{V}, job::CompilerJob)::V
which returns the (lazily created) results struct for a job. Back-ends define
one mutable struct holding their per-stage artifacts, check completeness, and
compile into it — a single code path on all supported Julia versions:
- On 1.11+, the struct lives on the CodeInstance in Julia's integrated cache
  (running inference to create one when needed), wrapped in a config-keyed
  JobResults container. CompilerCaching attaches results lazily, so the
  GPUInterpreter no longer carries a results type, and independent consumers
  (e.g. our own runtime-library cache) can attach to the same CI.
- On 1.10, the struct lives in a session-local Dict keyed by the same job
  identity, kept alongside the other legacy code in deprecated.jl.
Keying results by the full CompilerConfig (not just cache_owner) fixes a
latent bug in the previous design where two jobs differing only in codegen
settings — e.g. the kernel name — would share artifacts. The owner token
still covers only what affects inference, so inference results remain shared
across such jobs.
The runtime library now uses the same mechanism: emit_function! memoizes each
runtime function's renamed bitcode in a RuntimeFunctionResults attached to its
CI, replacing the cache_get/cache_put! protocol. Back-ends no longer opt in:
runtime bitcode persists through precompilation automatically on 1.11+.
Also moves the 1.10 drive_inference! definition from deprecated.jl to
jlgen.jl: its signature references GPUInterpreter, which isn't defined yet
when deprecated.jl is included (1.10 loading was broken on the previous
branch). Finally, fixes the ptx precompile test to construct its cache token
from the standalone package's helper module (the sandbox copy defines a
distinct CompilerParams type, so its token can never match the precompiled
CIs).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
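The session-local variant of this entry point (the 1.10 path described above) might be sketched as follows. This is an illustration under assumptions, not the actual implementation: `DemoJob`, `DemoResults`, `RESULTS`, and `compile!` are hypothetical, and the real key is the job's identity including its full CompilerConfig.

```julia
# Hypothetical job identity; stands in for a method instance plus CompilerConfig.
struct DemoJob
    source::Symbol   # stands in for the method instance
    config::String   # stands in for the full config, e.g. including the kernel name
end

# A back-end's results struct: one mutable struct holding per-stage artifacts.
mutable struct DemoResults
    ir::Union{Nothing,String}
    obj::Union{Nothing,Vector{UInt8}}
    DemoResults() = new(nothing, nothing)
end

# Session-local store, keyed by results type and job identity.
const RESULTS = Dict{Tuple{DataType,DemoJob},Any}()

# Return the (lazily created) results struct for a job.
function cached_results(::Type{V}, job::DemoJob)::V where {V}
    return get!(() -> V(), RESULTS, (V, job))
end

# Back-end side: check completeness, then compile into the struct.
function compile!(job::DemoJob)
    res = cached_results(DemoResults, job)
    if res.obj === nothing
        res.ir = "ir for $(job.source) [$(job.config)]"
        res.obj = collect(codeunits(res.ir))
    end
    return res
end
```

Because the key includes the whole config, two jobs that differ only in a codegen setting such as the kernel name get separate results structs, which is the sharing bug this commit describes fixing.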
Dynamic construction of the cache token made a cached lookup ~3x slower than the legacy cached_compilation path (684 vs 234 ns); specialized, it is now faster (166 vs 182 ns). Instantiations are bounded: one per back-end (and results type).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A Vector field made the target mutable under jl_egal, so owner tokens and configs deserialized from package images never matched and cached kernels were silently recompiled. Use a single --spirv-ext specifier string instead, mirroring LLVM feature strings (and GCN's features).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
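The egality issue can be reproduced in plain Julia, where `===` is the user-facing form of `jl_egal`. `VecTarget`, `StrTarget`, and the extension name are hypothetical and used only for this illustration.

```julia
# Hypothetical targets: one stores extensions as a Vector, one as a single String.
struct VecTarget
    extensions::Vector{String}
end

struct StrTarget
    extensions::String
end

# `===` compares arrays by identity, so two separately constructed
# (or separately deserialized) targets are never egal.
same_vec = VecTarget(["+ext_a"]) === VecTarget(["+ext_a"])

# Strings are compared by content, so equal specifier strings make the
# enclosing immutable structs egal as well.
same_str = StrTarget("+ext_a") === StrTarget("+ext_a")
```

Here `same_vec` is `false` and `same_str` is `true`, which is why a token built from the string-based target can match one loaded from a package image.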
Artifacts derived from IR with relocated GVs embed absolute pointers from the precompilation process. Mark such jobs during output generation and drop their JobResults entries from an atexit hook, which runs before jl_write_compiler_output: within-session lookups still hit, but later sessions recompile instead of loading dangling pointers.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CIs deposited by our own precompile workload carry world ages from the precompilation process and are dead weight in later sessions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The derived runtime config inherited cosmetic fields like name=, so runtime function artifacts were cached (and persisted) once per kernel config variation instead of once per cache owner.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
806ec79 to a1e6c06
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
- Coverage   79.02%   73.94%   -5.09%
==========================================
  Files          25       25
  Lines        4630     4276     -354
==========================================
- Hits         3659     3162     -497
- Misses        971     1114     +143
Will be a breaking release. Sadly also drops 1.10.