Switch to CompilerCaching.jl#794

Draft
maleadt wants to merge 23 commits into main from tb/compilercaching

@maleadt

@maleadt maleadt commented May 12, 2026

Member

Will be a breaking release. Sadly also drops 1.10.

@vchuravy

Member

> Sadly also drops 1.10.

That's unfortunate and will likely mean that we will have to maintain the prior version of GPUCompiler until there is a new LTS.

@maleadt

maleadt commented May 12, 2026

Member Author

> will likely mean that we will have to maintain the prior version of GPUCompiler until there is a new LTS.

I'm not convinced that's needed. As long as there are no breaking releases in the back-ends, users can simply use different versions of, say, CUDA.jl 6.x depending on which Julia version they're using. And for critical features I'd rather they request backports over there than have us maintain multiple compiler stacks.

@vchuravy

Member

The issue is that for Enzyme we won't be able to drop 1.10 support (due to DiffEq and so forth) and there the situation is less stable than for the GPU backends.

@maleadt

maleadt commented May 13, 2026

Member Author

Grmbl. I'll try to come up with something that keeps 1.10 working then.

@maleadt maleadt force-pushed the tb/compilercaching branch from 93feac5 to a61b57b Compare June 16, 2026 18:25
maleadt and others added 22 commits June 16, 2026 20:27
Drop the hand-rolled CodeCache, on-disk kernel cache, and various
pre-1.11 compatibility shims; route inference and CI lookup through
CompilerCaching.CacheView with consumer-defined results structs
attached to each CodeInstance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously partitioned only by `typeof(target)`, which collided across
target instances that produce different IR (e.g. different macOS
versions, SM archs, or CPU features). Folding the full target and params
into the token matches the spirit of `runtime_slug`, while staying within
the inference-determinant scope of the docstring.
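
To illustrate the collision described above, here is a minimal sketch. `ToyTarget`, `ToyParams`, `token_by_type`, and `token_by_value` are hypothetical stand-ins and not GPUCompiler's actual types or token construction; the sketch only assumes that tokens end up as keys in a lookup table.

```julia
# Hypothetical stand-ins; not GPUCompiler's real target or params types.
struct ToyTarget
    sm::Int              # e.g. an SM architecture number
end

struct ToyParams end

# Old scheme: the token only records the types involved.
token_by_type(target, params) = (typeof(target), typeof(params))

# New scheme: the token folds in the full target and params values.
token_by_value(target, params) = (target, params)

cache_by_type  = Dict{Any,String}()
cache_by_value = Dict{Any,String}()

for target in (ToyTarget(70), ToyTarget(90))
    ir = "IR for sm_$(target.sm)"
    cache_by_type[token_by_type(target, ToyParams())]   = ir
    cache_by_value[token_by_value(target, ToyParams())] = ir
end

length(cache_by_type)    # 1: the second target overwrote the first
length(cache_by_value)   # 2: one entry per distinct target
```

With type-only tokens, both targets land in the same partition even though they would produce different IR; value-based tokens keep them apart.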

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional `bitcode`/`bitcode!` hooks on the consumer's `results_type`.
When opted in, `emit_function!` reads renamed per-function bitcode from
the cached CI on a hit, and writes it on a miss. Cross-session
persistence rides on package precompilation; a small session-local
assembled-module cache (keyed by `(cache_owner, opaque_pointers)`) keeps
the within-session fast path.

Drops the `runtime_slug` interface — `cache_owner` now subsumes its role
of identifying compatible IR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`emit_function!` now memoizes each `gpu_*` runtime function's renamed,
post-irgen LLVM bitcode on its own `CodeInstance`'s `analysis_results`
when the back-end opts in via the new `bitcode`/`bitcode!` trait pair.
Cross-session persistence rides on package precompilation; the
session-local `_runtime_libs` assembled-module cache keeps repeated
within-session linking cheap.

Gated on `HAS_INTEGRATED_CACHE` so 1.10 falls through to plain compile
+ link (still serviced by `_runtime_libs`). With Metal opted in, a
second kernel compile in the same session — even after
`reset_runtime()` invalidates `_runtime_libs` — completes ~25× faster
than a cold rebuild on 1.12, because each runtime function is now a
parse-and-link instead of a full Julia → LLVM run.

Restores the optimization originally landed in aa4e64d (reverted in
566811d for 1.10 compat); the new version sits behind the same
infrastructure as `cached_compilation`, so the 1.10 path is no longer
load-bearing on the trait being callable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On Julia 1.13+, `jl_emit_native_impl` itself sets every `jl_sysimg_gvar`
to a null initializer before returning (aotcompile.cpp:865), leaving
relocation to the caller. On 1.12, `jl_emit_native_impl` instead bakes
session-local pointer values into the initializer via
`literal_static_pointer_val` — so without intervention, the bitcode we
hand to `bitcode!` for caching carries live pointers from the current
session and isn't safe to reload in a future one.

After collecting `gv_to_value` from `jl_get_llvm_gvs` /
`jl_get_llvm_gvs_globals`, immediately reset each tracked GV's
initializer to null. `relocate_gvs!` at the toplevel link step then
re-applies the session-current values regardless of which Julia we're
on, so optimization still sees the resolved constants.

On 1.13+ the null-out is a no-op (Julia already nulled them); on 1.12
this is what makes per-CI runtime bitcode caching genuinely
cross-session-safe for back-ends that pull in `julia.constgv`-touching
runtime functions (CUDA's `gc_pool_alloc`, `box_*`/`unbox_*`, …).
Metal's stubs don't trip this either way.

Verified: 27 `julia.constgv` GVs in Metal's cached runtime-fn bitcode
on 1.12, all with null initializers post-change (was 27/27 non-null).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompilerCaching becomes a strong dep (it loads as an empty shell on 1.10,
so there's no overhead). The `GPUCompilerCompilerCachingExt` extension —
which existed just to wire the parametric `CC.finish!` override that
attaches a `CachedResult{V}` to every inferred CodeInstance — moves into
`jlgen.jl` directly, gated on `HAS_INTEGRATED_CACHE`. Same code, one less
file.

With CompilerCaching available unconditionally we can also drop the inline
copies of its inference machinery from `jlgen.jl`:

- `drive_inference!` on 1.11+ is now a two-line delegation to
  `CompilerCaching.typeinf!` + `get(cache, mi, nothing)`. The 1.10
  implementation (which talks to the per-interpreter `CodeCache`) moves to
  `deprecated.jl`.
- `collect_codeinfos` / `_ci_codeinfo` go away; the single call site in
  `compile_method_instance` calls `CompilerCaching.get_codeinfos` directly.
- `StackedMethodTable` is re-exported from CompilerCaching on 1.11+; the
  1.10 variant (with the older `MethodMatchResult`-shape `findall`) moves
  to `deprecated.jl`.

Net result: ~200 lines deleted from `jlgen.jl`, no behavior change. All
1.10-only code is now in `deprecated.jl`, ready to disappear in one diff
when 1.10 support drops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hand-rolled `CC.finish!` override with `@setup_results GPUInterpreter`
plus a one-line `CompilerCaching.results_type` trait that reads V from the
interpreter's type parameter. Same generated code, less boilerplate.

`drive_inference!` collapses to a one-liner calling `CompilerCaching.typeinf!(interp, mi)`
— which now constructs the CacheView internally and returns the root CI directly,
saving the lookup the old cache-taking form required. The `cache_view(interp)`
helper inside `jlgen.jl` goes away with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rotocol.

The old `bitcode(results)` / `bitcode!(results, bytes)` pair was a single-purpose
hook bolted onto the consumer's results struct. Extending it to memoize more
phases (LLVM IR, intermediate AIR, whatever) meant adding a parallel pair per
phase. Worse, it required the consumer's results struct to be in the loop on
every cache touch — which forced `rtlib.jl` to know about `CompilerCaching` to
do the per-CI lookup that fetched the right results instance.

Replace with a back-end-managed key→bytes protocol on `CompilerJob`:

    cache_get(job::CompilerJob, key::Symbol)               -> Union{Nothing, Vector{UInt8}}
    cache_put!(job::CompilerJob, key::Symbol, ::Vector{UInt8})

GPUCompiler hands the back-end a job + a key (`:llvm_ir` currently — the
post-irgen LLVM bitcode for runtime library functions). The back-end stores it
wherever it likes — typically on a CI's `analysis_results` via CompilerCaching,
but it could equally be an in-memory `Dict`, on-disk storage, or nothing. The
default no-op pair means no caching.

`rtlib.jl` no longer imports `CompilerCaching` — it just calls the hooks. New
phase keys can be added without growing the API surface; back-ends opt in
selectively by matching on keys they care about.
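
As a rough sketch of how a back-end might opt in to this protocol: only the `cache_get`/`cache_put!` signatures and the `:llvm_ir` key come from the message above. `ToyJob`, `TOY_STORE`, `get_or_compile`, and `toy_compile` are hypothetical, and an in-memory `Dict` stands in for whatever storage a real back-end would pick.

```julia
# Hypothetical stand-in for a CompilerJob; only `owner` is used here.
struct ToyJob
    owner::Symbol
end

# Default no-op pair: without an override, nothing is cached.
cache_get(job, key::Symbol) = nothing
cache_put!(job, key::Symbol, bytes::Vector{UInt8}) = nothing

# An opting-in back-end stores the bytes wherever it likes; here, in a Dict.
const TOY_STORE = Dict{Tuple{Symbol,Symbol},Vector{UInt8}}()

function cache_get(job::ToyJob, key::Symbol)
    key === :llvm_ir || return nothing      # ignore keys we do not care about
    return get(TOY_STORE, (job.owner, key), nothing)
end

function cache_put!(job::ToyJob, key::Symbol, bytes::Vector{UInt8})
    key === :llvm_ir || return nothing
    TOY_STORE[(job.owner, key)] = bytes
    return nothing
end

# Caller side: consult the hook, compile on a miss, then store the result.
function get_or_compile(compile, job, key::Symbol)
    bytes = cache_get(job, key)
    if bytes === nothing
        bytes = compile()
        cache_put!(job, key, bytes)
    end
    return bytes
end

calls = Ref(0)
toy_compile() = (calls[] += 1; UInt8[0x42, 0x43])

job = ToyJob(:toy_backend)
get_or_compile(toy_compile, job, :llvm_ir)
get_or_compile(toy_compile, job, :llvm_ir)
calls[]   # 1: the second request was served from the store
```

Because the caller only sees the two hooks, it needs no knowledge of where the bytes live, which is the decoupling the message describes for `rtlib.jl`.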

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the previous V-threaded design (GPUInterpreter{V}, results_type(job),
@setup_results, cache_get/cache_put!, cache_view) with a single back-end-facing
entry point:

    cached_results(::Type{V}, job::CompilerJob)::V

which returns the (lazily created) results struct for a job. Back-ends define
one mutable struct holding their per-stage artifacts, check completeness, and
compile into it — a single code path on all supported Julia versions:

- On 1.11+, the struct lives on the CodeInstance in Julia's integrated cache
  (running inference to create one when needed), wrapped in a config-keyed
  JobResults container. CompilerCaching attaches results lazily, so the
  GPUInterpreter no longer carries a results type, and independent consumers
  (e.g. our own runtime-library cache) can attach to the same CI.
- On 1.10, the struct lives in a session-local Dict keyed by the same job
  identity, kept alongside the other legacy code in deprecated.jl.

Keying results by the full CompilerConfig (not just cache_owner) fixes a
latent bug in the previous design where two jobs differing only in codegen
settings — e.g. the kernel name — would share artifacts. The owner token
still covers only what affects inference, so inference results remain shared
across such jobs.
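
The keying change can be sketched roughly as follows, modelled on the session-local `Dict` variant described for 1.10. Only the `cached_results(::Type{V}, job)::V` shape comes from the message above; `ToyConfig`, `ToyConfigJob`, `ToyResults`, and `TOY_RESULTS` are hypothetical, and the real implementation may differ.

```julia
# Hypothetical stand-ins; the real config carries more fields.
struct ToyConfig
    owner::Symbol    # what affects inference
    name::String     # codegen-only setting, e.g. the kernel name
end

struct ToyConfigJob
    config::ToyConfig
end

# One mutable struct per back-end, holding its per-stage artifacts.
mutable struct ToyResults
    ir::Union{Nothing,String}
    obj::Union{Nothing,Vector{UInt8}}
    ToyResults() = new(nothing, nothing)
end

# Session-local store keyed by the results type and the full config.
const TOY_RESULTS = Dict{Tuple{DataType,ToyConfig},Any}()

function cached_results(::Type{V}, job::ToyConfigJob)::V where {V}
    return get!(() -> V(), TOY_RESULTS, (V, job.config))
end

a = ToyConfigJob(ToyConfig(:toy, "kernel_a"))
b = ToyConfigJob(ToyConfig(:toy, "kernel_b"))

res = cached_results(ToyResults, a)
if res.obj === nothing                  # completeness check
    res.ir  = "define void @kernel_a()"
    res.obj = UInt8[0x01]
end

cached_results(ToyResults, a) === res   # true: same job, same struct
cached_results(ToyResults, b) === res   # false: only the name differs, yet no sharing
```

Had the store been keyed by `owner` alone, jobs `a` and `b` would have shared one results struct, which is the latent bug the message refers to.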

The runtime library now uses the same mechanism: emit_function! memoizes each
runtime function's renamed bitcode in a RuntimeFunctionResults attached to its
CI, replacing the cache_get/cache_put! protocol. Back-ends no longer opt in:
runtime bitcode persists through precompilation automatically on 1.11+.

Also moves the 1.10 drive_inference! definition from deprecated.jl to
jlgen.jl: its signature references GPUInterpreter, which isn't defined yet
when deprecated.jl is included (1.10 loading was broken on the previous
branch), and fixes the ptx precompile test to construct its cache token from
the standalone package's helper module (the sandbox copy defines a distinct
CompilerParams type, so its token can never match the precompiled CIs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Dynamic construction of the cache token made a cached lookup ~3x slower
than the legacy cached_compilation path (684 vs 234 ns); specialized, it
is now faster (166 vs 182 ns). Instantiations are bounded: one per
back-end (and results type).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A Vector field made the target mutable under jl_egal, so owner tokens
and configs deserialized from package images never matched and cached
kernels were silently recompiled. Use a single --spirv-ext specifier
string instead, mirroring LLVM feature strings (and GCN's features).
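
The identity problem can be reproduced in plain Julia. `VecTarget`, `StrTarget`, and the extension names below are placeholders rather than the real target definition or real extension specifiers. The sketch assumes only that lookups compare targets with `===`, which is `jl_egal` at the language level.

```julia
# Hypothetical targets; field names and extension strings are placeholders.
struct VecTarget
    extensions::Vector{String}
end

struct StrTarget
    extensions::String
end

exts = ["+ext_a", "+ext_b"]

# Two separately constructed targets with equal contents, as one would
# get when one copy comes from a package image and the other is built fresh.
v1 = VecTarget(copy(exts))
v2 = VecTarget(copy(exts))
s1 = StrTarget(join(exts, ","))
s2 = StrTarget(join(exts, ","))

v1 === v2   # false: the Vector fields are distinct mutable objects
s1 === s2   # true: Strings compare by content under ===

haskey(Dict(v1 => :cached), v2)   # false: the cached entry is never found
haskey(Dict(s1 => :cached), s2)   # true
```

An immutable struct is only as identity-stable as its fields, so swapping the `Vector` for a `String` restores matching without touching the lookup code.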

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Artifacts derived from IR with relocated GVs embed absolute pointers
from the precompilation process. Mark such jobs during output
generation and drop their JobResults entries from an atexit hook, which
runs before jl_write_compiler_output: within-session lookups still hit,
but later sessions recompile instead of loading dangling pointers.
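
A minimal sketch of the mark-and-drop idea, assuming a plain `Dict` in place of the real per-job results storage. `TOY_JOB_RESULTS`, `TOY_TAINTED`, `store!`, and `prune_tainted!` are hypothetical names; only the use of an `atexit` hook comes from the message above.

```julia
# Hypothetical session-local store and taint set.
const TOY_JOB_RESULTS = Dict{Symbol,Vector{UInt8}}()
const TOY_TAINTED = Set{Symbol}()

# Record an artifact and mark it when it embeds session-local pointers.
function store!(key::Symbol, bytes::Vector{UInt8}; relocated::Bool=false)
    TOY_JOB_RESULTS[key] = bytes
    relocated && push!(TOY_TAINTED, key)
    return bytes
end

# Drop every marked entry so it cannot be persisted.
function prune_tainted!()
    for key in TOY_TAINTED
        delete!(TOY_JOB_RESULTS, key)
    end
    empty!(TOY_TAINTED)
    return nothing
end

# Run the pruning when the process exits.
atexit(prune_tainted!)

store!(:clean_kernel, UInt8[0x01])
store!(:relocated_kernel, UInt8[0x02]; relocated=true)

haskey(TOY_JOB_RESULTS, :relocated_kernel)   # true: within-session lookups still hit
```

Deferring the drop to exit time keeps the fast path intact for the running session while ensuring only pointer-free entries remain when output is written.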

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CIs deposited by our own precompile workload carry world ages from the
precompilation process and are dead weight in later sessions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The derived runtime config inherited cosmetic fields like name=, so
runtime function artifacts were cached (and persisted) once per kernel
config variation instead of once per cache owner.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@maleadt maleadt force-pushed the tb/compilercaching branch from 806ec79 to a1e6c06 Compare June 16, 2026 18:28
@codecov

codecov Bot commented Jun 16, 2026


Codecov Report

❌ Patch coverage is 37.82772% with 166 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.94%. Comparing base (ea01d8b) to head (a1e6c06).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/deprecated.jl | 0.00% | 125 Missing ⚠️ |
| src/jlgen.jl | 55.38% | 29 Missing ⚠️ |
| src/interface.jl | 83.33% | 7 Missing ⚠️ |
| src/rtlib.jl | 87.50% | 3 Missing ⚠️ |
| src/spirv.jl | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
- Coverage   79.02%   73.94%   -5.09%     
==========================================
  Files          25       25              
  Lines        4630     4276     -354     
==========================================
- Hits         3659     3162     -497     
- Misses        971     1114     +143     



2 participants