
Add Gemma 4 text-decoder export to CoreML#19253

Open
john-rocky wants to merge 1 commit into pytorch:main from john-rocky:coreml/gemma4-text-decoder

Conversation

@john-rocky

Summary

The Gemma 4 text decoder shipped with examples/models/gemma4/text_decoder/
already implements hybrid sliding/full attention, partial RoPE,
per-layer head_dim (256 for sliding / 512 for full), MQA, and YOCO
KV sharing in plain PyTorch.
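The partial-RoPE scheme mentioned above (only a leading slice of each head dimension is rotated; the rest passes through unchanged) can be sketched in NumPy. The 25% fraction and the frequency base below are illustrative placeholders, not values read out of the Gemma 4 implementation:

```python
import numpy as np

def partial_rope(x: np.ndarray, positions: np.ndarray,
                 rotary_frac: float = 0.25, base: float = 10000.0) -> np.ndarray:
    """Rotate only the first `rotary_frac` of each head dim of x (seq, head_dim).

    Sketch only: fraction, base, and interleaved pairing are assumptions.
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_frac)              # dims that actually rotate
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2) / rot_dim))
    angles = positions[:, None] * inv_freq[None, :]    # (seq, rot_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]        # interleaved pairs
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)  # unrotated tail intact
```

The pass-through tail is what distinguishes partial RoPE from the usual full-dimension rotation.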

I checked, and that implementation lowers cleanly through
torch.export and CoreMLPartitioner today
— for the synthetic
10-layer Gemma 4 used in the new test, the lowered edge program
contains only executorch_call_delegate and getitem nodes at the top
level (1186 MIL ops, fully delegated). No portable fallbacks, no
unsupported ops.

So the missing piece is not new modeling code — it is the small amount
of glue that turns "exportable in principle" into "exportable from one
shell command". This PR adds that glue:

  • examples/apple/coreml/gemma4/export_gemma4_text_decoder_coreml.py,
    with sensible CoreML defaults: iOS18+ deployment target so the
YOCO KV caches can be represented as stateful tensors,
    compute_unit=CPU_AND_NE, fp16 by default (the ANE requires fp16).
  • A --random_weights mode for smoke-testing the export pipeline
    without a HuggingFace checkpoint, plus --config_json,
    --sliding_window, --sliding_window_pattern overrides.
  • A readme.md documenting the flags and the "everything delegates"
    property.
  • A BUCK target so the script is buildable in fbcode the same way
    the existing CoreML llama scripts are.
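For orientation, the hybrid layout that --sliding_window_pattern controls can be sketched as a small helper. The function name and the `(i + 1) % pattern` rule below are assumptions matching the "4 sliding + 1 full" description in this PR, not the script's exact logic:

```python
def layer_types(num_layers: int, sliding_window_pattern: int = 5) -> list:
    """Hypothetical sketch: every `sliding_window_pattern`-th layer uses
    full attention, all other layers use sliding-window attention."""
    return [
        "full" if (i + 1) % sliding_window_pattern == 0 else "sliding"
        for i in range(num_layers)
    ]
```

With `num_layers=10` and the default pattern this yields two repeats of four sliding layers followed by one full layer, matching the synthetic test config described below.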

The audio and vision encoders are intentionally out of scope — the
existing ATen pipeline in examples/models/gemma4 is more appropriate
for those.

Test plan

examples/apple/coreml/gemma4/test.py builds a 10-layer synthetic
Gemma 4 (4 sliding + 1 full × 2) — same hybrid pattern as Gemma 4 E2B,
just at smaller dimensions — and runs the full export pipeline,
asserting the resulting .pte is non-empty.

$ python -m pytest examples/apple/coreml/gemma4/test.py -v
test.py::TestGemma4CoreMLExport::test_eager_forward_runs PASSED
test.py::TestGemma4CoreMLExport::test_full_export_pipeline_lowers_to_coreml PASSED
============================== 2 passed in 15.32s ==============================

I also ran the export by hand and confirmed the resulting edge program
is fully delegated.

Relationship to other open PRs

Authored with Claude.

@john-rocky john-rocky requested a review from metascroy as a code owner May 1, 2026 06:03
@pytorch-bot

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19253

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 11 Awaiting Approval

As of commit 4efa007 with merge base 94d2881:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 1, 2026
@shoumikhin
Contributor

Thanks @john-rocky, really appreciate the CoreML batch. Linking the related PRs in this stack so reviewers can see the full picture: #19245, #19246, #19247, #19248, #19249, #19250, #19251, #19252.

@metascroy you're already on this one. Would you mind taking a sweep across the stack, or should we pull in another CoreML reviewer?

@john-rocky
Author

Thanks @shoumikhin! Quick orientation for whoever does the sweep:

All nine PRs have unit tests, which I ran on macOS 26 / Python 3.10 / coremltools 9.0; the test-plan section in each PR body includes the local pytest output.

Happy to split, squash, retitle, or add a release notes: label to any of them if that helps land the batch faster — let me know what's most useful.

@john-rocky
Author

@pytorchbot label "release notes: apple"

@pytorch-bot pytorch-bot Bot added the release notes: apple Changes to the Apple backend delegate label May 2, 2026
@metascroy
Contributor

Thanks for the PR stack @john-rocky! I've started reviewing and leaving comments on the stack.

Is this PR an alternative export path to the static export path you enabled in other PRs? If so, what was the perf difference? I think enabling the non-static path likely belongs as a contribution to the optimum-executorch repo, rather than here: https://github.com/huggingface/optimum-executorch

@john-rocky
Author

Thanks for the careful read @metascroy!

Quick clarification on the placement question — this PR is not an alternative export path that competes with the static one in #19250 / #19251. Gemma 4's text decoder uses a model class that's structurally different from StaticAttention:

  • Per-layer head_dim (global_head_dim=512 for the full layers, head_dim=256 for the sliding ones)
  • Partial RoPE on the full layers (only the first 25% of dims rotate)
  • V-norm
  • Q/K-norm applied before RoPE

Wiring those into StaticAttention would be a non-trivial set of additions (per-layer head_dim is the most invasive — it propagates through every cache and mask), and that's the work I would normally have done before being able to compare the two paths side-by-side. So I don't have a static-vs-this perf number today, because there is no static path for Gemma 4 yet. If you'd like, I'm very happy to do that work as a follow-up — extend StaticAttention to support those features and then this script becomes the comparison harness.
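The Q/K-norm-before-RoPE ordering listed above can be sketched with an unscaled RMSNorm standing in for the real norm. `project_qk` and `rope_fn` are hypothetical names for illustration, not Gemma 4's actual projection code:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Unscaled RMSNorm stand-in (no learned weight), for illustration only.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def project_qk(q: np.ndarray, k: np.ndarray, rope_fn):
    # The ordering under discussion: normalize Q and K first, then apply
    # rotary embeddings to the normalized tensors.
    return rope_fn(rms_norm(q)), rope_fn(rms_norm(k))
```

The point of the ordering is that the rotation acts on unit-RMS vectors, so the norm cannot be folded into the rotary step when exporting.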

For the optimum-executorch question: I went with examples/apple/coreml/gemma4/ as a direct mirror of examples/apple/coreml/llama/, since both consume model code shipped under examples/models/<model>/ and apply CoreML-specific glue. If you'd rather this live in huggingface/optimum-executorch, I'm fine to close this PR and move it there — just say the word. The script imports nothing from optimum, and none of the work in #19245–#19252 depends on it landing here.

Either way, the underlying observation in this PR — that examples/models/gemma4/text_decoder/ already lowers fully through CoreMLPartitioner (1186 MIL ops, no portable fallbacks) — should be useful regardless of where the wrapper lives.
