
Add Gemma 4 text-decoder export to CoreML#19253

Open
john-rocky wants to merge 1 commit into pytorch:main from john-rocky:coreml/gemma4-text-decoder

Conversation

@john-rocky

Summary

The Gemma 4 text decoder shipped with examples/models/gemma4/text_decoder/
already implements hybrid sliding/full attention, partial RoPE,
per-layer head_dim (256 for sliding / 512 for full), MQA, and YOCO
KV sharing in plain PyTorch.
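The partial-RoPE scheme mentioned above (only a leading slice of each head dimension is rotated; the rest passes through unchanged) can be sketched in NumPy. The 25% fraction and the frequency base below are illustrative placeholders, not values read out of the Gemma 4 implementation:

```python
import numpy as np

def partial_rope(x: np.ndarray, positions: np.ndarray,
                 rotary_frac: float = 0.25, base: float = 10000.0) -> np.ndarray:
    """Rotate only the first `rotary_frac` of each head dim of x (seq, head_dim).

    Sketch only: fraction, base, and interleaved pairing are assumptions.
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_frac)              # dims that actually rotate
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2) / rot_dim))
    angles = positions[:, None] * inv_freq[None, :]    # (seq, rot_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]        # interleaved pairs
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)  # unrotated tail intact
```

The pass-through tail is what distinguishes partial RoPE from the usual full-dimension rotation.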

I checked, and that implementation lowers cleanly through
torch.export and CoreMLPartitioner today
— for the synthetic
10-layer Gemma 4 used in the new test, the lowered edge program
contains only executorch_call_delegate and getitem nodes at the top
level (1186 MIL ops, fully delegated). No portable fallbacks, no
unsupported ops.

So the missing piece is not new modeling code — it is the small amount
of glue that turns "exportable in principle" into "exportable from one
shell command". This PR adds that glue:

  • examples/apple/coreml/gemma4/export_gemma4_text_decoder_coreml.py,
    with sensible CoreML defaults: iOS18+ deployment target so the
YOCO KV caches can be represented as stateful tensors,
    compute_unit=CPU_AND_NE, fp16 by default (the ANE requires fp16).
  • A --random_weights mode for smoke-testing the export pipeline
    without a HuggingFace checkpoint, plus --config_json,
    --sliding_window, --sliding_window_pattern overrides.
  • A readme.md documenting the flags and the "everything delegates"
    property.
  • A BUCK target so the script is buildable in fbcode the same way
    the existing CoreML llama scripts are.
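For orientation, the hybrid layout that --sliding_window_pattern controls can be sketched as a small helper. The function name and the `(i + 1) % pattern` rule below are assumptions matching the "4 sliding + 1 full" description in this PR, not the script's exact logic:

```python
def layer_types(num_layers: int, sliding_window_pattern: int = 5) -> list:
    """Hypothetical sketch: every `sliding_window_pattern`-th layer uses
    full attention, all other layers use sliding-window attention."""
    return [
        "full" if (i + 1) % sliding_window_pattern == 0 else "sliding"
        for i in range(num_layers)
    ]
```

With `num_layers=10` and the default pattern this yields two repeats of four sliding layers followed by one full layer, matching the synthetic test config described below.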

The audio and vision encoders are intentionally out of scope — the
existing ATen pipeline in examples/models/gemma4 is more appropriate
for those.

Test plan

examples/apple/coreml/gemma4/test.py builds a 10-layer synthetic
Gemma 4 (4 sliding + 1 full × 2) — same hybrid pattern as Gemma 4 E2B,
just at smaller dimensions — and runs the full export pipeline,
asserting the resulting .pte is non-empty.

$ python -m pytest examples/apple/coreml/gemma4/test.py -v
test.py::TestGemma4CoreMLExport::test_eager_forward_runs PASSED
test.py::TestGemma4CoreMLExport::test_full_export_pipeline_lowers_to_coreml PASSED
============================== 2 passed in 15.32s ==============================

I also ran the export by hand and confirmed the resulting edge program
is fully delegated.

Relationship to other open PRs

Authored with Claude.

@john-rocky john-rocky requested a review from metascroy as a code owner May 1, 2026 06:03
@pytorch-bot

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19253

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 11 Awaiting Approval

As of commit 4efa007 with merge base 94d2881:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 1, 2026
@shoumikhin
Contributor

Thanks @john-rocky, really appreciate the CoreML batch. Linking the related PRs in this stack so reviewers can see the full picture: #19245, #19246, #19247, #19248, #19249, #19250, #19251, #19252.

@metascroy you're already on this one. Would you mind taking a sweep across the stack, or should we pull in another CoreML reviewer?

@john-rocky
Author

Thanks @shoumikhin! Quick orientation for whoever does the sweep:

All nine PRs have unit tests, which I ran on macOS 26 / Python 3.10 / coremltools 9.0; the test-plan section in each PR body includes the local pytest output.

Happy to split, squash, retitle, or add a release notes: label to any of them if that helps land the batch faster — let me know what's most useful.

@john-rocky
Author

@pytorchbot label "release notes: apple"

@pytorch-bot pytorch-bot Bot added the release notes: apple Changes to the Apple backend delegate label May 2, 2026
@metascroy
Contributor

Thanks for the PR stack @john-rocky! I've started reviewing and leaving comments on the stack.

Is this PR an alternative export path to the static export path you enabled in other PRs? If so, what was the perf difference? I think enabling the non-static path likely belongs as a contribution to the optimum-executorch repo, rather than here: https://github.com/huggingface/optimum-executorch

@john-rocky
Author

Thanks for the careful read @metascroy!

Quick clarification on the placement question — this PR is not an alternative export path that competes with the static one in #19250 / #19251. Gemma 4's text decoder uses a model class that's structurally different from StaticAttention:

  • Per-layer head_dim (global_head_dim=512 for the full layers, head_dim=256 for the sliding ones)
  • Partial RoPE on the full layers (only the first 25% of dims rotate)
  • V-norm
  • Q/K-norm applied before RoPE

Wiring those into StaticAttention would be a non-trivial set of additions (per-layer head_dim is the most invasive — it propagates through every cache and mask), and that's the work I would normally have done before being able to compare the two paths side-by-side. So I don't have a static-vs-this perf number today, because there is no static path for Gemma 4 yet. If you'd like, I'm very happy to do that work as a follow-up — extend StaticAttention to support those features and then this script becomes the comparison harness.
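The Q/K-norm-before-RoPE ordering listed above can be sketched with an unscaled RMSNorm standing in for the real norm. `project_qk` and `rope_fn` are hypothetical names for illustration, not Gemma 4's actual projection code:

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Unscaled RMSNorm stand-in (no learned weight), for illustration only.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def project_qk(q: np.ndarray, k: np.ndarray, rope_fn):
    # The ordering under discussion: normalize Q and K first, then apply
    # rotary embeddings to the normalized tensors.
    return rope_fn(rms_norm(q)), rope_fn(rms_norm(k))
```

The point of the ordering is that the rotation acts on unit-RMS vectors, so the norm cannot be folded into the rotary step when exporting.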

For the optimum-executorch question: I went with examples/apple/coreml/gemma4/ as a direct mirror of examples/apple/coreml/llama/, since both consume model code shipped under examples/models/<model>/ and apply CoreML-specific glue. If you'd rather this live in huggingface/optimum-executorch, I'm fine to close this PR and move it there — just say the word. The script imports nothing from optimum, and none of the work in #19245–#19252 depends on it landing here.

Either way, the underlying observation in this PR — that examples/models/gemma4/text_decoder/ already lowers fully through CoreMLPartitioner (1186 MIL ops, no portable fallbacks) — should be useful regardless of where the wrapper lives.
