fix(ltx2): Fix VAE timing regression for large batch sizes by mbohlool · Pull Request #421 · AI-Hypercomputer/maxdiffusion

mbohlool · 2026-06-16T17:09:40Z

Fix VAE timing regression for large batch sizes

Root Cause:
In commit 7b28885, an optimization was added to prevent OOM errors for large batch sizes (batch_size > 2) by batch-sharding the latents and disabling sequential slicing. However, that logic placed the weight replication in an elif replicate_vae: branch, so the explicit replication of VAE weights was skipped entirely for large batch sizes.
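
The control-flow problem described above can be sketched in plain Python. This is a hypothetical reconstruction for illustration only: the function name plan_vae_decode_buggy and the returned dictionary keys are assumptions, not the actual maxdiffusion code. It only shows why an elif branch is never reached once the batch-size branch has been taken.

```python
# Hypothetical reconstruction of the branch structure described above.
# Names are illustrative assumptions, not the real maxdiffusion code.
def plan_vae_decode_buggy(batch_size: int, replicate_vae: bool) -> dict:
    plan = {"shard_latents_on_batch": False, "replicate_weights": False}
    if batch_size > 2:
        # OOM guard: batch-shard the latents for large batches.
        plan["shard_latents_on_batch"] = True
    elif replicate_vae:
        # Only reachable when batch_size <= 2, so large batches
        # never get explicitly replicated VAE weights.
        plan["replicate_weights"] = True
    return plan


small = plan_vae_decode_buggy(batch_size=1, replicate_vae=True)
large = plan_vae_decode_buggy(batch_size=4, replicate_vae=True)
print(small)
print(large)
```

With replicate_vae set to True, the small-batch plan replicates the weights while the large-batch plan does not, which matches the regression described here.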

Without explicit weight replication, the XLA SPMD partitioner tries to propagate the sharding of the input latents (which are batch-sharded) through the VAE decode computation. Because vae.decode involves fully-replicated noise injection and massive 3D convolutions, XLA's heuristics insert large amounts of cross-device communication (AllGather/AllReduce) to shard the weights or activations, which increased the execution time from ~2.8s to ~68.5s for non-upsampled latents.

(Note: For upsampled latents, the memory layout generated by the JIT-compiled upsampler avoids this XLA heuristic trap, so decode runs quickly in ~1.5s. This masked the issue.)

Fix:
This PR decouples the batch-sharding of latents from the replication of VAE weights. It explicitly applies a full replication constraint, NamedSharding(mesh, P()), to the VAE weights in all cases where replicate_vae is True, even if the latents are batch-sharded. This forces XLA into the optimal data-parallel compilation path, restoring the fast execution time (~1.5s for upsampled latents, ~2.8s for non-upsampled latents) without risking the concatenation-related OOMs.
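
A minimal JAX sketch of the decoupled placement, under stated assumptions: toy_decode is a stand-in for vae.decode, place_inputs and the "data" mesh axis name are hypothetical, and jax.device_put is used here to apply the sharding (the PR may apply the constraint through a different mechanism). The point is only that the two decisions are independent if statements, so replicated weights and batch-sharded latents can coexist.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over all available devices; the axis name is an assumption.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

replicated = NamedSharding(mesh, P())         # full copy on every device
batch_sharded = NamedSharding(mesh, P("data"))  # split the leading (batch) axis


def toy_decode(params, latents):
    # Stand-in for vae.decode; not the real model.
    return jnp.tanh(latents @ params["w"] + params["b"])


def place_inputs(params, latents, replicate_vae: bool, shard_batch: bool):
    # Hypothetical helper: the two decisions are independent.
    if shard_batch:
        latents = jax.device_put(latents, batch_sharded)
    if replicate_vae:  # plain `if`, not `elif`
        params = jax.device_put(params, replicated)
    return params, latents


batch = 4 * devices.size  # keep the batch divisible by the device count
params = {"w": jnp.ones((8, 8)), "b": jnp.zeros((8,))}
latents = jnp.ones((batch, 8))

params, latents = place_inputs(params, latents, replicate_vae=True, shard_batch=True)
out = jax.jit(toy_decode)(params, latents)
print(params["w"].sharding.is_fully_replicated, out.shape)
```

On a single-device host the two shardings are trivially equivalent, so the difference in compiled communication only shows up on a multi-device mesh.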

github-actions · 2026-06-16T17:09:54Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

Fixes a large execution time regression in VAE decode (~68s instead of ~2s) by explicitly replicating VAE weights even when latents are batch-sharded. This forces XLA into an optimal data-parallel path, restoring the fast ~2s execution time while retaining the batch-sharding OOM protections.

mbohlool requested a review from entrpn as a code owner June 16, 2026 17:09

mbohlool requested a review from prishajain1 June 16, 2026 17:11

mbohlool force-pushed the fix-vae-timing-regression branch from 27eda58 to 96e9d82 Compare June 16, 2026 17:13

prishajain1 approved these changes Jun 17, 2026

github-actions Bot added the pull ready label Jun 17, 2026

copybara-service Bot merged commit 9616d1c into main Jun 17, 2026
16 checks passed

copybara-service[bot] merged 1 commit into main from fix-vae-timing-regression
