Skip to content

Fix DeepCompile ZeRO-1 grad target lifetime#8036

Open
tohtana wants to merge 7 commits into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-zero1-flat-release
Open

Fix DeepCompile ZeRO-1 grad target lifetime#8036
tohtana wants to merge 7 commits into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-zero1-flat-release

Conversation

@tohtana
Copy link
Copy Markdown
Collaborator

@tohtana tohtana commented May 29, 2026

DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into the optimizer step, causing backward gradient storage to overlap with optimizer temporaries. This PR fixes the issue by making DeepCompile ZeRO-1 reduce targets follow the normal step-local ZeRO partition gradient-buffer lifetime, instead of preserving cloned target storage from compile setup.

The actual code changes are:

  • During compile initialization, register empty DeepCompile ZeRO-1 gradient targets, then bind them to the step-local flat ZeRO partition gradient buffer and per-parameter views when gradients are ready to synchronize.
  • After ZeRO-1 builds the optimizer-facing fp32 gradient partition, release the DeepCompile registry references and clear reduce bucket storage after backward synchronization.
Step 10-30 avg sec Peak alloc GiB
Without this PR 0.858 43.594
Without this PR 0.859 39.366

Fine-tuning style training (8xH100, Qwen3-8B random weights, bs/GPU=1, seq=4096, GAS=1) showed only finite value losses for 1000 steps.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana marked this pull request as ready for review May 29, 2026 22:08
@tohtana tohtana requested review from loadams and tjruwase as code owners May 29, 2026 22:08
@tohtana tohtana requested a review from eternalNight May 29, 2026 22:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1947a25a80

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread csrc/includes/deepcompile.h Outdated
@tohtana tohtana marked this pull request as draft May 29, 2026 23:40
tohtana and others added 6 commits May 29, 2026 23:28
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…1-grad-buffer-protocol

Use generic flat grad partition protocol
@tohtana tohtana marked this pull request as ready for review June 2, 2026 19:23
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f98f7290c0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread deepspeed/compile/fx.py
Comment on lines +24 to +25
backward_graph_ids = [g_id for g_id, needs_backward in graph_order if needs_backward]
return not backward_graph_ids or graph_id == backward_graph_ids[0]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Release buckets on the last reducing backward graph

When the first graph in forward order needs backward but has no parameter reductions (for example a graph-break segment that only computes activation/input gradients), this helper marks that no-reduce graph as the one that should clear the shared buckets; however add_end_backward() immediately returns for graphs with no reduce_grad nodes, so no dc.end_backward(..., release_reduce_buckets=True) is inserted anywhere. Any earlier backward graph that did reduce gradients will flush/synchronize but keep reduce_buckets allocated into the optimizer step, defeating the lifetime fix in these multi-graph cases. Please base the release decision on the last backward graph that actually has reduce nodes, or otherwise insert a final cleanup even for no-reduce backward graphs.

Useful? React with 👍 / 👎.

if idx in optimizer._deepcompile_z1_current_grad_buffers:
optimizer._deepcompile_z1_current_grad_buffers[idx] = None

optimizer._deepcompile_z1_release_grad_buffers = release_grad_buffer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed something, but I don't find any reference to optimizer._deepcompile_z1_release_grad_buffers in the latest changes. Is it a dangling change after the refinement since the first draft?

Similarly, optimizer._deepcompile_z1_grad_buffer_metadata seems out-dated as well. Meanwhile, optimizer._deepcompile_z1_current_grad_buffers is still used by the unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants