Fix DeepCompile ZeRO-1 grad target lifetime#8036
Conversation
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1947a25a80
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…1-grad-buffer-protocol Use generic flat grad partition protocol
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f98f7290c0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| backward_graph_ids = [g_id for g_id, needs_backward in graph_order if needs_backward] | ||
| return not backward_graph_ids or graph_id == backward_graph_ids[0] |
There was a problem hiding this comment.
Release buckets on the last reducing backward graph
When the first graph in forward order needs backward but has no parameter reductions (for example a graph-break segment that only computes activation/input gradients), this helper marks that no-reduce graph as the one that should clear the shared buckets; however add_end_backward() immediately returns for graphs with no reduce_grad nodes, so no dc.end_backward(..., release_reduce_buckets=True) is inserted anywhere. Any earlier backward graph that did reduce gradients will flush/synchronize but keep reduce_buckets allocated into the optimizer step, defeating the lifetime fix in these multi-graph cases. Please base the release decision on the last backward graph that actually has reduce nodes, or otherwise insert a final cleanup even for no-reduce backward graphs.
Useful? React with 👍 / 👎.
| if idx in optimizer._deepcompile_z1_current_grad_buffers: | ||
| optimizer._deepcompile_z1_current_grad_buffers[idx] = None | ||
|
|
||
| optimizer._deepcompile_z1_release_grad_buffers = release_grad_buffer |
There was a problem hiding this comment.
I may have missed something, but I don't find any reference to optimizer._deepcompile_z1_release_grad_buffers in the latest changes. Is it a dangling change after the refinement since the first draft?
Similarly, optimizer._deepcompile_z1_grad_buffer_metadata seems out-dated as well. Meanwhile, optimizer._deepcompile_z1_current_grad_buffers is still used by the unit tests.
DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into the optimizer step, causing backward gradient storage to overlap with optimizer temporaries. This PR fixes the issue by making DeepCompile ZeRO-1 reduce targets follow the normal step-local ZeRO partition gradient-buffer lifetime, instead of preserving cloned target storage from compile setup.
The actual code changes are:
Fine-tuning style training (8xH100, Qwen3-8B random weights, bs/GPU=1, seq=4096, GAS=1) showed only finite value losses for 1000 steps.