fix distributed barrier imbalance in async_save_dcp#1852
Open
VincentCheungKokomo wants to merge 2 commits into
Open
fix distributed barrier imbalance in async_save_dcp#1852VincentCheungKokomo wants to merge 2 commits into
VincentCheungKokomo wants to merge 2 commits into
Conversation
00f88f4 to
d66ff05
Compare
HAOCHENYE
reviewed
May 30, 2026
| # raise (or continue) together. Without this, NFS cache inconsistencies | ||
| # could cause some ranks to raise while others proceed to the barrier, | ||
| # resulting in a deadlock. | ||
| dir_exists = torch.tensor(int(weights_dir.exists() if dist.get_rank() == 0 else 0), dtype=torch.int32) |
Collaborator
There was a problem hiding this comment.
Can we only check if the directory exists on rank0?
Contributor
Author
There was a problem hiding this comment.
This already only calls exists() on rank 0 (via weights_dir.exists() if dist.get_rank() == 0 else 0) — other ranks simply use 0.
The next line dist.broadcast(dir_exists, src=0, group=async_checkpoint_pg) then syncs rank 0's result to all ranks. This way all
ranks get a consistent decision and either all raise or all proceed together, avoiding a deadlock where NFS cache inconsistencies
cause some ranks to raise while others continue to the barrier.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
commit d66ff05
fix(engine): fix distributed barrier imbalance in async_save_dcp
Three correctness fixes for async_save_dcp on multi-rank training:
Add dist.barrier() after rank-0 rmtree+mkdir to prevent non-rank-0
processes from writing into an incomplete directory while rank-0 is
still cleaning up a stale .incomplete dir.
Use dist.all_reduce(MAX) to broadcast the retry/fatal decision so
all ranks agree before raising or retrying. Without this, ranks that
classify the exception differently would deadlock at the barrier.
Only query weights_dir.exists() on rank-0 and broadcast the result
via dist.broadcast, ensuring all ranks raise FileExistsError (or
proceed) together even under NFS cache inconsistency.
commit 7ee1557
[Fix] Fix async DCP checkpoint "received 0 items of ancdata" and add early failure detection
Coalesce per-tensor shared memory into per-dtype buffers to reduce fd count
from ~3000 to ~2 during daemon subprocess handoff, fixing the ancdata bug.
Add warmup_async_save_dcp() to trigger daemon init before training starts,
surfacing port conflicts (EADDRINUSE) immediately instead of mid-training.
Add _check_async_save_health() to detect async save failures within one
step rather than waiting until the next checkpoint interval.
Allow snapshot saves to use async path.