fix distributed barrier imbalance in async_save_dcp by VincentCheungKokomo · Pull Request #1852 · InternLM/xtuner

VincentCheungKokomo · 2026-05-28T13:16:35Z

commit d66ff05
fix(engine): fix distributed barrier imbalance in async_save_dcp

Three correctness fixes for async_save_dcp on multi-rank training:

1. Add dist.barrier() after rank-0 rmtree+mkdir to prevent non-rank-0
   processes from writing into an incomplete directory while rank-0 is
   still cleaning up a stale .incomplete dir.
2. Use dist.all_reduce(MAX) to broadcast the retry/fatal decision so
   all ranks agree before raising or retrying. Without this, ranks that
   classify the exception differently would deadlock at the barrier.
3. Only query weights_dir.exists() on rank-0 and broadcast the result
   via dist.broadcast, ensuring all ranks raise FileExistsError (or
   proceed) together even under NFS cache inconsistency.

commit 7ee1557
[Fix] Fix async DCP checkpoint "received 0 items of ancdata" and add early failure detection

- Coalesce per-tensor shared memory into per-dtype buffers to reduce fd count
  from ~3000 to ~2 during daemon subprocess handoff, fixing the ancdata bug.
- Add warmup_async_save_dcp() to trigger daemon init before training starts,
  surfacing port conflicts (EADDRINUSE) immediately instead of mid-training.
- Add _check_async_save_health() to detect async save failures within one
  step rather than waiting until the next checkpoint interval.
- Allow snapshot saves to use async path.
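
The coalescing step might look something like the sketch below. It is an assumption about the general approach, not the PR's implementation: `plan_layout`, `coalesce`, and `view_of` are hypothetical helpers. The idea is that each shared-memory tensor handed to a subprocess can cost one file descriptor, so packing all tensors of a dtype into a single flat shared buffer leaves roughly one descriptor per dtype.

```python
import math
from collections import defaultdict


def plan_layout(specs):
    """Assign each tensor a slice of a per-dtype flat buffer.

    specs maps name -> (dtype, shape). Returns (totals, slots), where
    totals maps dtype -> element count and slots maps
    name -> (dtype, offset, numel, shape).
    """
    totals = defaultdict(int)
    slots = {}
    for name, (dtype, shape) in specs.items():
        numel = math.prod(shape)  # an empty shape () gives 1 (scalar tensor)
        slots[name] = (dtype, totals[dtype], numel, tuple(shape))
        totals[dtype] += numel
    return dict(totals), slots


def coalesce(state_dict):
    """Copy every tensor into one shared-memory buffer per dtype."""
    # torch is imported here so plan_layout() can be used without it.
    import torch

    specs = {k: (v.dtype, tuple(v.shape)) for k, v in state_dict.items()}
    totals, slots = plan_layout(specs)
    buffers = {dt: torch.empty(n, dtype=dt).share_memory_() for dt, n in totals.items()}
    for name, (dt, off, n, _shape) in slots.items():
        buffers[dt][off:off + n].copy_(state_dict[name].detach().reshape(-1))
    return buffers, slots


def view_of(buffers, slots, name):
    """Rebuild a tensor view from its slot without copying."""
    dt, off, n, shape = slots[name]
    return buffers[dt][off:off + n].view(shape)
```

Only `buffers` (a few tensors) and the plain-Python `slots` metadata would then need to cross the process boundary, and the receiving side can recover individual tensors with `view_of`.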

HAOCHENYE · 2026-05-30T11:29:32Z

+        # raise (or continue) together. Without this, NFS cache inconsistencies
+        # could cause some ranks to raise while others proceed to the barrier,
+        # resulting in a deadlock.
+        dir_exists = torch.tensor(int(weights_dir.exists() if dist.get_rank() == 0 else 0), dtype=torch.int32)


Can we only check if the directory exists on rank0?

This already only calls exists() on rank 0 (via weights_dir.exists() if dist.get_rank() == 0 else 0) — other ranks simply use 0.
The next line dist.broadcast(dir_exists, src=0, group=async_checkpoint_pg) then syncs rank 0's result to all ranks. This way all
ranks get a consistent decision and either all raise or all proceed together, avoiding a deadlock where NFS cache inconsistencies
cause some ranks to raise while others continue to the barrier.
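
A minimal sketch of the whole check, under the assumption that the group uses a CPU-capable backend such as gloo. The function names are hypothetical, and the PR passes `async_checkpoint_pg` as the group:

```python
from pathlib import Path


def local_exists_flag(weights_dir: Path, rank: int) -> int:
    """Only rank 0 touches the filesystem; every other rank contributes 0."""
    return int(weights_dir.exists()) if rank == 0 else 0


def ensure_absent_on_all_ranks(weights_dir: Path, group=None) -> None:
    """Collective call: all ranks raise FileExistsError, or none do."""
    # torch is imported here so local_exists_flag() can be used without it.
    import torch
    import torch.distributed as dist

    flag = torch.tensor(local_exists_flag(weights_dir, dist.get_rank()), dtype=torch.int32)
    # After the broadcast every rank holds rank 0's answer, regardless of
    # what its own (possibly stale) NFS view would have reported.
    dist.broadcast(flag, src=0, group=group)
    if flag.item():
        raise FileExistsError(f"{weights_dir} already exists")
```

Because the decision is taken from one rank's view and then shared, the number of ranks that reach the following barrier is always either all or none.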

fix distributed barrier imbalance in async_save_dcp

d66ff05

VincentCheungKokomo force-pushed the fix_distributed_barrier_imbalance branch from 00f88f4 to d66ff05 · May 28, 2026 13:34

HAOCHENYE reviewed May 30, 2026

add async DCP checkpoint early failure detection

7ee1557

fix distributed barrier imbalance in async_save_dcp #1852
VincentCheungKokomo wants to merge 2 commits into InternLM:main from VincentCheungKokomo:fix_distributed_barrier_imbalance

VincentCheungKokomo · May 30, 2026 (edited)
