[Feature] Improve async DCP checkpoint reliability and early failure …#1859

Open
VincentCheungKokomo wants to merge 1 commit into InternLM:main from VincentCheungKokomo:add_async_save_health_check

VincentCheungKokomo commented May 31, 2026

Contributor

[Feature] Improve async DCP checkpoint reliability and early failure detection

  • Add warmup_async_save_dcp() to pre-initialize the checkpoint daemon before training, so that EADDRINUSE errors surface immediately instead of mid-training
  • Add _check_async_save_health() for per-step non-blocking failure detection on both async DCP
  • Coalesce per-tensor shared memory into per-dtype buffers, reducing the file descriptor (fd) count from ~3000 to ~2 and fixing "received 0 items of ancdata" errors
  • Remove retry logic in commit_async_save() (warmup makes it unnecessary)
  • Allow snapshot saves to use the async checkpoint path
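
The per-step health check described above can be illustrated with a minimal sketch. It is not the code in this PR: it assumes the pending save is tracked as a standard `concurrent.futures.Future`, and the function name `check_async_save_health` and its return convention are hypothetical. The idea is to call `done()` first so the training step never waits, and to read `exception()` only once the save has finished.

```python
from concurrent.futures import Future
from typing import Optional


def check_async_save_health(pending: Optional[Future]) -> Optional[Future]:
    """Poll a pending save without blocking (hypothetical helper).

    Returns the future while it is still running, or None once it has
    finished successfully, so the caller can stop tracking it.
    Raises RuntimeError if the save has already failed.
    """
    if pending is None or not pending.done():
        # Nothing to report yet; done() returns immediately.
        return pending
    # Safe to call without a timeout because done() was True.
    error = pending.exception()
    if error is not None:
        raise RuntimeError("async checkpoint save failed") from error
    return None
```

In a training loop, this would be called once per step with the future from the most recent save, so a failed save stops the run at the next step rather than at the next checkpoint.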
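
The per-dtype coalescing can also be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: tensors are stood in for by raw `bytes`, the buffers are ordinary `bytearray` objects rather than shared memory, alignment is ignored, and the names `coalesce_by_dtype` and `read_back` are hypothetical. It shows why the buffer count drops to the number of distinct dtypes: each tensor becomes an (offset, length) entry in one flat buffer instead of owning a segment of its own.

```python
from collections import defaultdict
from typing import Dict, Tuple

# name -> (dtype label, raw bytes); stand-ins for real tensors
Tensors = Dict[str, Tuple[str, bytes]]
# name -> (dtype label, byte offset, byte length)
Index = Dict[str, Tuple[str, int, int]]


def coalesce_by_dtype(tensors: Tensors) -> Tuple[Dict[str, bytearray], Index]:
    """Pack every tensor into one flat buffer per dtype."""
    buffers: Dict[str, bytearray] = defaultdict(bytearray)
    index: Index = {}
    for name, (dtype, payload) in tensors.items():
        buf = buffers[dtype]
        # Record where this tensor starts before appending it.
        index[name] = (dtype, len(buf), len(payload))
        buf.extend(payload)
    return dict(buffers), index


def read_back(buffers: Dict[str, bytearray], index: Index, name: str) -> bytes:
    """Recover one tensor's bytes from its dtype buffer."""
    dtype, offset, nbytes = index[name]
    return bytes(buffers[dtype][offset:offset + nbytes])
```

With three tensors of two dtypes, `coalesce_by_dtype` produces two buffers; with thousands of tensors of the same two dtypes, it would still produce two.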

VincentCheungKokomo force-pushed the add_async_save_health_check branch from d8317a2 to 2c7a355 on June 2, 2026 11:54
VincentCheungKokomo force-pushed the add_async_save_health_check branch from 2c7a355 to f8c6bd7 on June 3, 2026 06:50