Skip to content

Support EP rollout worker group recovery#1873

Open
YanhuiDua wants to merge 4 commits into
InternLM:mainfrom
YanhuiDua:fix-health-check-part2
Open

Support EP rollout worker group recovery#1873
YanhuiDua wants to merge 4 commits into
InternLM:mainfrom
YanhuiDua:fix-health-check-part2

Conversation

@YanhuiDua

@YanhuiDua YanhuiDua commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator
  1. 支持当ep_size>1时,当一个ep_group中某个worker挂了,会把这个worker所在的ep_group全部重启;
  2. 增强了 shutdown 功能,当一个worker的serverl_url不能访问时,才认为这个worker已经被关闭了,并且进行重启
  3. 增强了重启功能,目前的/health接口并不能很好的判断推理引擎真的ready并且可以接受请求了,所以在重启时,临时增加向推理引擎发送一个/v1/chat/completions请求,该接口返回正常,才认为推理引擎重启成功

说明:目前rollout controller和 rollout health checker的功能耦合比较严重,边界非常不清晰,会在下一个pr (#1877) 进行重构,把关闭/重启等功能从controller移到health checker去

@YanhuiDua

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/utils.py
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Review Summary

PR: Support EP rollout worker group recovery

Good feature — when one worker in an EP group fails, the entire group must restart together since EP workers depend on each other. The _build_rank_to_ep_group helper and group-based recovery logic are well structured.

ProduceBatchResult impact: not directly affected — recovery occurs in ensure_workers_healthy_before_training before any batch accounting.

RoutedExperts impact: active workers in a failed peer's EP group are now shut down during recovery. If they hold routed-experts object refs from in-flight requests, those refs will be released at shutdown. No leak risk identified since recovery only runs between training steps (no active generation).

Issues

Critical

  • controller.py:399-404_restart_worker_group doesn't shut down workers after failed health check. Workers are left running and consuming GPU memory. Multiple tests expect shutdown calls here that the code doesn't produce — these tests will likely fail as written.

Warning

  • controller.py:389-393 — URL validation removed. Old code asserted url == expected_url after restart to prevent silent weight-sync failures. New code discards the init() return value entirely. While dist_init_addr makes port reuse likely, this removes a defense-in-depth check.
  • Test/code mismatch for URL-change scenario: test expects [check_health, shutdown, init, shutdown] (no post-init health check), but code unconditionally calls check_health after init.

Nit

  • utils.py:187-193 — The detection-to-recovery gap (health checker marks inactive → next training barrier runs recovery) leaves workers holding GPU memory longer than before. Acceptable design tradeoff, but worth documenting.

Verdict

REQUEST_CHANGES — the missing shutdown in _restart_worker_group is both a resource leak and a test-code mismatch that will cause test failures.

@YanhuiDua YanhuiDua force-pushed the fix-health-check-part2 branch 2 times, most recently from 33e3ae6 to 76e0cb9 Compare June 5, 2026 06:18
@YanhuiDua

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/worker.py
Comment thread xtuner/v1/rl/rollout/controller.py
Comment thread xtuner/v1/rl/rollout/lmdeploy.py
@claude

claude Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Re-review Summary

PR: Support EP rollout worker group recovery

The previous critical issues (missing shutdown after failed recovery, discarded URL validation) have been properly addressed. The new code validates init return values and calls _cleanup_inactive_worker_group on all failure paths.

ProduceBatchResult impact: not affected — recovery runs in ensure_workers_healthy_before_training before batch accounting.

RoutedExperts impact: active workers in a failed peer's EP group are shut down and restarted atomically. Since recovery only runs between training steps (health checker paused), no in-flight routed-experts object refs should be held.

Architecture

The refactored recovery design is sound:

  1. Health checker detects failures (mark inactive only, no cleanup)
  2. Training barrier triggers group recovery (shutdown whole EP group → init → verify inference readiness)
  3. Chat completion probe after restart catches the "server healthy but engine not ready" gap

Remaining Issues

Warning

  • ray.get(init_refs, timeout=ROLLOUT_RAY_GET_TIMEOUT) uses the 5-hour default during recovery. A single hung init stalls the training barrier silently for hours. Consider a shorter recovery-specific timeout.
  • Worker shutdown + server-down wait can cascade to ~6 minutes per worker before giving up. Acceptable but operators should be aware of worst-case recovery time for multi-worker EP groups.

Nit

  • lmdeploy.py timeout change (60s → 600s) is unrelated to EP recovery; consider separating.
  • Log messaging for EP group recovery could be clearer about which ranks are healthy but restarting due to peer failure.

Verdict

APPROVE — the core logic is correct and well-tested. The warnings are about operational visibility, not correctness.

@YanhuiDua YanhuiDua force-pushed the fix-health-check-part2 branch 2 times, most recently from f3f4c54 to 96113f2 Compare June 8, 2026 06:47
@YanhuiDua

Copy link
Copy Markdown
Collaborator Author

@claude review and if you think this pr is ready, give it an approve

Comment thread xtuner/v1/rl/rollout/controller.py
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Final Review Summary

PR: Support EP rollout worker group recovery

ProduceBatchResult impact: not affected — recovery runs in ensure_workers_healthy_before_training before batch accounting.

RoutedExperts impact: not affected — recovery only runs between training steps while health checker is paused; no in-flight routed-experts object refs should be held.

Assessment

The critical issues from prior reviews have been properly addressed:

  • ✓ Shutdown/cleanup on all failure paths (_cleanup_inactive_worker_group)
  • ✓ URL validation restored after init with early return on mismatch
  • dist_init_addr preserved in WorkerInfo and passed back on restart
  • ✓ Chat completion probe catches "healthy but not ready" gap

The EP group recovery design is sound — when one worker fails, the entire EP group is restarted atomically with coordinated dist_init_addr.

Remaining Notes (non-blocking)

  • Session URL regression: session_url is no longer populated in WorkerInfo (neither at init nor after recovery). get_rollout_info() will produce empty session URL dicts. Acceptable if session routing is being moved to PR Refine rollout worker health check and recovery lifecycle #1877, but worth confirming.
  • 5-hour ROLLOUT_RAY_GET_TIMEOUT during recovery (from prior review): still applies but is a pre-existing concern, not introduced by this PR.

Verdict

APPROVE — the core logic is correct, well-structured, and all prior critical feedback has been addressed.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Approving — critical issues from prior reviews are resolved. The EP group recovery logic is sound. One non-blocking note about session_url no longer being populated (inline comment).

@YanhuiDua YanhuiDua force-pushed the fix-health-check-part2 branch from 877a2c7 to 5a89670 Compare June 9, 2026 07:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant