Support EP rollout worker group recovery by YanhuiDua · Pull Request #1873 · InternLM/xtuner

YanhuiDua · 2026-06-04T08:28:11Z

支持当ep_size>1时，当一个ep_group中某个worker挂了，会把这个worker所在的ep_group全部重启；
增强了 shutdown 功能，当一个worker的serverl_url不能访问时，才认为这个worker已经被关闭了，并且进行重启
增强了重启功能，目前的/health接口并不能很好的判断推理引擎真的ready并且可以接受请求了，所以在重启时，临时增加向推理引擎发送一个/v1/chat/completions请求，该接口返回正常，才认为推理引擎重启成功

说明：目前rollout controller和 rollout health checker的功能耦合比较严重，边界非常不清晰，会在下一个pr (#1877) 进行重构，把关闭/重启等功能从controller移到health checker去

YanhuiDua · 2026-06-04T08:28:21Z

@claude review

claude · 2026-06-04T08:35:18Z

Claude: ## Review Summary

PR: Support EP rollout worker group recovery

Good feature — when one worker in an EP group fails, the entire group must restart together since EP workers depend on each other. The _build_rank_to_ep_group helper and group-based recovery logic are well structured.

ProduceBatchResult impact: not directly affected — recovery occurs in ensure_workers_healthy_before_training before any batch accounting.

RoutedExperts impact: active workers in a failed peer's EP group are now shut down during recovery. If they hold routed-experts object refs from in-flight requests, those refs will be released at shutdown. No leak risk identified since recovery only runs between training steps (no active generation).

Issues

Critical

controller.py:399-404 — _restart_worker_group doesn't shut down workers after failed health check. Workers are left running and consuming GPU memory. Multiple tests expect shutdown calls here that the code doesn't produce — these tests will likely fail as written.

Warning

controller.py:389-393 — URL validation removed. Old code asserted url == expected_url after restart to prevent silent weight-sync failures. New code discards the init() return value entirely. While dist_init_addr makes port reuse likely, this removes a defense-in-depth check.
Test/code mismatch for URL-change scenario: test expects [check_health, shutdown, init, shutdown] (no post-init health check), but code unconditionally calls check_health after init.

Nit

utils.py:187-193 — The detection-to-recovery gap (health checker marks inactive → next training barrier runs recovery) leaves workers holding GPU memory longer than before. Acceptable design tradeoff, but worth documenting.

Verdict

REQUEST_CHANGES — the missing shutdown in _restart_worker_group is both a resource leak and a test-code mismatch that will cause test failures.

YanhuiDua · 2026-06-05T06:19:20Z

@claude review

claude · 2026-06-05T06:23:56Z

Claude: ## Re-review Summary

PR: Support EP rollout worker group recovery

The previous critical issues (missing shutdown after failed recovery, discarded URL validation) have been properly addressed. The new code validates init return values and calls _cleanup_inactive_worker_group on all failure paths.

ProduceBatchResult impact: not affected — recovery runs in ensure_workers_healthy_before_training before batch accounting.

RoutedExperts impact: active workers in a failed peer's EP group are shut down and restarted atomically. Since recovery only runs between training steps (health checker paused), no in-flight routed-experts object refs should be held.

Architecture

The refactored recovery design is sound:

Health checker detects failures (mark inactive only, no cleanup)
Training barrier triggers group recovery (shutdown whole EP group → init → verify inference readiness)
Chat completion probe after restart catches the "server healthy but engine not ready" gap

Remaining Issues

Warning

ray.get(init_refs, timeout=ROLLOUT_RAY_GET_TIMEOUT) uses the 5-hour default during recovery. A single hung init stalls the training barrier silently for hours. Consider a shorter recovery-specific timeout.
Worker shutdown + server-down wait can cascade to ~6 minutes per worker before giving up. Acceptable but operators should be aware of worst-case recovery time for multi-worker EP groups.

Nit

lmdeploy.py timeout change (60s → 600s) is unrelated to EP recovery; consider separating.
Log messaging for EP group recovery could be clearer about which ranks are healthy but restarting due to peer failure.

Verdict

APPROVE — the core logic is correct and well-tested. The warnings are about operational visibility, not correctness.

YanhuiDua · 2026-06-08T08:29:53Z

@claude review and if you think this pr is ready, give it an approve

claude · 2026-06-08T08:35:00Z

Claude: ## Final Review Summary

PR: Support EP rollout worker group recovery

ProduceBatchResult impact: not affected — recovery runs in ensure_workers_healthy_before_training before batch accounting.

RoutedExperts impact: not affected — recovery only runs between training steps while health checker is paused; no in-flight routed-experts object refs should be held.

Assessment

The critical issues from prior reviews have been properly addressed:

✓ Shutdown/cleanup on all failure paths (_cleanup_inactive_worker_group)
✓ URL validation restored after init with early return on mismatch
✓ dist_init_addr preserved in WorkerInfo and passed back on restart
✓ Chat completion probe catches "healthy but not ready" gap

The EP group recovery design is sound — when one worker fails, the entire EP group is restarted atomically with coordinated dist_init_addr.

Remaining Notes (non-blocking)

Session URL regression: session_url is no longer populated in WorkerInfo (neither at init nor after recovery). get_rollout_info() will produce empty session URL dicts. Acceptable if session routing is being moved to PR Refine rollout worker health check and recovery lifecycle #1877, but worth confirming.
5-hour ROLLOUT_RAY_GET_TIMEOUT during recovery (from prior review): still applies but is a pre-existing concern, not introduced by this PR.

Verdict

APPROVE — the core logic is correct, well-structured, and all prior critical feedback has been addressed.

claude

Claude: Approving — critical issues from prior reviews are resolved. The EP group recovery logic is sound. One non-blocking note about session_url no longer being populated (inline comment).

claude Bot reviewed Jun 4, 2026

View reviewed changes