Support EP rollout worker group recovery#1873
Conversation
|
@claude review |
|
Claude: ## Review Summary PR: Support EP rollout worker group recovery Good feature — when one worker in an EP group fails, the entire group must restart together since EP workers depend on each other. The ProduceBatchResult impact: not directly affected — recovery occurs in RoutedExperts impact: active workers in a failed peer's EP group are now shut down during recovery. If they hold routed-experts object refs from in-flight requests, those refs will be released at shutdown. No leak risk identified since recovery only runs between training steps (no active generation). IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES — the missing shutdown in |
33e3ae6 to
76e0cb9
Compare
|
@claude review |
|
Claude: ## Re-review Summary PR: Support EP rollout worker group recovery The previous critical issues (missing shutdown after failed recovery, discarded URL validation) have been properly addressed. The new code validates ProduceBatchResult impact: not affected — recovery runs in RoutedExperts impact: active workers in a failed peer's EP group are shut down and restarted atomically. Since recovery only runs between training steps (health checker paused), no in-flight routed-experts object refs should be held. ArchitectureThe refactored recovery design is sound:
Remaining IssuesWarning
Nit
VerdictAPPROVE — the core logic is correct and well-tested. The warnings are about operational visibility, not correctness. |
f3f4c54 to
96113f2
Compare
|
@claude review and if you think this pr is ready, give it an approve |
|
Claude: ## Final Review Summary PR: Support EP rollout worker group recovery ProduceBatchResult impact: not affected — recovery runs in RoutedExperts impact: not affected — recovery only runs between training steps while health checker is paused; no in-flight routed-experts object refs should be held. AssessmentThe critical issues from prior reviews have been properly addressed:
The EP group recovery design is sound — when one worker fails, the entire EP group is restarted atomically with coordinated Remaining Notes (non-blocking)
VerdictAPPROVE — the core logic is correct, well-structured, and all prior critical feedback has been addressed. |
877a2c7 to
5a89670
Compare
/health接口并不能很好的判断推理引擎真的ready并且可以接受请求了,所以在重启时,临时增加向推理引擎发送一个/v1/chat/completions请求,该接口返回正常,才认为推理引擎重启成功说明:目前rollout controller和 rollout health checker的功能耦合比较严重,边界非常不清晰,会在下一个pr (#1877) 进行重构,把关闭/重启等功能从controller移到health checker去