[RL][Feature] Add GDR streaming weight update path #7951
Thanks for your contribution!
Force-pushed: 63b7abc to b541159
Replace the custom async-to-sync GDR iterator with checkpoint_transfer's built-in receive_weights_sync API. Introduce the FD_USE_CHECKPOINT_TRANSFER env flag to enable the unified path, which uses load_strategy to distinguish GDR (GPU_DIRECT) vs IPC backends internally.

Key changes:
- Add update_weights_by_ct() and _build_ct_transfer_config() methods
- Remove update_weights_by_gdr, _receive_gdr_weights_as_sync_iterator, _resolve_transfer_mode (replaced by CT unified path)
- Remove transfer_mode param from update_weights_by_rdma
- Revert metax GDR additions (metax doesn't use CT path)
- IPC mode: always clear cache + rebuild CUDA graph (shared GPU)
- GDR mode: skip cache clear by default (training-inference separation)
- Rewrite tests to cover both GDR and IPC CT paths

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
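The mode-dependent behavior in the commit message (IPC always clears the cache and rebuilds the CUDA graph, GDR skips the cache clear by default) can be read as a small decision table. The sketch below is illustrative only: `PostUpdatePlan`, `use_checkpoint_transfer`, and `plan_post_update` are hypothetical names, the set of accepted truthy flag values is an assumption, and tying the CUDA graph rebuild to the cache clear in GDR mode is also an assumption, since the commit message does not say what GDR does with the graph.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PostUpdatePlan:
    """Hypothetical record of what to do after new weights have landed."""
    clear_cache: bool
    rebuild_cuda_graph: bool


def use_checkpoint_transfer(environ=None) -> bool:
    """Report whether FD_USE_CHECKPOINT_TRANSFER is set to a truthy value.

    The accepted spellings ("1", "true", "yes", "on") are an assumption.
    """
    env = os.environ if environ is None else environ
    value = env.get("FD_USE_CHECKPOINT_TRANSFER", "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def plan_post_update(load_strategy: str, gdr_release_cache: bool = False) -> PostUpdatePlan:
    """Map a load strategy to post-update actions, per the commit message."""
    strategy = load_strategy.strip().lower()
    if strategy == "ipc":
        # Shared GPU: always clear the cache and rebuild the CUDA graph.
        return PostUpdatePlan(clear_cache=True, rebuild_cuda_graph=True)
    if strategy == "gpu_direct":
        # Training and inference are separated: skip the cache clear unless
        # asked. Coupling the graph rebuild to the cache clear is an assumption.
        return PostUpdatePlan(clear_cache=gdr_release_cache,
                              rebuild_cuda_graph=gdr_release_cache)
    raise ValueError(f"unknown load_strategy: {load_strategy!r}")
```

Keeping the policy in one pure function makes both paths easy to cover in unit tests without touching a GPU.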
Force-pushed: b541159 to 0e0be8a
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-29 19:25:47
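📋 Review Summary
PR overview: Adds a GPU Direct RDMA (GDR) streaming weight update path that uses the CheckpointTransfer library to stream weights from the training side directly into inference-side GPU memory.
Scope of changes: fastdeploy/rl/, fastdeploy/worker/, fastdeploy/model_executor/
Impact tags: [RL] [Loader]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | gpu_model_runner.py:3144 | The finalize_update() call added on the RDMA path has not been mirrored in metax_model_runner.py |
| 🟡 Suggestion | dynamic_weight_manager.py:42 | _tensor_to_numpy_for_digest is defined identically in two files |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| PR conventions | PR description was missing required sections | ✅ Fixed |
📝 PR convention check
The PR title [RL][Feature] Add GDR streaming weight update path is correctly formatted, and the description contains all required sections (Motivation, Modifications, Usage or Command, Accuracy Tests, Checklist), so it meets the conventions.
Overall assessment
The implementation approach is clear, the GDR streaming weight update path is well designed, and unit test coverage is thorough. Two improvements deserve attention: keeping the hardware backends in sync and reducing code duplication.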
        return result
    else:
        result = self.dynamic_weight_manager.update_weights_by_rdma(version, verify_checksum)
        self.dynamic_weight_manager.finalize_update()
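🟡 Suggestion: The RDMA branch now calls finalize_update() (which validates parameters and updates shared-memory state), but update_weights() at metax_model_runner.py:2554 still uses the old logic (it returns directly, with no finalize).
This is a behavior change: previously the GPU RDMA path did not call finalize_update() either, and the call has now been added. Consider mirroring it in metax: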
    def update_weights(self, version: str = None, verify_checksum: bool = False):
        result = self.dynamic_weight_manager.update_weights_by_rdma(version, verify_checksum)
        self.dynamic_weight_manager.finalize_update()
        return result

    )

    def _tensor_to_numpy_for_digest(tensor):
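🟡 Suggestion: _tensor_to_numpy_for_digest is identical to the implementation in default_loader_v1.py (including the cannot pickle exception handling).
Consider extracting it into a shared module (e.g. fastdeploy/model_executor/utils.py) to avoid maintaining duplicates, so that a future bug fix only has to change one place.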
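As a rough illustration of the deduplication suggested above, a single shared helper could live in one module and be imported by both callers. The body of the real _tensor_to_numpy_for_digest is not shown in this thread, so everything below is a stand-in under stated assumptions: `to_bytes_for_digest` and `weight_digest` are hypothetical names, the duck-typed conversion order is invented for the example, and the cannot pickle handling mentioned by the reviewer is not reproduced.

```python
import array
import hashlib


def to_bytes_for_digest(tensor) -> bytes:
    """Hypothetical shared helper: normalize a tensor-like object to raw bytes.

    Assumption: objects expose either .numpy(), .tobytes(), or the buffer
    protocol. The real helper's conversion rules are not shown in the PR.
    """
    if hasattr(tensor, "numpy"):
        tensor = tensor.numpy()          # framework tensor to array-like
    if hasattr(tensor, "tobytes"):
        return tensor.tobytes()          # array.array, memoryview, ndarray
    return bytes(memoryview(tensor))     # anything else with a buffer


def weight_digest(name: str, tensor) -> str:
    """Hash the parameter name together with its raw bytes."""
    h = hashlib.sha256()
    h.update(name.encode("utf-8"))
    h.update(b"\0")                      # separator between name and payload
    h.update(to_bytes_for_digest(tensor))
    return h.hexdigest()


# Both the dynamic weight manager and the loader would import the helper
# above instead of defining their own copy.
w = array.array("f", [1.0, 2.0])
digest = weight_digest("layer.0.weight", w)
```

With one definition, a fix to the conversion logic reaches both call sites at once, which is the point of the review comment.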
The CI report is generated from the following code (updated every 30 minutes):
1 Task overview
CI currently has 3 failed required tasks, which must be resolved before the PR can be merged.
2 Task status summary
2.1 Required tasks: 7/10 passed
2.2 Optional tasks: 28/30 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: infrastructure (confidence: medium)
Fix summary: flaky issue, please rerun
Pre Commit: code style (confidence: high)
Fix summary: run pre-commit locally to fix the code formatting, then resubmit
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7951 +/- ##
==========================================
Coverage ? 67.50%
==========================================
Files ? 467
Lines ? 65491
Branches ? 10068
==========================================
Hits ? 44207
Misses ? 18453
Partials ? 2831
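Motivation
In RL scenarios, model weight updates currently support only the RDMA path (copying through shared memory), which requires writing the weights to disk and then loading them onto the GPU. This PR adds a GPU Direct RDMA (GDR) streaming weight update path that uses the CheckpointTransfer library's GPU Direct capability to stream training-side weights directly into inference-side GPU memory, avoiding the intermediate disk write and the CPU-GPU copy overhead and significantly reducing weight update latency.
Modifications
- fastdeploy/rl/dynamic_weight_manager.py:
  - update_weights_by_gdr() method: receives streamed weights through CheckpointTransfer's GPU_DIRECT backend
  - _receive_gdr_weights_as_sync_iterator(): converts the asynchronous receive_weights into a synchronous iterator, with a prefetch queue
  - _load_models_from_weight_iterator(): supports streaming loading of the main model plus the MTP auxiliary model; MTP weights are loaded in chunks
  - _resolve_weight_update_version() and _resolve_transfer_mode() as shared methods
  - clear_parameters exception caused by ProcessGroupGloo having no shutdown() method
- fastdeploy/worker/gpu_model_runner.py:
  - update_weights() gains a transfer_mode parameter, with support in GDR mode for freeing GPU memory (gdr_release_cache config)
- fastdeploy/worker/gpu_worker.py / metax_worker.py / metax_model_runner.py:
  - transfer_mode parameter to the model runner layer
- fastdeploy/entrypoints/openai/api_server.py:
  - /v1/update_weights endpoint gains a transfer_mode parameter (allowed values: rdma, gdr)
- fastdeploy/model_executor/utils.py:
  - _is_gdr_dynamic_load_config() and _copy_gdr_transposed_weight_attrs() utility functions
  - process_weight_transpose automatically copies the weight loading attributes to the transposed parameter, so that weight_loader can shard correctly during streaming loading
- Tests:
  - tests/rl/test_dynamic_weight_gdr.py: unit tests covering the GDR iterator, MTP chunked loading, debug digest, exception handling, etc.
  - tests/engine/test_common_engine.py: version mismatch validation test in GDR mode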
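The async-to-sync conversion with a prefetch queue described for _receive_gdr_weights_as_sync_iterator() can be sketched with the standard library alone. This is a generic illustration of the technique, not the PR's code and not CheckpointTransfer's API: `iterate_async_as_sync`, `_Failure`, and `fake_receive_weights` are hypothetical names, and the fake producer merely stands in for an asynchronous weight stream.

```python
import asyncio
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


class _Failure:
    """Carries a producer-side exception across the thread boundary."""
    def __init__(self, exc):
        self.exc = exc


def iterate_async_as_sync(make_async_iter, prefetch: int = 2):
    """Yield items of an async iterator from synchronous code.

    A background thread runs its own event loop; a bounded queue lets the
    producer run at most `prefetch` items ahead of the consumer.
    """
    buf = queue.Queue(maxsize=prefetch)
    stop = threading.Event()

    def _put(item):
        # Retry with a timeout so the producer notices an early consumer exit.
        while not stop.is_set():
            try:
                buf.put(item, timeout=0.1)
                return
            except queue.Full:
                continue

    async def _pump():
        try:
            async for item in make_async_iter():
                if stop.is_set():
                    break
                await asyncio.to_thread(_put, item)
            await asyncio.to_thread(_put, _DONE)
        except Exception as exc:
            await asyncio.to_thread(_put, _Failure(exc))

    worker = threading.Thread(target=lambda: asyncio.run(_pump()), daemon=True)
    worker.start()
    try:
        while True:
            item = buf.get()
            if item is _DONE:
                return
            if isinstance(item, _Failure):
                raise item.exc  # re-raise the producer error in the consumer
            yield item
    finally:
        stop.set()
        worker.join(timeout=1.0)


async def fake_receive_weights():
    """Stand-in for an async weight stream; yields (name, payload) pairs."""
    for i in range(5):
        await asyncio.sleep(0)
        yield (f"layer.{i}.weight", i)


names = [name for name, _ in iterate_async_as_sync(fake_receive_weights)]
```

The bounded queue is the design choice that matters here: it caps how many received chunks are held in memory while the loader is still consuming earlier ones. A later commit in this PR drops this kind of adapter in favor of the library's built-in receive_weights_sync.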
Usage or Command
Accuracy Tests
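This PR does not change the model's forward computation logic; model output after a weight update matches the RDMA path. GDR transfer correctness is verified as follows: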
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.