[RL] enforce cuda graph clearing and rebuilding in rsync weight update by rdma#8085

Open
liyonghua0910 wants to merge 2 commits into PaddlePaddle:develop from liyonghua0910:develop+20260629_fix_rsync

@liyonghua0910

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-29 15:19:09

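📋 Review summary

PR overview: Adjusts the flow that clears and rebuilds the CUDA Graph, KV cache, and input caches before and after rsync/RDMA weight updates.
Scope of change: fastdeploy/worker/gpu_model_runner.py
Affected tags: [RL] [Graph Optimization] [KVCache]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/worker/gpu_model_runner.py:3313 | In non-MTP scenarios, the GDR release-cache rebuild accesses `self.proposer`, which is `None` |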

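📝 PR convention check

The title follows the convention. The PR description contains the template sections, but Motivation, Modifications, Usage or Command, and Accuracy Tests are still empty or hold placeholder content. Consider replacing them with a complete, readable description.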

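Suggested PR description (can be copied directly)
## Motivation
Fix an issue where CUDA Graphs were not cleared and rebuilt promptly when weights changed during rsync/RDMA weight updates, so that stale graph state is no longer reused after a weight update.

## Modifications
In `fastdeploy/worker/gpu_model_runner.py`, extract the memory-handling logic that runs before and after a weight update:
- GDR release-cache path: clear the KV cache and CUDA Graph before the weight update and rebuild them afterwards.
- RDMA weight-update path: clear the CUDA Graph before the update and rebuild it after the update.
- After the weight update, reset `share_inputs`, the MTP model inputs, and the cached model/sampler output state.

## Usage or Command
N/A

## Accuracy Tests
N/A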

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

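Overall assessment

The current implementation accesses a `None` object in a way that blocks GDR release-cache weight updates. The MTP condition guard must be restored before this PR is merged.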


# Rebuild cache on model runner
if not self.enable_cache_manager_v1:
    self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)

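🔴 Bug: in non-MTP scenarios this code unconditionally accesses `self.proposer`, which is `None`

`_init_speculative_proposer()` sets `self.proposer` to `None` when `spec_method is None`, but this rebuild logic is gated only by `rebuild_kv_cache` and `enable_cache_manager_v1`. Whenever GDR weight updates are used with `gdr_release_cache` enabled, an ordinary model without speculative decoding and with cache manager v1 disabled raises `AttributeError` during the rebuild phase. The weight update then fails and the KV cache cannot be restored to a state that can serve requests.

Suggested fix: keep the MTP condition from the old logic and initialize the proposer cache only when `self.spec_method == SpecMethod.MTP`. The main model cache is still initialized independently.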

if self.spec_method == SpecMethod.MTP:
    if not self.enable_cache_manager_v1:
        self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
self.initialize_kv_cache()
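
The failure mode and the suggested guard can be reproduced in isolation. The following is a minimal sketch, not the FastDeploy implementation: `FakeRunner`, `FakeProposer`, the local `SpecMethod` enum, and the `rebuild_unguarded` / `rebuild_guarded` method names are hypothetical stand-ins that model only the fields mentioned in this comment.

```python
from enum import Enum
from typing import Optional


class SpecMethod(Enum):
    """Hypothetical stand-in for the project's SpecMethod enum."""
    MTP = "mtp"


class FakeProposer:
    """Hypothetical proposer that records the block count it was given."""

    def __init__(self) -> None:
        self.num_blocks: Optional[int] = None

    def initialize_kv_cache(self, main_model_num_blocks: int) -> None:
        self.num_blocks = main_model_num_blocks


class FakeRunner:
    """Toy model runner; only the fields the rebuild step touches are modelled."""

    def __init__(self, spec_method: Optional[SpecMethod],
                 enable_cache_manager_v1: bool = False) -> None:
        self.spec_method = spec_method
        self.enable_cache_manager_v1 = enable_cache_manager_v1
        self.num_gpu_blocks = 128  # arbitrary value for the sketch
        # As described above: no proposer exists when spec_method is None.
        self.proposer = FakeProposer() if spec_method is not None else None
        self.main_cache_ready = False

    def initialize_kv_cache(self) -> None:
        self.main_cache_ready = True

    def rebuild_unguarded(self) -> None:
        # Shape of the reviewed code: the proposer is touched without an MTP check.
        if not self.enable_cache_manager_v1:
            self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()

    def rebuild_guarded(self) -> None:
        # Shape of the suggested fix: proposer cache only under MTP,
        # main model cache always.
        if self.spec_method == SpecMethod.MTP:
            if not self.enable_cache_manager_v1:
                self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()
```

With `spec_method=None`, `rebuild_unguarded()` raises `AttributeError` before the main cache is initialized, while `rebuild_guarded()` skips the proposer and still initializes the main cache.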

@codecov-commenter

codecov-commenter commented Jun 29, 2026

Codecov Report

❌ Patch coverage is 6.45161% with 29 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/worker/gpu_model_runner.py | 6.45% | 29 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8085   +/-   ##
==========================================
  Coverage           ?   67.55%           
==========================================
  Files              ?      475           
  Lines              ?    66913           
  Branches           ?    10323           
==========================================
  Hits               ?    45202           
  Misses             ?    18836           
  Partials           ?     2875           

| Flag | Coverage Δ |
|---|---|
| GPU | 77.57% <6.45%> (?) |
| XPU | 6.95% <0.00%> (?) |

Flags with carried forward coverage won't be shown.
