[RL] enforce cuda graph clearing and rebuilding in rsync weight update by rdma by liyonghua0910 · Pull Request #8085 · PaddlePaddle/FastDeploy

liyonghua0910 · 2026-06-29T06:57:46Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…e by rdma

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-29 15:19:09

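📋 Review Summary

PR overview: Adjusts how the CUDA Graph, KV cache, and input buffers are cleared and rebuilt before and after an rsync/RDMA weight update.
Scope of changes: fastdeploy/worker/gpu_model_runner.py
Impact tags: [RL] [Graph Optimization] [KVCache]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/worker/gpu_model_runner.py:3313`	In non-MTP scenarios, the GDR release cache rebuild accesses `self.proposer`, which is `None`

📝 PR Convention Check

The title follows the convention. The PR description contains the template sections, but Motivation, Modifications, Usage or Command, and Accuracy Tests are still empty or placeholder content; replace them with a complete, readable description.

Suggested PR description (can be copied directly)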

## Motivation
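Fix an issue where the CUDA Graph was not cleared and rebuilt when weights changed during an rsync/RDMA weight update, so that stale graph state is no longer reused after the update.

## Modifications
Extract the memory handling around weight updates in `fastdeploy/worker/gpu_model_runner.py`:
- The GDR release cache path clears the KV cache and CUDA Graph before the weight update and rebuilds them afterwards.
- The RDMA weight update path clears the CUDA Graph before the update and rebuilds it afterwards.
- After the weight update, reset `share_inputs`, the MTP model inputs, and the cached model/sampler output state.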

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

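Overall Assessment

The current implementation contains a null-object access that blocks weight updates on the GDR release cache path. Restore the MTP condition guard before merging.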

PaddlePaddle-bot · 2026-06-29T07:22:55Z

+
+            # Rebuild cache on model runner
+            if not self.enable_cache_manager_v1:
+                self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)


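🔴 Bug In non-MTP scenarios, this code unconditionally accesses self.proposer, which is None.

_init_speculative_proposer() sets self.proposer to None when spec_method is None, but this rebuild logic is gated only by rebuild_kv_cache and enable_cache_manager_v1. Whenever GDR weight updates are used with gdr_release_cache enabled, an ordinary model without speculative decoding and with cache manager v1 disabled raises AttributeError during the rebuild phase. The weight update then fails and the KV cache state cannot be restored for normal serving.

Suggested fix: keep the MTP condition from the old logic and initialize the proposer cache only when self.spec_method == SpecMethod.MTP; the main model cache is still initialized independently.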

if self.spec_method == SpecMethod.MTP:
    if not self.enable_cache_manager_v1:
        self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
self.initialize_kv_cache()
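
The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch only: `StubRunner` and `StubProposer` are hypothetical stand-ins, not FastDeploy's actual classes, and `SpecMethod` is re-declared locally with just the `MTP` member. The placement of the main-model `initialize_kv_cache()` call in the unguarded variant is also an assumption, since the diff fragment above shows only the proposer call.

```python
from enum import Enum
from typing import Optional


class SpecMethod(Enum):
    # Local stand-in for the real enum; only the member discussed here.
    MTP = "mtp"


class StubProposer:
    """Hypothetical proposer that records the block count it receives."""

    def __init__(self) -> None:
        self.num_blocks: Optional[int] = None

    def initialize_kv_cache(self, main_model_num_blocks: int) -> None:
        self.num_blocks = main_model_num_blocks


class StubRunner:
    """Hypothetical, heavily reduced runner used only to show the guard."""

    def __init__(self, spec_method: Optional[SpecMethod],
                 enable_cache_manager_v1: bool = False,
                 num_gpu_blocks: int = 8) -> None:
        self.spec_method = spec_method
        self.enable_cache_manager_v1 = enable_cache_manager_v1
        self.num_gpu_blocks = num_gpu_blocks
        # Mirrors the behavior described in the review: no proposer
        # exists unless a speculative method is configured.
        self.proposer = StubProposer() if spec_method is SpecMethod.MTP else None
        self.main_cache_ready = False

    def initialize_kv_cache(self) -> None:
        self.main_cache_ready = True

    def rebuild_unguarded(self) -> None:
        # Shape of the reviewed code: the proposer is touched regardless
        # of whether speculative decoding is enabled.
        if not self.enable_cache_manager_v1:
            self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()  # assumed placement, not shown in the diff

    def rebuild_guarded(self) -> None:
        # Shape of the suggested fix: only MTP runs touch the proposer;
        # the main model cache is rebuilt in every case.
        if self.spec_method == SpecMethod.MTP:
            if not self.enable_cache_manager_v1:
                self.proposer.initialize_kv_cache(main_model_num_blocks=self.num_gpu_blocks)
        self.initialize_kv_cache()


plain = StubRunner(spec_method=None)
try:
    plain.rebuild_unguarded()
    unguarded_failed = False
except AttributeError:
    # 'NoneType' object has no attribute 'initialize_kv_cache'
    unguarded_failed = True

plain.rebuild_guarded()          # succeeds without a proposer
mtp = StubRunner(spec_method=SpecMethod.MTP)
mtp.rebuild_guarded()            # proposer cache is initialized too
```

In this sketch the unguarded path fails before the main cache is rebuilt, which matches the review's point that the KV cache is left unrecovered after the exception.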

codecov-commenter · 2026-06-29T08:04:04Z

Codecov Report

❌ Patch coverage is 6.45161% with 29 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/gpu_model_runner.py	6.45%	29 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8085   +/-   ##
==========================================
  Coverage           ?   67.55%           
==========================================
  Files              ?      475           
  Lines              ?    66913           
  Branches           ?    10323           
==========================================
  Hits               ?    45202           
  Misses             ?    18836           
  Partials           ?     2875
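
The figures in this report can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions: that project coverage is computed as hits divided by the sum of hits, misses, and partials, and that the patch contains 31 coverable lines, a total inferred from the reported 29 missing lines and 6.45161% patch coverage rather than stated in the report.

```python
# Project-level numbers copied from the coverage diff above.
hits, misses, partials = 45202, 18836, 2875

total_lines = hits + misses + partials
project_coverage = 100 * hits / total_lines

# Patch-level numbers: 29 missing lines are reported; the total of 31
# is an inferred assumption (2 covered lines out of 31).
patch_missing = 29
patch_total = 31
patch_coverage = 100 * (patch_total - patch_missing) / patch_total

print(f"lines={total_lines} project={project_coverage:.2f}% patch={patch_coverage:.5f}%")
```

Under those assumptions the results agree with the reported 66913 lines, 67.55% project coverage, and 6.45161% patch coverage.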

Flag	Coverage Δ
GPU	`77.57% <6.45%> (?)`
XPU	`6.95% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.


liyonghua0910 added 2 commits June 29, 2026 06:36

[RL] enforce cuda graph clearing and rebuilding in rsync weight updat…

37c4c41

…e by rdma

[chore] update

05dc574

liyonghua0910 had a problem deploying to Metax_ci June 29, 2026 06:57 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 29, 2026

liyonghua0910 wants to merge 2 commits into PaddlePaddle:develop from liyonghua0910:develop+20260629_fix_rsync

3 participants