[Feature]report PD info to IM by ChowMingSing · Pull Request #8082 · PaddlePaddle/FastDeploy

ChowMingSing · 2026-06-26T07:03:20Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

- Add at least a tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run pre-commit before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 15:08:18

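📋 Review Summary

PR overview: Adds endpoints for IM to query FastDeploy/PD registration info, a ready health check, and a /fastdeploy/server/info reporting endpoint
Change scope: fastdeploy/entrypoints/openai/api_server.py
Impact tags: [APIServer] [PD Disaggregation]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/entrypoints/openai/api_server.py:960`	`dp_rank` (a string) is compared with an integer, so `is_master` is never set to 1
🔴 Bug	`fastdeploy/entrypoints/openai/api_server.py:985`	In async LLM mode `llm_engine` has no `.engine`, so the new info endpoint returns 500

📝 PR Convention Check

The title contains an official tag, but every section of the PR description is still a template placeholder or empty. Consider replacing it with the complete description below.

Suggested title (can be copied directly):

[APIServer] Report PD info to IM

Suggested PR description (can be copied directly):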

## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.

## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

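Overall assessment

The direction and scope of the new endpoints are clear, but the current implementation produces incorrect results in master detection and returns 500 under async LLM deployment. Fix the two functional issues above and add endpoint-level tests before merging.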

PaddlePaddle-bot · 2026-06-26T07:11:14Z

+            with open(fed_member_file, 'r') as f:
+                fed_member_list = f.read().strip().split(',')
+                if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
+                        dp_rank == 0:


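🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here always yields False.

When FED_MEMBER_FILE is configured, the current HOST_IP is first in the member list, and the DP rank is 0, is_master still stays 0, so the IM side cannot identify the master node.

Suggested fix: keep an integer rank for the logic checks and convert it to a string only when concatenating pod_name or writing the response.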

dp_rank = cfg.parallel_config.local_data_parallel_id
# Use str(dp_rank) where pod_name is concatenated
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
    is_master = 1
cfg_dict["dp_rank"] = str(dp_rank)
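The string-versus-integer pitfall can be reproduced in isolation. The following is a minimal sketch rather than the PR's code: `compute_is_master` is a hypothetical helper, the IP addresses are made up, and the comma-separated member format is taken from the diff above. It compares the first list entry directly instead of calling `list.index()`, which would raise `ValueError` if the host were missing from the list; that guard is an addition beyond the review's suggestion.

```python
def compute_is_master(fed_members: str, host_ip: str, dp_rank: int) -> int:
    """Return 1 if this node is first in the member list and has DP rank 0.

    Hypothetical helper for illustration; not part of FastDeploy.
    """
    members = fed_members.strip().split(",")
    # Compare the first entry directly: list.index() would raise ValueError
    # if host_ip were missing from the list.
    is_first = members[0] == host_ip
    # dp_rank stays an int here; convert with str() only when building strings.
    return 1 if is_first and dp_rank == 0 else 0


# The reported bug: a stringified rank never equals the integer 0.
assert (str(0) == 0) is False

# With an integer rank, the first member at rank 0 is recognized as master.
assert compute_is_master("10.0.0.1,10.0.0.2", "10.0.0.1", 0) == 1
assert compute_is_master("10.0.0.1,10.0.0.2", "10.0.0.2", 0) == 0
```

Keeping the rank as an integer until the point of serialization means only one place needs the `str()` conversion.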

PaddlePaddle-bot · 2026-06-26T07:11:14Z

+    cfg_dict["is_stopping"] = "running"
+    cfg_dict["is_master"] = is_master
+    cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
+    cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()


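🔴 Bug Accessing llm_engine.engine.resource_manager directly here makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.

In async mode, load_engine() sets the global llm_engine to AsyncLLM. The EngineServiceClient that AsyncLLM inherits from creates EngineService only in a child process, so the main-process object has no .engine attribute. The existing lifecycle code in this file also uses not isinstance(llm_engine, AsyncLLM) to distinguish the synchronous engine path.

Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query or control API, or return an explicit unavailable value in async mode. Do not read llm_engine.engine.resource_manager directly in the API server main process.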
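One way to realize the "explicit unavailable value" option is sketched below. This is an illustration under stated assumptions, not FastDeploy code: the stand-in classes, the `get_free_block_num` helper, and the `-1` sentinel are all hypothetical. Only the attribute chain `engine.resource_manager.available_block_num()` comes from the diff above.

```python
UNAVAILABLE = -1  # Assumed sentinel; the real API may prefer None or a string.


class _ResourceManagerStandIn:
    def available_block_num(self) -> int:
        return 128


class _InnerEngineStandIn:
    def __init__(self) -> None:
        self.resource_manager = _ResourceManagerStandIn()


class _SyncEngineStandIn:
    """Mimics the synchronous path, where `.engine` exists."""

    def __init__(self) -> None:
        self.engine = _InnerEngineStandIn()


class _AsyncEngineStandIn:
    """Mimics the async path: the main-process object has no `.engine`."""


def get_free_block_num(llm_engine) -> int:
    """Read the free block count if reachable, else return the sentinel."""
    engine = getattr(llm_engine, "engine", None)
    resource_manager = getattr(engine, "resource_manager", None)
    if resource_manager is None:
        return UNAVAILABLE
    return resource_manager.available_block_num()


sync_value = get_free_block_num(_SyncEngineStandIn())
async_value = get_free_block_num(_AsyncEngineStandIn())
```

The endpoint then reports a defined value in both modes instead of raising `AttributeError`. A cross-process status query, the other option the review mentions, would replace the sentinel branch.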

codecov-commenter · 2026-06-26T07:38:19Z

Codecov Report

❌ Patch coverage is 8.63309% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/entrypoints/openai/api_server.py	8.63%	127 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8082   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    67048           
  Branches           ?    10335           
==========================================
  Hits               ?    45187           
  Misses             ?    18990           
  Partials           ?     2871

Flag	Coverage Δ
GPU	`77.37% <8.63%> (?)`
XPU	`6.94% <0.00%> (?)`

PaddlePaddle-bot · 2026-06-27T04:11:49Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-07-01 19:35:51 UTC+08:00

CI report generated from the following code (updated every 30 minutes):
PR commit: a931d80 | Merge base: f4eda5a (branch: develop)

1 Required tasks: 7/10 passed

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
41(0)	41	35	6	0	0	0

Task	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue	High	Job
`Pre Commit`	PR issue	High	Job
`Approval`	Approval required	High	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

Case	Error summary
`Verify Code Coverage Threshold (80%)`	`api_server.py` patch coverage is 8.63%; 127 counted lines are uncovered, below the 80% threshold

Key logs:

GPU Patch Coverage Details:
fastdeploy/entrypoints/openai/api_server.py percent_covered=8.63309352517986
violation_lines=[131, 132, ..., 994]
total_num_lines=139, total_num_violations=127, total_percent_covered=8, num_changed_lines=224
Process completed with exit code 9.

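Root cause summary: the new IM/PD endpoints lack test coverage

The PR modifies only fastdeploy/entrypoints/openai/api_server.py, adding decode node polling, /register_info, /v2/health/ready, /fastdeploy/server/info, and related logic. The unit tests ran successfully, but coverage of the added/changed lines is only 8.63%, so the coverage check step fails on the patch coverage threshold.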
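The reported percentage can be reproduced from the counts in the key logs. The snippet below is a small worked check; it assumes the tool computes patch coverage as covered lines divided by counted lines, which matches the logged figure but is not confirmed by the source.

```python
# Values copied from the "GPU Patch Coverage Details" log lines.
total_num_lines = 139
total_num_violations = 127
threshold_percent = 80

covered_lines = total_num_lines - total_num_violations
percent_covered = 100 * covered_lines / total_num_lines

# Lines that would need coverage to reach the threshold (ceiling division).
required_lines = -(-threshold_percent * total_num_lines // 100)
additional_lines_needed = required_lines - covered_lines
```

Under that assumption, 12 of 139 counted lines are covered, and 112 covered lines would be needed to reach 80%.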

Suggested fixes:

Add unit tests for fastdeploy/entrypoints/openai/api_server.py that cover the normal and error branches of _fetch_decode_node_register_info, _poll_decode_nodes, launch_decode_node_poller, register_info, im_check_health, and im_report.
Use mocks/fixtures to cover logic that depends on environment variables, requests.get, llm_engine.cfg, resource_manager.available_block_num(), and _decode_nodes_register_info.
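The mocking approach suggested above could look roughly like the following sketch. It does not test the PR's actual functions, whose signatures are not shown here: `fetch_register_info` is a hypothetical stand-in that takes the HTTP getter as a parameter, and the IP, port, and payload are made up. In a real test, `unittest.mock.patch` would replace `requests.get` in the module under test instead.

```python
from unittest import mock


def fetch_register_info(http_get, ip: str, port: int, timeout: float = 2.0):
    """Hypothetical stand-in: fetch /register_info, return None on any error."""
    url = f"http://{ip}:{port}/register_info"
    try:
        response = http_get(url, timeout=timeout)
        response.raise_for_status()
        return response.json()
    except Exception:
        return None


def test_fetch_register_info_success():
    # Normal branch: the getter returns a response whose json() yields a dict.
    response = mock.Mock()
    response.json.return_value = {"role": "decode"}
    http_get = mock.Mock(return_value=response)

    assert fetch_register_info(http_get, "10.0.0.2", 8000) == {"role": "decode"}
    http_get.assert_called_once_with(
        "http://10.0.0.2:8000/register_info", timeout=2.0
    )


def test_fetch_register_info_failure():
    # Error branch: the getter raises, and the helper swallows the exception.
    http_get = mock.Mock(side_effect=ConnectionError("unreachable"))

    assert fetch_register_info(http_get, "10.0.0.2", 8000) is None


test_fetch_register_info_success()
test_fetch_register_info_failure()
```

Covering both the success and the exception path of each new function in this way is what moves patch coverage toward the threshold.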

Related changes: fastdeploy/entrypoints/openai/api_server.py:129, :141, :164, :363, :820, :869, :883

🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

Case	Error summary
`Check pre-commit`	New code in `api_server.py` does not satisfy the formatting rules; pre-commit outputs a formatter diff

Key logs:

-                os.getenv("POD_NAMESPACE", "None")
-                + "_"
-                + os.getenv("FD_POD_NAME", "None")
+            os.getenv("POD_NAMESPACE", "None")
+            + "_"
+            + os.getenv("FD_POD_NAME", "None")
...
formatter diff also points to fed_member_list.index(...) condition wrapping
Process completed with exit code 1.

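Root cause summary: the new code does not pass pre-commit formatting

The failure occurs in the Check pre-commit step. The formatter diff points to the pod_name concatenation block and the fed_member_list condition added by this PR, which indicates the commit was not formatted according to the repository's pre-commit configuration.

Suggested fixes:

Run pre-commit run --files fastdeploy/entrypoints/openai/api_server.py or pre-commit run -a locally and commit the auto-formatted result.
Pay particular attention to the indentation of the multi-line pod_name concatenation and the line wrapping of the fed_member_list.index(...) condition in api_server.py.

Related changes: fastdeploy/entrypoints/openai/api_server.py:836, :908, :959

🔴 Approval: approval required (confidence: high)

Analyzer: built-in cache (approval_required)

Failed cases:

Case	Error summary
`Approval`	This job requires manual approval; CI continues only after the approval is completed

Key logs:

Process completed with exit code 6.

Root cause summary: manual approval required

This workflow failed at the approval gate; it is not a code execution failure.

Suggested fix:

A maintainer with permission should complete the manual approval, then continue or re-trigger the relevant CI.

Related changes: none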

[Feature]report PD info to IM

a931d80

ChowMingSing had a problem deploying to Metax_ci June 26, 2026 07:03 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 26, 2026

ChowMingSing wants to merge 1 commit into PaddlePaddle:develop from ChowMingSing:feature-im-report-v2

3 participants