[Feature] report PD info to IM #8082

Open
ChowMingSing wants to merge 1 commit into PaddlePaddle:develop from ChowMingSing:feature-im-report-v2

ChowMingSing commented Jun 26, 2026

Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 15:08:18

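📋 Review summary

PR overview: adds endpoints that let IM query FastDeploy/PD registration info, a ready health check, and the /fastdeploy/server/info reporting endpoint
Scope of change: fastdeploy/entrypoints/openai/api_server.py
Affected tags: [APIServer] [PD Disaggregation]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:960 | dp_rank is a string compared with an integer, so is_master is never set to 1 |
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:985 | In async LLM mode llm_engine has no .engine, so the new info endpoint returns 500 |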

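📝 PR convention check

The title contains an official tag, but every section of the PR description is still a template placeholder or empty. Consider replacing it with the full description below.

Suggested title (can be copied directly):

  • [APIServer] Report PD info to IM
Suggested PR description (click to expand; can be copied directly)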
## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.

## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

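Overall assessment

The direction of the new endpoints and the scope of the change are clear, but the current implementation produces a wrong result in master detection and a 500 from the endpoint under async LLM deployment. Fix the two functional issues above and add endpoint-level tests before merging.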

with open(fed_member_file, 'r') as f:
fed_member_list = f.read().strip().split(',')
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
dp_rank == 0:

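🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here always evaluates to False.

When FED_MEMBER_FILE is configured, the current HOST_IP is first in the member list, and the DP rank is 0, is_master still stays 0, so the IM side cannot identify the master node.

Suggested fix: keep an integer rank for the logic check, and convert it to a string only when building pod_name or writing the response.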

dp_rank = cfg.parallel_config.local_data_parallel_id
# use str(dp_rank) where pod_name is built
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
    is_master = 1
cfg_dict["dp_rank"] = str(dp_rank)
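
To make the comparison issue concrete, here is a minimal standalone sketch, not the PR's code. The helper `is_master_node` and the sample IP addresses are hypothetical; it compares the first list element directly instead of calling `.index()`, which gives the same answer when the host is in the list and avoids the `ValueError` that `.index()` raises when it is not.

```python
def is_master_node(member_list, host_ip, dp_rank):
    """Return 1 when host_ip is the first member and dp_rank equals the int 0."""
    # In Python, "0" == 0 is False, so a string rank can never satisfy this check.
    first_member = member_list[0] if member_list else None
    return 1 if first_member == host_ip and dp_rank == 0 else 0


# Hypothetical member file content, split the same way as in the reviewed code.
members = "10.0.0.1,10.0.0.2".strip().split(",")

buggy = is_master_node(members, "10.0.0.1", str(0))  # rank already stringified
fixed = is_master_node(members, "10.0.0.1", 0)       # rank kept as an int

print("string rank:", buggy)
print("integer rank:", fixed)
```

With the string rank the function reports a non-master node even though the host is first in the list, which matches the behavior described in the review.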

cfg_dict["is_stopping"] = "running"
cfg_dict["is_master"] = is_master
cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()

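🔴 Bug Accessing llm_engine.engine.resource_manager directly here makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.

In async mode, load_engine() sets the global llm_engine to AsyncLLM. The EngineServiceClient that AsyncLLM inherits from creates EngineService only in a subprocess, so the main-process object has no .engine attribute. The existing lifecycle code in this file also uses not isinstance(llm_engine, AsyncLLM) to separate the synchronous engine path.

Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query/control API, or return an explicit unavailable value in async mode; do not read llm_engine.engine.resource_manager directly in the API server main process.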
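
The second option, returning an explicit unavailable value, could look roughly like the sketch below. Everything here is an assumption for illustration: the `Fake*` classes stand in for the real engine objects, the `-1` sentinel is an arbitrary choice, and the guard uses `getattr` only because `AsyncLLM` cannot be imported in a standalone example (the real code would more likely reuse the `isinstance(llm_engine, AsyncLLM)` check the review mentions).

```python
class FakeResourceManager:
    """Hypothetical stand-in exposing the method the endpoint reads."""

    def available_block_num(self):
        return 128


class FakeEngineCore:
    resource_manager = FakeResourceManager()


class FakeSyncEngine:
    """Stand-in for the synchronous engine, which has an .engine attribute."""

    engine = FakeEngineCore()


class FakeAsyncEngine:
    """Stand-in for the async client object, which has no .engine attribute."""


def free_block_num_or_unavailable(llm_engine, unavailable=-1):
    """Read free blocks when the engine core is local, else return a sentinel."""
    core = getattr(llm_engine, "engine", None)
    if core is None:
        # Async path: the engine core lives in another process.
        return unavailable
    return core.resource_manager.available_block_num()


print(free_block_num_or_unavailable(FakeSyncEngine()))
print(free_block_num_or_unavailable(FakeAsyncEngine()))
```

A sentinel keeps the endpoint responding in async mode, at the cost of the caller having to treat that value as "unknown" rather than "no free blocks".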

@codecov-commenter

codecov-commenter commented Jun 26, 2026

Codecov Report

❌ Patch coverage is 8.63309% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/entrypoints/openai/api_server.py | 8.63% | 127 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8082   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    67048           
  Branches           ?    10335           
==========================================
  Hits               ?    45187           
  Misses             ?    18990           
  Partials           ?     2871           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.37% <8.63%> (?) |
| XPU | 6.94% <0.00%> (?) |

@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-07-01 19:35:51 UTC+08:00

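The CI report is generated from the following code (refreshed every 30 minutes):
PR commit: a931d80 | Merge base: f4eda5a (branch: develop)


1 Required tasks: 7/10 passed

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 35 | 6 | 0 | 0 | 0 |

| Task | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |
| Pre Commit | PR issue | | Job |
| Approval | Approval required | | Job |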

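2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Verify Code Coverage Threshold (80%) | api_server.py patch coverage is 8.63%; 127 measured lines are uncovered, below the 80% threshold |

Key log: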

GPU Patch Coverage Details:
fastdeploy/entrypoints/openai/api_server.py percent_covered=8.63309352517986
violation_lines=[131, 132, ..., 994]
total_num_lines=139, total_num_violations=127, total_percent_covered=8, num_changed_lines=224
Process completed with exit code 9.
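  • Root cause summary: the new IM/PD endpoints lack test coverage

The PR only modifies fastdeploy/entrypoints/openai/api_server.py, adding decode node polling and the /register_info, /v2/health/ready, and /fastdeploy/server/info logic. The unit tests ran successfully, but coverage of the added/changed lines is only 8.63%, so the coverage check step fails directly on the patch coverage threshold.

Suggested fixes:

  1. Add unit tests for fastdeploy/entrypoints/openai/api_server.py that cover the normal and error branches of _fetch_decode_node_register_info, _poll_decode_nodes, launch_decode_node_poller, register_info, im_check_health, and im_report.
  2. Use mocks/fixtures to cover the logic that depends on environment variables, requests.get, llm_engine.cfg, resource_manager.available_block_num(), and _decode_nodes_register_info.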
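
As a rough illustration of the mock-based approach suggested above, the sketch below tests a hypothetical fetch helper on both its success and failure branches. The function `fetch_register_info`, its signature, and the sample payload are assumptions, not the PR's actual code; the HTTP getter is passed in as a parameter so the example runs without the `requests` package, whereas a real test would more likely patch `requests.get` in the `api_server` module.

```python
from unittest import mock


def fetch_register_info(http_get, ip, port, timeout=2.0):
    """Hypothetical helper: return the decode node's JSON, or None on failure."""
    try:
        resp = http_get(f"http://{ip}:{port}/register_info", timeout=timeout)
        if resp.status_code != 200:
            return None
        return resp.json()
    except Exception:
        # Any transport or decoding error is treated as "node not reachable".
        return None


def test_fetch_success():
    resp = mock.Mock(status_code=200)
    resp.json.return_value = {"role": "decode"}
    http_get = mock.Mock(return_value=resp)
    assert fetch_register_info(http_get, "10.0.0.1", 8000) == {"role": "decode"}
    http_get.assert_called_once_with(
        "http://10.0.0.1:8000/register_info", timeout=2.0
    )


def test_fetch_non_200():
    http_get = mock.Mock(return_value=mock.Mock(status_code=503))
    assert fetch_register_info(http_get, "10.0.0.1", 8000) is None


def test_fetch_connection_error():
    http_get = mock.Mock(side_effect=ConnectionError("refused"))
    assert fetch_register_info(http_get, "10.0.0.1", 8000) is None


for test in (test_fetch_success, test_fetch_non_200, test_fetch_connection_error):
    test()
```

The same pattern extends to the environment-variable and resource_manager dependencies: replace each external dependency with a `mock.Mock` (or `mock.patch.dict(os.environ, ...)` for environment variables) and assert on both the returned value and the call that was made.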

Related changes: fastdeploy/entrypoints/openai/api_server.py:129, :141, :164, :363, :820, :869, :883

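🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Check pre-commit | New code in api_server.py does not satisfy the formatting rules; pre-commit prints a formatter diff |

Key log: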

-                os.getenv("POD_NAMESPACE", "None")
-                + "_"
-                + os.getenv("FD_POD_NAME", "None")
+            os.getenv("POD_NAMESPACE", "None")
+            + "_"
+            + os.getenv("FD_POD_NAME", "None")
...
formatter diff also points to fed_member_list.index(...) condition wrapping
Process completed with exit code 1.
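  • Root cause summary: the new code does not pass pre-commit formatting

The failure occurs in the Check pre-commit step. The formatter diff points to the pod_name concatenation block and the fed_member_list condition added in this PR, which indicates that the committed content was not formatted according to the repository's pre-commit configuration.

Suggested fixes:

  1. Run pre-commit run --files fastdeploy/entrypoints/openai/api_server.py or pre-commit run -a locally, then commit the auto-formatted result.
  2. In api_server.py, check in particular the indentation of the multi-line pod_name concatenation and the wrapping of the fed_member_list.index(...) condition line.

Related changes: fastdeploy/entrypoints/openai/api_server.py:836, :908, :959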

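🔴 Approval: approval required (confidence: high)

Analyzer: built-in cache (approval_required)

Failed cases:

| Case | Error summary |
|---|---|
| Approval | This job requires manual approval; CI continues only after the approval is completed |

Key log: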

Process completed with exit code 6.
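  • Root cause summary: manual approval is required

The workflow failed at the approval gate; this is not a code execution failure.

Suggested fixes:

  1. Ask a maintainer with permission to complete the manual approval, then continue or re-trigger the related CI.

Related changes: none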
