Skip to content

[Cherry-Pick][Metric] Support custom metric labels (#7865)#7908

Merged
Jiang-Jia-Jun merged 8 commits into
PaddlePaddle:release/online/20260415from
liyonghua0910:release/online/20260415+20260520_metric_labels
May 28, 2026
Merged

[Cherry-Pick][Metric] Support custom metric labels (#7865)#7908
Jiang-Jia-Jun merged 8 commits into
PaddlePaddle:release/online/20260415from
liyonghua0910:release/online/20260415+20260520_metric_labels

Conversation

@liyonghua0910
Copy link
Copy Markdown
Collaborator

@liyonghua0910 liyonghua0910 commented May 25, 2026

Motivation

Re-implement PR #4480 on current develop branch. The original PR introduced MetricsManagerInterface to support custom labels (e.g., model_id) on Prometheus metrics, but the codebase has changed significantly since then (WorkMetricsManager removed, new v1/serving_chat.py added, internal_adapter_utils.py no longer imports metrics, etc.).

Modifications

  1. New file fastdeploy/metrics/interface.py: Define MetricsManagerInterface with 4 abstract methods: set_value, inc_value, dec_value, obs_value.

  2. fastdeploy/metrics/metrics.py:

    • MetricsManager inherits from MetricsManagerInterface
    • Parse FD_DEFAULT_METRIC_LABEL_VALUES env var; when set to a valid non-empty JSON dict, enable metric labels
    • _patch_labelnames(): add label keys from _default_labelvalues to all metrics' labelnames
    • Implement the 4 interface methods: when labels enabled, call metric.labels(**merged).set()/inc()/dec()/observe(); otherwise, call metric.set()/inc()/dec()/observe() directly
    • Handle set_cache_config_info(), record_zmq_stats(), init_zmq_metrics(), _init_speculative_metrics() with label support
  3. fastdeploy/envs.py: Add FD_DEFAULT_METRIC_LABEL_VALUES environment variable

  4. 14 call-site files: Migrate all main_process_metrics.<metric>.set()/inc()/dec()/observe() calls to set_value()/inc_value()/dec_value()/obs_value()

  5. fastdeploy/metrics/metrics_middleware.py: Migrate HTTP metric .labels().inc()/.observe() to inc_value()/obs_value() with labelvalues parameter

Usage or Command

# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'

# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'

When not set (default {}), behavior is identical to current code — no labels are added.

An example of metrics text when default label values are enabled:

# HELP fastdeploy:spec_decode_draft_single_head_acceptance_rate Single head acceptance rate of speculative decoding
# TYPE fastdeploy:spec_decode_draft_single_head_acceptance_rate gauge
fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0",model_id="qwen3-30b",version="v2"} 0.9

Accuracy Tests

No model output changes. This only affects Prometheus metric formatting.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. This PR only changes Prometheus metric label routing logic with no model output changes; unit tests for the metrics interface are not included in this PR and can be added as a follow-up.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…e interface

Introduce MetricsManagerInterface with unified set_value/inc_value/dec_value/obs_value methods.
When FD_DEFAULT_METRIC_LABEL_VALUES is set to a valid non-empty JSON dict, metric labels
(e.g. model_id) are automatically applied. Otherwise, operations fall back to the raw
prometheus_client calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 25, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot
Copy link
Copy Markdown

PaddlePaddle-bot commented May 25, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 16:37:15

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

当前 Required 任务 6/7 通过,仍有 1 个 Required 失败任务需要处理;失败原因为覆盖率门禁未达标,建议补充 metrics 相关单测后重跑 CI。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
20(0) 20 19 1 0 0 0

2 任务状态汇总

2.1 Required任务 : 6/7 通过

必选任务阻塞合并,失败需优先处理。

状态 任务 耗时 根因 修复建议 日志 重跑
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage 1h17m PR问题:metrics.py diff覆盖率61%,总覆盖率78% 补测MetricsManager label/ZMQ/cache分支 Job -
其余 6 个必选任务通过 - - - - -

2.2 可选任务 — 13/13 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
其余 13 个可选任务通过 - - -

3 失败详情(仅 required)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage — 代码覆盖率门禁(置信度: 高)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

  • 状态: ❌ 失败
  • 错误类型: 代码规范 / 覆盖率门禁
  • 置信度: 高
  • 根因摘要: metrics.py diff覆盖率61%,总覆盖率78%
  • 分析器: ci_analyze_unittest_fastdeploy(覆盖率门禁分支)

失败用例: 无。日志显示 TEST_EXIT_CODE: 0All tests passed,失败发生在覆盖率校验步骤。

根因详情:
本次 PR 新增/修改了 fastdeploy/metrics/metrics.py 中自定义 metric label 支持、set_value/inc_value/dec_value/obs_value 封装、ZMQ/cache/speculative metrics 适配等逻辑。diff coverage 报告显示该文件仅 61.0%,缺失 39 行,导致本 PR 总 diff 覆盖率为 78%,低于 diff-cover --fail-under=80 的 CI 门禁。

关键日志:

All tests passed
Coverage generation failed (exit code 9)
fastdeploy/metrics/metrics.py (61.0%): Missing lines 639-646,648-650,659-660,663,712-715,729,737,745,809-810,827-830,833-837,840,850,861-863,874,881
Total:   178 lines
Missing: 39 lines
total_percent_covered: 78

修复建议:

  1. tests/metrics/ 下新增或扩展 MetricsManager 单测,覆盖 fastdeploy/metrics/metrics.py L631-L650 的 _patch_labelnames() 有/无 labelnames 分支,以及 L657-L663 的 FD_DEFAULT_METRIC_LABEL_VALUES JSON 解析成功/失败分支。
  2. 补测 L698-L747 的 _get_metric_and_labels()inc_value()dec_value()obs_value() 在启用默认 label 和传入额外 labelvalues 时会正确合并并调用 metric.labels(**merged)
  3. 补测 L807-L840 的 init_zmq_metrics() / record_zmq_stats()、L842-L876 的 set_cache_config_info(),以及 L878-L881 的 speculative metrics 注册路径,覆盖 diff 报告列出的剩余缺失行。

修复建议摘要: 补测MetricsManager label/ZMQ/cache分支

关联变更: fastdeploy/metrics/metrics.pyfastdeploy/metrics/interface.pyfastdeploy/metrics/metrics_middleware.py

链接: 查看日志


说明:本次 Required 失败任务命中历史分析缓存(cache hit=1, miss=0),未重新触发深度日志分析;已额外读取 PR 相关 metrics 代码与测试文件核对覆盖率缺口。

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 73.03371% with 48 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20260415@40d3f3e). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/metrics/metrics.py 52.00% 39 Missing and 9 partials ⚠️
Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #7908   +/-   ##
==========================================================
  Coverage                           ?   72.94%           
==========================================================
  Files                              ?      388           
  Lines                              ?    54096           
  Branches                           ?     8480           
==========================================================
  Hits                               ?    39461           
  Misses                             ?    11913           
  Partials                           ?     2722           
Flag Coverage Δ
GPU 72.94% <73.03%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-26 20:59:14

📋 Review 摘要

PR 概述:新增 MetricsManagerInterface 抽象层,支持通过 FD_DEFAULT_METRIC_LABEL_VALUES 环境变量为所有 Prometheus 指标附加自定义标签(如 model_id),并将全部调用点从直接属性访问迁移至统一接口方法。
变更范围fastdeploy/metrics/fastdeploy/entrypoints/fastdeploy/engine/fastdeploy/cache_manager/fastdeploy/output/fastdeploy/splitwise/fastdeploy/envs.py
影响面 Tag[Feature] [APIServer] [Engine] [KVCache] [DataProcessor] [PD Disaggregation]

问题

级别 文件 概述
🟡 建议 fastdeploy/metrics/metrics.py:795 spec_decode_draft_single_head_acceptance_rate 指标名称 Breaking Change,旧监控面板/告警规则将失效
📝 PR 规范 标题使用了非官方 Tag [Metric];Cherry-Pick Checklist 条目未勾选

📝 PR 规范检查

存在两处规范问题:①标题 [Metric] 不在官方 Tag 列表中,应改为 [Feature];② Checklist 中 Cherry-Pick 条目未勾选(本 PR 确为 Cherry-Pick 到 release 分支)。

标题建议(可直接复制):

  • [Cherry-Pick][Feature] Support custom metric labels (#7865)
PR 描述建议(点击展开,可直接复制)
## Motivation

Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).

## Modifications

1. **New file `fastdeploy/metrics/interface.py`**: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.
2. **`fastdeploy/metrics/metrics.py`**: `MetricsManager` inherits from `MetricsManagerInterface`; parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels via `_patch_labelnames()`; implement the 4 interface methods with label-aware dispatch.
3. **`fastdeploy/envs.py`**: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable.
4. **14 call-site files**: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`.
5. **`fastdeploy/metrics/metrics_middleware.py`**: Migrate HTTP metric calls to `inc_value()/obs_value()` with `labelvalues` parameter.
6. ⚠️ **Breaking Change**: `spec_decode_draft_single_head_acceptance_rate` is refactored from N separate Gauges (named `fastdeploy:spec_decode_draft_single_head_acceptance_rate_0`, `..._1`, ...) to a single Gauge with `head` label (`fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0"}`). Existing Prometheus dashboards and alert rules referencing the old metric names need to be updated.

## Usage or Command

```bash
# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'

# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'
```

When not set (default `{}`), behavior is identical to current code — no labels are added.

## Accuracy Tests

N/A — This only affects Prometheus metric formatting, no model output changes.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Unit tests for the metrics interface are not included in this PR and can be added as a follow-up.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

总体评价

整体实现思路清晰,接口抽象合理,通过 _patch_labelnames() 方案避免了修改类级别数据。需要关注投机解码指标的 Breaking Change 并在描述中明确记录,以便使用方更新 Dashboard/Alert 规则。

for metric_name, config in patched_spec_metrics.items():
# For Gauge metrics, automatically add multiprocess_mode="livesum"
kwargs = config["kwargs"].copy()
if config["type"] == Gauge and "multiprocess_mode" not in kwargs:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 spec_decode_draft_single_head_acceptance_rate 指标名称 Breaking Change

原实现注册 N 个独立 Gauge,指标名格式为 fastdeploy:spec_decode_draft_single_head_acceptance_rate_0, ..._1 等;新实现改为单个 Gauge + head label,指标名变为 fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0"} 等。

这是 Prometheus 指标名称的 Breaking Change,会导致:

  • 已有监控 Dashboard 中查询 ..._0 / ..._1 等旧指标名的面板失效
  • 基于旧指标名的告警规则失效

建议在 PR 描述中明确说明此 Breaking Change,并提供 Dashboard/Alert 规则的迁移指引;或在 release 版本说明中注明。

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 4486230 into PaddlePaddle:release/online/20260415 May 28, 2026
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants