[Cherry-Pick][Metric] Support custom metric labels (#7865) #7908
…e interface

Introduce `MetricsManagerInterface` with unified `set_value`/`inc_value`/`dec_value`/`obs_value` methods. When `FD_DEFAULT_METRIC_LABEL_VALUES` is set to a valid non-empty JSON dict, metric labels (e.g. `model_id`) are automatically applied. Otherwise, operations fall back to the raw `prometheus_client` calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for your contribution!
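CI report generated from the following code (updated every 30 minutes):

1 Task overview
6 of 7 Required tasks currently pass; 1 failed Required task still needs attention. The failure is caused by the coverage gate not being met. Adding metrics-related unit tests and re-running CI is recommended.

2 Task status summary
2.1 Required tasks: 6/7 passed
2.2 Optional tasks: 13/13 passed

3 Failure details (Required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: code coverage gate (confidence: high)
Failed test cases: none. The log shows
Root cause details:
Key log:
Suggested fix:
Fix summary: add tests for the MetricsManager label/ZMQ/cache branches
Related changes: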
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/online/20260415 #7908 +/- ##
==========================================================
Coverage ? 72.94%
==========================================================
Files ? 388
Lines ? 54096
Branches ? 8480
==========================================================
Hits ? 39461
Misses ? 11913
Partials ? 2722
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-26 20:59:14
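📋 Review Summary
PR overview: Adds a `MetricsManagerInterface` abstraction layer, supports attaching custom labels (such as `model_id`) to all Prometheus metrics through the `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable, and migrates every call site from direct attribute access to the unified interface methods.
Scope of changes: fastdeploy/metrics/, fastdeploy/entrypoints/, fastdeploy/engine/, fastdeploy/cache_manager/, fastdeploy/output/, fastdeploy/splitwise/, fastdeploy/envs.py
Impact tags: [Feature] [APIServer] [Engine] [KVCache] [DataProcessor] [PD Disaggregation]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/metrics/metrics.py:795 | Breaking change to the `spec_decode_draft_single_head_acceptance_rate` metric name; existing dashboards and alert rules will stop working |
| 📝 PR conventions | - | The title uses the unofficial tag [Metric]; the Cherry-Pick checklist item is unchecked |
📝 PR Convention Check
There are two convention issues: (1) the title tag [Metric] is not in the official tag list and should be changed to [Feature]; (2) the Cherry-Pick item in the checklist is unchecked, although this PR is in fact a cherry-pick to a release branch.
Suggested title (ready to copy):
[Cherry-Pick][Feature] Support custom metric labels (#7865)
Suggested PR description (ready to copy):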
## Motivation
Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).
## Modifications
1. **New file `fastdeploy/metrics/interface.py`**: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.
2. **`fastdeploy/metrics/metrics.py`**: `MetricsManager` inherits from `MetricsManagerInterface`; parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels via `_patch_labelnames()`; implement the 4 interface methods with label-aware dispatch.
3. **`fastdeploy/envs.py`**: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable.
4. **14 call-site files**: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`.
5. **`fastdeploy/metrics/metrics_middleware.py`**: Migrate HTTP metric calls to `inc_value()/obs_value()` with `labelvalues` parameter.
6. ⚠️ **Breaking Change**: `spec_decode_draft_single_head_acceptance_rate` is refactored from N separate Gauges (named `fastdeploy:spec_decode_draft_single_head_acceptance_rate_0`, `..._1`, ...) to a single Gauge with `head` label (`fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0"}`). Existing Prometheus dashboards and alert rules referencing the old metric names need to be updated.
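
Items 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the method signatures, the name-based metric lookup, `SketchMetricsManager`, and `StubMetric` (a stand-in that only mimics the `labels()/set()/inc()/dec()/observe()` shape of a `prometheus_client` metric) are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class MetricsManagerInterface(ABC):
    """The four unified operations; parameter names are assumptions."""

    @abstractmethod
    def set_value(self, name, value, labelvalues=None): ...

    @abstractmethod
    def inc_value(self, name, value=1, labelvalues=None): ...

    @abstractmethod
    def dec_value(self, name, value=1, labelvalues=None): ...

    @abstractmethod
    def obs_value(self, name, value, labelvalues=None): ...


class StubMetric:
    """Hypothetical stand-in for a metric object (not prometheus_client)."""

    def __init__(self, samples=None, key=()):
        self.samples = {} if samples is None else samples  # label tuple -> value
        self._key = key

    def labels(self, **labelvalues):
        # Children share the parent's storage, keyed by their sorted label pairs.
        return StubMetric(self.samples, tuple(sorted(labelvalues.items())))

    def set(self, value):
        self.samples[self._key] = value

    def inc(self, amount=1):
        self.samples[self._key] = self.samples.get(self._key, 0) + amount

    def dec(self, amount=1):
        self.inc(-amount)

    def observe(self, value):
        self.samples.setdefault(self._key, []).append(value)


class SketchMetricsManager(MetricsManagerInterface):
    def __init__(self, metrics, default_labelvalues=None):
        self._metrics = metrics
        self._default_labelvalues = dict(default_labelvalues or {})

    def _resolve(self, name, labelvalues):
        metric = self._metrics[name]
        # Per-call label values are merged over the configured defaults.
        merged = {**self._default_labelvalues, **(labelvalues or {})}
        # With no labels at all, fall back to calling the metric directly.
        return metric.labels(**merged) if merged else metric

    def set_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).set(value)

    def inc_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).inc(value)

    def dec_value(self, name, value=1, labelvalues=None):
        self._resolve(name, labelvalues).dec(value)

    def obs_value(self, name, value, labelvalues=None):
        self._resolve(name, labelvalues).observe(value)


# "demo_gauge" is a made-up metric name used only for this example.
labeled = SketchMetricsManager({"demo_gauge": StubMetric()}, {"model_id": "qwen3-30b"})
labeled.set_value("demo_gauge", 3)
labeled.inc_value("demo_gauge")

plain = SketchMetricsManager({"demo_gauge": StubMetric()})
plain.dec_value("demo_gauge", 2)
```

Routing every call through one resolver keeps the labeled and unlabeled paths in a single place, which is presumably why the call sites are migrated to the four interface methods instead of touching metric attributes directly.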
## Usage or Command
```bash
# Enable custom labels on all metrics
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b"}'
# Or with multiple labels
export FD_DEFAULT_METRIC_LABEL_VALUES='{"model_id":"qwen3-30b","version":"v2"}'
```
When not set (default `{}`), behavior is identical to current code — no labels are added.
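
The "valid non-empty JSON dict, otherwise fall back" rule can be sketched as below. The helper name `parse_default_labelvalues` and the choice to coerce keys and values to strings are assumptions for illustration; the PR's actual parsing code may differ.

```python
import json
import os


def parse_default_labelvalues(raw):
    """Return a label dict, or {} when raw is unset, invalid, or not a non-empty dict."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return {}
    if not isinstance(parsed, dict) or not parsed:
        return {}
    # Assumption: label names and values are treated as strings.
    return {str(key): str(value) for key, value in parsed.items()}


# An empty result means "labels disabled": metrics are used without extra labels.
default_labelvalues = parse_default_labelvalues(
    os.environ.get("FD_DEFAULT_METRIC_LABEL_VALUES", "{}")
)
```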
## Accuracy Tests
N/A — This only affects Prometheus metric formatting, no model output changes.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Unit tests for the metrics interface are not included in this PR and can be added as a follow-up.
- [ ] Provide accuracy results.
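- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The overall implementation is clear and the interface abstraction is sound; the `_patch_labelnames()` approach avoids modifying class-level data. The breaking change to the speculative decoding metric needs attention and should be documented explicitly in the description so that users can update their dashboard and alert rules.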
    for metric_name, config in patched_spec_metrics.items():
        # For Gauge metrics, automatically add multiprocess_mode="livesum"
        kwargs = config["kwargs"].copy()
        if config["type"] == Gauge and "multiprocess_mode" not in kwargs:
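🟡 Suggestion: breaking change to the `spec_decode_draft_single_head_acceptance_rate` metric name
The original implementation registered N separate Gauges, with metric names of the form `fastdeploy:spec_decode_draft_single_head_acceptance_rate_0`, `..._1`, and so on. The new implementation uses a single Gauge with a `head` label, so the series become `fastdeploy:spec_decode_draft_single_head_acceptance_rate{head="0"}` and so on.
This is a breaking change to Prometheus metric names and will cause:
- Panels in existing monitoring dashboards that query the old names (`..._0`, `..._1`, etc.) to stop working
- Alert rules based on the old metric names to stop working

Recommendation: state this breaking change explicitly in the PR description and provide migration guidance for dashboard and alert rules, or note it in the release notes.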
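
As one possible form of migration guidance, the sketch below rewrites old per-head metric names inside a query expression into the new `head`-label selector. The helper `migrate_query` is hypothetical and deliberately narrow: it only handles bare metric names, so an old name that is already followed by its own `{...}` selector would need to be merged by hand.

```python
import re

BASE = "fastdeploy:spec_decode_draft_single_head_acceptance_rate"

# Matches the old per-head names, e.g. "<BASE>_0", and captures the head index.
_OLD_NAME = re.compile(rf"\b{re.escape(BASE)}_(\d+)\b")


def migrate_query(expr):
    """Replace each old '<BASE>_<n>' name with '<BASE>{head="<n>"}'."""
    return _OLD_NAME.sub(lambda m: f'{BASE}{{head="{m.group(1)}"}}', expr)


old_rule = f"avg_over_time({BASE}_0[5m]) < 0.5"
new_rule = migrate_query(old_rule)
```

Expressions that do not mention the old names pass through unchanged, so the helper can be run over a whole set of dashboard or alert expressions.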
4486230
into
PaddlePaddle:release/online/20260415
Motivation
Re-implement PR #4480 on current develop branch. The original PR introduced `MetricsManagerInterface` to support custom labels (e.g., `model_id`) on Prometheus metrics, but the codebase has changed significantly since then (`WorkMetricsManager` removed, new `v1/serving_chat.py` added, `internal_adapter_utils.py` no longer imports metrics, etc.).

Modifications
- New file `fastdeploy/metrics/interface.py`: Define `MetricsManagerInterface` with 4 abstract methods: `set_value`, `inc_value`, `dec_value`, `obs_value`.
- `fastdeploy/metrics/metrics.py`:
  - `MetricsManager` inherits from `MetricsManagerInterface`
  - Parse `FD_DEFAULT_METRIC_LABEL_VALUES` env var; when set to a valid non-empty JSON dict, enable metric labels
  - `_patch_labelnames()`: add label keys from `_default_labelvalues` to all metrics' `labelnames`
  - Implement the 4 interface methods: when labels are enabled, call `metric.labels(**merged).set()/inc()/dec()/observe()`; otherwise, call `metric.set()/inc()/dec()/observe()` directly
  - Label support in `set_cache_config_info()`, `record_zmq_stats()`, `init_zmq_metrics()`, `_init_speculative_metrics()`
- `fastdeploy/envs.py`: Add `FD_DEFAULT_METRIC_LABEL_VALUES` environment variable
- 14 call-site files: Migrate all `main_process_metrics.<metric>.set()/inc()/dec()/observe()` calls to `set_value()/inc_value()/dec_value()/obs_value()`
- `fastdeploy/metrics/metrics_middleware.py`: Migrate HTTP metric `.labels().inc()/.observe()` to `inc_value()/obs_value()` with `labelvalues` parameter

Usage or Command
When not set (default `{}`), behavior is identical to current code: no labels are added.

An example of metrics text when default label values are enabled:
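
As a rough, hand-rolled illustration only (not output captured from FastDeploy), the sketch below formats one sample line in the Prometheus text exposition style with and without default labels. The metric name `fastdeploy:demo_gauge`, the value, and the helper `format_sample` are made up for the example.

```python
def format_sample(name, labels, value):
    """Format one sample line as 'name{k="v",...} value' (labels optional)."""

    def escape(text):
        # Label values escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if labels:
        body = ",".join(f'{key}="{escape(str(val))}"' for key, val in labels.items())
        return f"{name}{{{body}}} {float(value)}"
    return f"{name} {float(value)}"


# Hypothetical metric, shown with and without the default label values.
without_labels = format_sample("fastdeploy:demo_gauge", {}, 3)
with_labels = format_sample("fastdeploy:demo_gauge", {"model_id": "qwen3-30b"}, 3)
```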
Accuracy Tests
No model output changes. This only affects Prometheus metric formatting.
Checklist
- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.