[Speculative Decoding]【Hackathon 10th Spring No.54】hybrid_mtp_ngram end-to-end validation by NKNaN · Pull Request #7849 · PaddlePaddle/FastDeploy

NKNaN · 2026-05-19T03:49:38Z

Motivation

PaddlePaddle/community#1372

Modifications

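cudagraph compatibility (ngram_match.cu, ngram_match_mixed.cu): at the end of the gather in Phase 2 of the ngram match kernel, when pad_to_max=True (that is, when target cudagraph is enabled), seq_lens_this_time is always padded to num_speculative_tokens + 1, and the missing positions are filled with the last valid draft token as a placeholder.
Operator interface (ngram_match_mixed.cu, cpp_extensions.cc): input_ids/input_ids_len become token_ids_all/prompt_lens; pre_ids is kept for now and is expected to be removed in the next PR.
Copy elimination in the Python caller (mtp_cuda.py): the two .cuda() calls in _extend_draft_token_with_ngram_match are replaced with tensors that already reside on the GPU.
Code cleanup (mtp.py): removed the .cpu() D→H copy and the input_ids_cpu/input_ids_len writes in insert_tasks_v1.
ProposerInputBatch changes (input_batch.py): token_ids_all now references the target tensor instead of cloning it; the redundant fields input_ids_cpu/input_ids_len and their maintenance in swap/reset are removed.
New hybrid E2E test (test_ernie_21b_mtp_ngram.py): covers overlap + cudagraph + logprob.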
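
The switch of token_ids_all from a clone to a reference can be illustrated with a small sketch. It uses NumPy arrays as a stand-in for the GPU tensors, and the dictionary `target_batch` is a hypothetical placeholder, not the actual ProposerInputBatch code:

```python
import numpy as np

# Hypothetical stand-in for the target model's input batch.
target_batch = {"token_ids_all": np.zeros((2, 8), dtype=np.int64)}

# Clone: a private copy that must be re-synchronised after every update.
cloned = target_batch["token_ids_all"].copy()
# Reference: the same underlying buffer, no copy and no re-sync.
shared = target_batch["token_ids_all"]

# The target side writes a new token in place.
target_batch["token_ids_all"][0, 0] = 42

print(cloned[0, 0])  # the clone still holds the stale value
print(shared[0, 0])  # the reference sees the in-place update
```

Sharing the buffer only works while the target side updates it in place; if the target rebinds the key to a new tensor, the reference has to be refreshed, which is presumably why the reference is re-taken in the reset path.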

Usage or Command

N/A

Accuracy Tests

tests/e2e/test_ernie_21b_mtp_ngram.py

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-19T03:49:48Z

Thanks for your contribution!

codecov-commenter · 2026-05-19T05:23:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@6bdbdc9). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7849   +/-   ##
==========================================
  Coverage           ?   68.82%           
==========================================
  Files              ?      467           
  Lines              ?    65084           
  Branches           ?     9980           
==========================================
  Hits               ?    44793           
  Misses             ?    17451           
  Partials           ?     2840

Flag	Coverage Δ
GPU	`78.21% <100.00%> (?)`
XPU	`19.87% <0.00%> (?)`

Copilot

Pull request overview

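This PR performs end-to-end validation and code cleanup for the hybrid_mtp_ngram (Hybrid MTP + Ngram) path: it unifies the operator interface parameter semantics (migrating from input_ids/input_ids_len to token_ids_all/prompt_lens), removes unnecessary D2H/H2D copies from the MTP hybrid path, and adds E2E coverage for the overlap + cudagraph + logprob scenario.

Changes:

Updates the hybrid_mtp_ngram CUDA operator interface and internal implementation: the prompt search source is now token_ids_all + prompt_lens.
Removes the CPU buffers and .cpu()/.cuda() copies related to input_ids_cpu/input_ids_len from the MTP hybrid path, and adjusts the ProposerInputBatch initialization/reset logic accordingly.
Updates the related unit tests and adds a hybrid MTP-Ngram E2E test case for ERNIE 21B.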

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

File	Description
tests/spec_decode/test_ngram_gpu_kernel.py	Updates the CPU reference implementation and data construction to match the `token_ids_all/prompt_lens` interface
tests/operators/test_hybrid_mtp_ngram.py	Updates the operator unit-test input fields and comments to match the new interface
tests/e2e/test_ernie_21b_mtp_ngram.py	Adds E2E coverage for hybrid MTP-Ngram (stream/non-stream, speculate_metrics, logprobs)
fastdeploy/worker/input_batch.py	`ProposerInputBatch` no longer maintains `input_ids_cpu/input_ids_len`; `token_ids_all` now references the target batch directly
fastdeploy/spec_decode/mtp.py	Removes the writes to `input_ids_len` and `input_ids_cpu` and the D2H copy in the insert/prepare stages
fastdeploy/spec_decode/mtp_cuda.py	The hybrid ngram extension call now uses the on-GPU `token_ids_all/prompt_lens` directly
custom_ops/gpu_ops/speculate_decoding/draft_model/ngram_match_mixed.cu	CUDA and CPU paths both switch to `token_ids_all/prompt_lens`; kernel parameter meanings are updated
custom_ops/gpu_ops/cpp_extensions.cc	Updates the `HybridMtpNgram` C++ declaration signature to match

Comments suppressed due to low confidence (1)

tests/e2e/test_ernie_21b_mtp_ngram.py:259

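This does a strict dict equality comparison (==) on speculate_metrics. If the floating-point values returned by the server differ by rounding, or if the field order or additional fields change slightly, the test case becomes unstable. If the goal is to guard key behavior against regressions, consider comparing integer statistics exactly, comparing floating-point fields such as accept_ratio/average_accept_length with a tolerance, or managing updatable baseline data through BaselineManager.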

    # Baseline comparison — exact match against the values captured in the reference environment.
    if BASELINE_SPECULATE_METRICS is not None:
        assert speculate_metrics == BASELINE_SPECULATE_METRICS, (
            f"speculate_metrics mismatch\n"
            f"got:      {json.dumps(speculate_metrics, indent=2)}\n"
            f"baseline: {json.dumps(BASELINE_SPECULATE_METRICS, indent=2)}"
        )
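
A tolerance-based comparison along the lines suggested above might look like the following sketch. The helper name `compare_metrics`, the `FLOAT_FIELDS` set, and the tolerance value are assumptions for illustration; the real speculate_metrics schema may contain other fields:

```python
import math

# Assumed set of floating-point fields; taken from the names mentioned in the review.
FLOAT_FIELDS = {"accept_ratio", "average_accept_length"}


def compare_metrics(got, baseline, rel_tol=1e-3):
    """Return a list of mismatch messages; an empty list means the metrics match.

    Integer statistics are compared exactly, float fields with a relative
    tolerance, and extra fields present only in `got` are ignored.
    """
    problems = []
    for key, expected in baseline.items():
        if key not in got:
            problems.append(f"missing field: {key}")
            continue
        actual = got[key]
        if key in FLOAT_FIELDS:
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                problems.append(f"{key}: got {actual}, baseline {expected}")
        elif actual != expected:
            problems.append(f"{key}: got {actual}, baseline {expected}")
    return problems


baseline = {"accepted_tokens": 10, "accept_ratio": 0.5}
got = {"accepted_tokens": 10, "accept_ratio": 0.50001, "extra_field": 1}
problems = compare_metrics(got, baseline)
assert not problems, "\n".join(problems)
```

Returning the list of mismatches rather than a bare boolean keeps the assertion message as informative as the original json.dumps output.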

PaddlePaddle-bot · 2026-05-19T10:09:17Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-21 09:41:26

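This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 55f2706
Merge base: b2fc2c6 (branch: develop)
View full diff
CI details

1 Job overview

Two required jobs failed: Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage did not meet the coverage threshold, and Approval needs manual approval. Both must be resolved before merging.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
42(0)	42	37	5	0	0	0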

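2 Job status summary

Log column: failed jobs link directly to the log; running jobs link to the corresponding Job.

2.1 Required jobs: 8/10 passed

Required jobs block merging; failures should be handled first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	1h23m	PR issue: diff coverage is 50%, input_batch.py:973 is not covered	Add a test for the reset_model_inputs branch	Job	-
❌	`Approval`	11s	Approval required	Obtain manual approval	Job	-
✅	The remaining 8 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 29/32 passed

Optional jobs do not block merging; failures are for reference only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	16m24s	Job	-
❌	`CI_HPU`	1h4m	Job	-
❌	`Trigger Jenkins for PR`	7m33s	Job	-
✅	The remaining 29 optional jobs passed	-	-	-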

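3 Failure details (required only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage failure (confidence: high)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage

Status: ❌ Failed
Error type: coverage failure
Confidence: high
Root cause summary: diff coverage is 50%, input_batch.py:973 is not covered
Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases: none (unit tests passed; the failure occurred in the coverage threshold check)

Root cause details:
The unit-test stage passed, but the Verify Code Coverage Threshold (80%) step returned COVERAGE_EXIT_CODE=9. The coverage artifact diff_coverage.json shows that this PR's diff coverage is only 50%, below the 80% threshold. The only uncovered line is fastdeploy/worker/input_batch.py:973, which is the branch in ProposerInputBatch.reset_model_inputs() taken on CUDA when token_ids_all exists.

Key logs: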

All tests passed
Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
{"src_stats":{"fastdeploy/worker/input_batch.py":{"percent_covered":50.0,"violation_lines":[973],"covered_lines":[773],"violations":[[973,null]]}},"total_percent_covered":50,"num_changed_lines":461}
##[error]Process completed with exit code 9.
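
The 50% figure can be reproduced from the JSON in the log above. The sketch below assumes the percentage is simply covered lines divided by covered plus violation lines, which matches the reported value here; the actual threshold script in the CI may compute it differently:

```python
import json

# The GPU patch coverage payload, copied from the log above.
RAW = """
{"src_stats": {"fastdeploy/worker/input_batch.py": {
    "percent_covered": 50.0, "violation_lines": [973],
    "covered_lines": [773], "violations": [[973, null]]}},
 "total_percent_covered": 50, "num_changed_lines": 461}
"""

THRESHOLD = 80  # percent, from the "Verify Code Coverage Threshold (80%)" step


def diff_coverage(report):
    """Percentage of measured changed lines that were executed by tests."""
    covered = 0
    missed = 0
    for stats in report["src_stats"].values():
        covered += len(stats["covered_lines"])
        missed += len(stats["violation_lines"])
    measured = covered + missed
    # With no measurable changed lines there is nothing to fail on.
    return 100.0 * covered / measured if measured else 100.0


report = json.loads(RAW)
pct = diff_coverage(report)
passed = pct >= THRESHOLD
print(f"diff coverage {pct:.1f}% (threshold {THRESHOLD}%), passed={passed}")
```

Only two of the changed lines are measurable, so a single uncovered line is enough to drop the diff coverage to 50%.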

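Code context check:

Confirmed that fastdeploy/worker/input_batch.py:973 is in the CUDA branch of ProposerInputBatch.reset_model_inputs(): self.token_ids_all = self.target_model_input_batch["token_ids_all"].
Found a candidate location for the additional test: tests/worker/test_gpu_model_runner.py; a new tests/worker/test_input_batch.py is also an option.

Suggested fix:

Add a unit test for ProposerInputBatch.reset_model_inputs() in tests/worker/test_gpu_model_runner.py or in a new tests/worker/test_input_batch.py, constructing a target_model_input_batch that contains token_ids_all / prompt_lens so that fastdeploy/worker/input_batch.py:973 is covered.
If that line cannot be covered reliably in the unit-test environment, extract the token_ids_all initialization logic in reset_model_inputs() into a small testable function and unit-test the CUDA/token_ids_all branch directly.

Suggested fix summary: add a test for the reset_model_inputs branch

Related change: fastdeploy/worker/input_batch.py:973 (token_ids_all changed from a clone to a reference to the target tensor)
Link: view logs

Approval: manual approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is given.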

freeliuzc · 2026-05-19T13:00:47Z

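The overall implementation looks fine, but a credible performance and acceptance-rate report is missing to back up functional correctness.
Following FastDeploy/benchmarks/README.md, use the filtered_sharedgpt_2000_input_1136_output_200_fd dataset to produce a performance report for non-spec / ngram / mtp (1 step and 3 steps) / mtp (3 steps) + hybrid, together with the acceptance-rate statistics from speculate.log.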

PaddlePaddle-bot · 2026-05-20T18:34:06Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 17:36:22

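This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: b99c613
Merge base: 6bdbdc9 (branch: develop)
View full diff
CI details

1 Job overview

All required jobs passed (10/10), so no CI failure currently blocks merging. Among the optional jobs, 2 failed and 1 is pending; these are for reference only.

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
60(14)	46	42	2	0	1	1

2 Job status summary

Log column: failed jobs link directly to the CI details; running or pending jobs show a Job link when one exists.

2.1 Required jobs: 10/10 passed

Required jobs block merging; failures should be handled first.

Status	Job	Duration	Root cause	Suggested fix	Log	Rerun
✅	All 10 required jobs passed	-	-	-	-	-

2.2 Optional jobs: 32/36 passed

Optional jobs do not block merging; failures are for reference only.

Status	Job	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	2m11s	Job	-
❌	`Trigger Jenkins for PR`	21s	Job	-
⏸️	`CI_HPU`	-	-	-
⏭️	`cherry-pick`	-	-	-
✅	The remaining 32 optional jobs passed	-	-	-

3 Failure details (required only)

No required jobs failed.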

NKNaN · 2026-05-22T10:18:48Z

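Further testing showed that with max-num-seqs=128 and max-model-len=32768 (previously only max-num-seqs=8, max-model-len=4096 had been tested), the hybrid configuration with cudagraph enabled fails at service startup: the target model cudagraph piece replay stage reports CUDA error(700) illegal memory access.

Cause: the hybrid_mtp_ngram kernel rewrites seq_lens_this_time to 1 + num_model_steps + ngram_hits, where ngram_hits varies with the actual number of ngram hits. cudagraph bakes the launch parameters (grid dim, and the sizes of index tensors in kernel args that depend on seq_lens_this_time) into the graph nodes at capture time, so the captured seq_lens_this_time is whatever specific value hybrid wrote during the dummy run. At replay, hybrid may write a different seq_lens_this_time. The downstream kernels (attention / append_kv / sampler logits gather) still launch with the capture-time grid, and the surplus thread blocks index out of bounds through stale cu_seqlens / slot_mapping and hit unmapped pages → CUDA 700.

Fix: at the end of the hybrid kernel's Phase 2 gather, when pad_to_max=True (that is, when target cudagraph is enabled), always pad seq_lens_this_time to num_speculative_tokens + 1, filling the missing positions with the last valid draft token as a placeholder.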
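
The padding rule can be sketched as a host-side NumPy reference. This is not the CUDA kernel itself: the function name `pad_draft_to_max` is hypothetical, and skipping slots whose seq_lens_this_time is 0 is an assumption about how inactive batch items are treated:

```python
import numpy as np


def pad_draft_to_max(draft_tokens, seq_lens_this_time, num_speculative_tokens):
    """Pad every active row in place to num_speculative_tokens + 1 tokens.

    draft_tokens:        int array of shape [batch, >= num_speculative_tokens + 1]
    seq_lens_this_time:  int array of shape [batch]; 0 is assumed to mean inactive
    """
    target_len = num_speculative_tokens + 1
    for i in range(len(seq_lens_this_time)):
        actual = int(seq_lens_this_time[i])
        if actual == 0 or actual >= target_len:
            continue  # inactive slot, or already at the fixed length
        # Repeat the last valid draft token as a placeholder.
        draft_tokens[i, actual:target_len] = draft_tokens[i, actual - 1]
        seq_lens_this_time[i] = target_len


draft = np.array(
    [[11, 12, 13, 0, 0],   # 3 valid tokens
     [21, 0, 0, 0, 0],     # 1 valid token
     [0, 0, 0, 0, 0]],     # inactive slot
    dtype=np.int64,
)
lens = np.array([3, 1, 0], dtype=np.int32)
pad_draft_to_max(draft, lens, num_speculative_tokens=4)
print(draft)
print(lens)
```

With every active row reporting the same length, the shapes seen by downstream kernels at replay match those recorded at capture time. The placeholder tokens are expected to be rejected during verification like any other wrong draft token.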

NKNaN · 2026-05-25T10:48:59Z

Performance test report: https://github.com/NKNaN/Hybrid-Mtp-Ngram-E2E

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-28 14:55:43

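📋 Review summary

PR overview: adds CUDAGraph compatibility for hybrid MTP-ngram speculative decoding (padding seq_lens_this_time), renames the interface (input_ids→token_ids_all), removes redundant D→H copies, and adds an E2E test.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/, fastdeploy/worker/input_batch.py, tests/
Impact tags: [Speculative Decoding] [OP]

Issues

Severity	File	Summary
🟡 Suggestion	`tests/operators/test_hybrid_mtp_ngram.py:89`	The `pad_to_max=True` path has no kernel-level unit test and is only covered indirectly by E2E

Status of previous findings

Finding	Issue	Status
F1	The `accept_ratio_per_head` baseline has only 5 elements while `accepted_tokens_per_head` has 6	⚠️ Still present
F2	The static scratch buffer is not thread-safe	⚠️ Still present
F3	Strict integer comparison of `accepted_tokens`/`rejected_tokens` may be environment-sensitive	⚠️ Still present

📝 PR convention check

The title contains a valid tag, [Speculative Decoding], and the description is structurally complete (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist are all filled in), so the PR complies with the conventions.

Overall assessment

The approach is clear: the pad_to_max parameter keeps seq_lens_this_time consistent between cudagraph capture and replay, and removing the redundant CPU tensors and D→H copies is a reasonable performance optimization. A kernel-level unit test for pad_to_max=True is recommended to improve debuggability.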

PaddlePaddle-bot · 2026-05-28T06:59:12Z

@@ -81,6 +89,7 @@ def test_ngram_match_mixed(self):
            self.max_ngram_size,


🟡 Suggestion: the pad_to_max=True path has no kernel-level unit test

Only pad_to_max=False is tested at present. The CUDAGraph padding logic (padding seq_lens_this_time to max_draft_tokens + 1 and writing placeholder tokens) is covered only indirectly by the E2E test.

Consider adding a test_ngram_match_mixed_pad_to_max case that passes pad_to_max=True and asserts:

seq_lens_this_time equals max_draft_tokens + 1 for every active batch item

draft_tokens[actual:target_slt] is filled with the last valid draft token

    def test_ngram_match_mixed_pad_to_max(self):
        """pad_to_max=True: seq_lens_this_time is padded to max_draft_tokens+1."""
        hybrid_mtp_ngram(
            self.token_ids_all,
            self.prompt_lens,
            self.pre_ids,
            self.step_idx,
            self.draft_token_num,
            self.draft_tokens,
            self.seq_lens_this_time,
            self.seq_lens_decoder,
            self.max_dec_len,
            self.max_ngram_size,
            self.min_ngram_size,
            self.max_draft_tokens,
            True,  # pad_to_max
        )
        expected_slt = self.max_draft_tokens + 1
        np.testing.assert_array_equal(
            self.seq_lens_this_time.numpy(),
            np.full_like(self.ref_seq_lens_this_time, expected_slt),
        )

freeliuzc · 2026-05-28T09:22:48Z

LGTM

NKNaN had a problem deploying to Metax_ci May 19, 2026 03:49 — with GitHub Actions Error

paddle-bot Bot added the contributor External developers label May 19, 2026

NKNaN had a problem deploying to Metax_ci May 19, 2026 04:14 — with GitHub Actions Failure

freeliuzc requested a review from Copilot May 19, 2026 07:17

Copilot started reviewing on behalf of freeliuzc May 19, 2026 07:18

Copilot AI reviewed May 19, 2026

Comment thread tests/e2e/test_ernie_21b_mtp_ngram.py Outdated

NKNaN had a problem deploying to Metax_ci May 19, 2026 07:48 — with GitHub Actions Error

NKNaN force-pushed the spec-mtp-ngram branch from 0488653 to 87b0941 Compare May 19, 2026 07:52

NKNaN had a problem deploying to Metax_ci May 19, 2026 07:52 — with GitHub Actions Error

NKNaN temporarily deployed to Metax_ci May 19, 2026 07:56 — with GitHub Actions Inactive

luotao1 mentioned this pull request May 20, 2026

【Hackathon 10th】Open Source Contribution Individual Challenge · Spring Festival Special Season PaddlePaddle/Paddle#77429

Open

NKNaN had a problem deploying to Metax_ci May 20, 2026 05:55 — with GitHub Actions Failure

NKNaN force-pushed the spec-mtp-ngram branch from d146629 to f70bfb2 Compare May 20, 2026 05:59

NKNaN temporarily deployed to Metax_ci May 20, 2026 05:59 — with GitHub Actions Inactive

NKNaN had a problem deploying to Metax_ci May 20, 2026 09:22 — with GitHub Actions Failure

luotao1 added the PaddlePaddle Hackathon label May 22, 2026

luotao1 assigned luotao1 and freeliuzc May 22, 2026

NKNaN had a problem deploying to Metax_ci May 25, 2026 10:32 — with GitHub Actions Failure

NKNaN had a problem deploying to Metax_ci May 25, 2026 10:57 — with GitHub Actions Failure

NKNaN had a problem deploying to Metax_ci May 25, 2026 11:04 — with GitHub Actions Failure

NKNaN had a problem deploying to Metax_ci May 26, 2026 01:10 — with GitHub Actions Failure

freeliuzc previously approved these changes May 28, 2026

NKNaN and others added 10 commits May 28, 2026 12:00

update hybrid mtp ngram kernel signature

b2bfc96

update unittests

d80757e

Potential fix for pull request finding

de69229

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

codestyle fix

1fd6dc3

fix test

7ed8d05

update test

0a2b51e

update hybrid kernel to adapt cudagraph

a2cae2c

update ngram kernel with the same cudagraph adapting logic

d103a77

update test

1aaef55

fix unittest

d14fc42

NKNaN dismissed freeliuzc’s stale review via d14fc42 May 28, 2026 04:01

NKNaN force-pushed the spec-mtp-ngram branch from 7c4d846 to d14fc42 Compare May 28, 2026 04:01

NKNaN had a problem deploying to Metax_ci May 28, 2026 04:01 — with GitHub Actions Failure

Merge branch 'develop' into spec-mtp-ngram

b99c613

plusNew001 had a problem deploying to Metax_ci May 28, 2026 06:48 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 28, 2026

freeliuzc approved these changes May 28, 2026

freeliuzc merged commit 6701886 into PaddlePaddle:develop May 28, 2026
50 of 57 checks passed
