Fuse gdr preprocess #4656

Open
grimoire wants to merge 3 commits into InternLM:main from grimoire:fuse-gdr-preprocess
Conversation

@grimoire grimoire commented Jun 8, 2026

Collaborator

This PR fuses the preprocessing step of the gated delta rule.

Requirements

@grimoire grimoire force-pushed the fuse-gdr-preprocess branch from 094c02e to 4eecbd6 Compare June 9, 2026 02:59
@grimoire grimoire marked this pull request as ready for review June 9, 2026 03:00
Copilot AI review requested due to automatic review settings June 9, 2026 03:00

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a fused “preprocess” path for Gated Delta Rule inputs (q/k replication + optional q/k L2-norm + beta/g computation + init-token masking), and wires models/backends to pass raw (b, a, dt_bias, a_log_exp) instead of precomputed (beta, g).

Changes:

  • Added a Triton kernel (gated_delta_preprocess) and CUDA-backend hook to fuse q/k/b/a preprocessing when use_qk_l2norm_in_kernel is enabled.
  • Updated GatedDelta call path + Qwen3 models to pass raw (b, a, dt_bias, a_log_exp) and rely on backend preprocessing.
  • Added new kernel tests covering both 3D/4D (b, a) layouts and decoding/prefill behavior.
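
For orientation, the per-token math that such a fused preprocess step would cover can be sketched in dependency-free Python. This is not the PR's Triton kernel. The formulas `beta = sigmoid(b)` and `g = -a_log_exp * softplus(a + dt_bias)` are assumed from the gating used in public Qwen3-Next reference code, and the function names, the `eps` value, the interleaved head replication, and the order of normalization and replication are illustrative assumptions as well. Init-token masking is left out because the overview does not describe its semantics.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def l2norm(vec, eps=1e-6):
    # eps is an assumed stabilizer, not a value taken from the PR.
    inv = 1.0 / math.sqrt(sum(v * v for v in vec) + eps)
    return [v * inv for v in vec]


def repeat_heads(heads, n_rep):
    # Interleaved replication: [h0, h0, h1, h1, ...] (assumed layout).
    return [list(h) for h in heads for _ in range(n_rep)]


def preprocess_token(q, k, b, a, dt_bias, a_log_exp, use_qk_l2norm=True):
    """Unfused reference for one token (hypothetical helper).

    q, k: one vector per key head.
    b, a, dt_bias, a_log_exp: one scalar per value head.
    """
    n_rep = len(b) // len(q)  # value heads per key head
    if use_qk_l2norm:
        q = [l2norm(h) for h in q]
        k = [l2norm(h) for h in k]
    q, k = repeat_heads(q, n_rep), repeat_heads(k, n_rep)
    beta = [sigmoid(x) for x in b]
    g = [-s * softplus(x + d) for x, d, s in zip(a, dt_bias, a_log_exp)]
    return q, k, beta, g


q, k, beta, g = preprocess_token(
    q=[[3.0, 4.0]], k=[[1.0, 0.0]],
    b=[0.0, 0.0], a=[0.0, 0.0], dt_bias=[0.0, 0.0], a_log_exp=[1.0, 2.0],
)
```

A fused kernel would compute the same four outputs in one launch rather than as separate elementwise passes, which is presumably the motivation for the change.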

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/pytorch/kernel/test_gated_delta_preprocess.py | Adds tests validating fused preprocess outputs vs references/default prepare logic. |
| lmdeploy/pytorch/nn/gated_delta.py | Routes preprocessing through backend prepare_inputs; updates GatedDelta signature to accept raw (b, a, dt_bias, a_log_exp). |
| lmdeploy/pytorch/models/qwen3_next.py | Stops precomputing beta/g in-model; delegates to GatedDelta/backend. |
| lmdeploy/pytorch/models/qwen3_5.py | Stops precomputing beta/g in-model; delegates to GatedDelta/backend. |
| lmdeploy/pytorch/kernels/cuda/gated_delta_preprocess.py | New Triton-based fused preprocess kernel + Python wrapper. |
| lmdeploy/pytorch/backends/gated_delta_rule.py | Adds default prepare_inputs implementation to standardize preprocessing. |
| lmdeploy/pytorch/backends/cuda/gated_delta_rule.py | Overrides prepare_inputs to invoke fused preprocess kernel when enabled. |
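
The backend split described for the two `gated_delta_rule.py` files can be pictured as a default `prepare_inputs` that a CUDA-specific subclass overrides. The sketch below is hypothetical: the class names, the reduced argument list (only the beta/g part), the `fused_preprocess` placeholder, and the returned path tag are illustrative assumptions and do not reproduce lmdeploy's actual signatures.

```python
import math


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class GatedDeltaRuleImpl:
    """Hypothetical backend-agnostic base with an unfused default."""

    def prepare_inputs(self, b, a, dt_bias, a_log_exp):
        beta = [1.0 / (1.0 + math.exp(-x)) for x in b]
        g = [-s * _softplus(x + d) for x, d, s in zip(a, dt_bias, a_log_exp)]
        return beta, g, "default"


def fused_preprocess(b, a, dt_bias, a_log_exp):
    # Placeholder for a single fused kernel launch; it reuses the
    # reference math so the sketch stays runnable without a GPU.
    beta, g, _ = GatedDeltaRuleImpl().prepare_inputs(b, a, dt_bias, a_log_exp)
    return beta, g


class CudaGatedDeltaRuleImpl(GatedDeltaRuleImpl):
    """Hypothetical CUDA backend that opts into the fused path."""

    def __init__(self, use_qk_l2norm_in_kernel):
        self.use_qk_l2norm_in_kernel = use_qk_l2norm_in_kernel

    def prepare_inputs(self, b, a, dt_bias, a_log_exp):
        if not self.use_qk_l2norm_in_kernel:
            # Fall back to the shared default when fusion is disabled.
            return super().prepare_inputs(b, a, dt_bias, a_log_exp)
        beta, g = fused_preprocess(b, a, dt_bias, a_log_exp)
        return beta, g, "fused"


args = ([0.0], [0.0], [0.0], [1.0])
fused = CudaGatedDeltaRuleImpl(True).prepare_inputs(*args)
fallback = CudaGatedDeltaRuleImpl(False).prepare_inputs(*args)
```

Under this arrangement model code passes the raw `(b, a, dt_bias, a_log_exp)` values and never needs to know which path ran, and comparing the two return values gives the kind of fused-versus-default check the new tests are described as performing.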

Comment thread lmdeploy/pytorch/nn/gated_delta.py
Comment thread lmdeploy/pytorch/kernels/cuda/gated_delta_preprocess.py Outdated
Comment thread tests/pytorch/kernel/test_gated_delta_preprocess.py Outdated
Comment thread tests/pytorch/kernel/test_gated_delta_preprocess.py Outdated
Comment thread tests/pytorch/kernel/test_gated_delta_preprocess.py Outdated
@grimoire grimoire changed the title from "[WIP] Fuse gdr preprocess" to "Fuse gdr preprocess" Jun 9, 2026
2 participants