Add prefill-decode and batch-prefill-decode for Qwen3 (FP16 and Q8_0) #122

Merged
orionpapadakis merged 5 commits into main from feat/qwen3-prefill-decode
Jun 15, 2026

@orionpapadakis

Summary

This PR extends the prefill-decode and batch-prefill-decode GPU inference paths to Qwen3 models for both FP16 and Q8_0 quantizations. It also adds corresponding coverage in CI.

The core challenge is that Qwen3 diverges from Llama in two ways that affect every attention kernel:

  • Per-head QK RMSNorm: Qwen3 applies a separate RMSNorm to each query and key head before RoPE. This requires dedicated parallel reduction kernels (Qwen3Kernels.rmsnormReductionWithParallelOffset /
    rmsnormNormalisationWithParallelOffset) and cannot reuse the scalar Llama path.
  • GQA layout mismatch: Qwen3Configuration.headSize(), kvDim(), and kvMul() throw for Qwen3 (dimensions are stored differently). All new Qwen3 layers derive head dimensions from nEmbdHeadK /
    nEmbdHeadV / nHeadKv directly.
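
As a rough illustration of the per-head QK RMSNorm described above, here is a sequential Java reference. It is not the TornadoVM kernel code from this PR: the class name PerHeadRmsNorm, the method signature, and the assumption that a single weight vector of length headDim is shared by all heads are hypothetical. The per-head offset in the loop stands in for the offset that the GPU kernels presumably compute per work-group.

```java
/** Sequential sketch of per-head RMSNorm over a flat [nHeads * headDim] vector. */
final class PerHeadRmsNorm {
    private PerHeadRmsNorm() {}

    /**
     * Normalises each head of x in place.
     * Assumption: weight has length headDim and is shared across heads.
     */
    static void normalizeHeads(float[] x, float[] weight, int nHeads, int headDim, float eps) {
        if (x.length != nHeads * headDim) {
            throw new IllegalArgumentException("x must hold nHeads * headDim values");
        }
        if (weight.length != headDim) {
            throw new IllegalArgumentException("weight must hold headDim values");
        }
        for (int h = 0; h < nHeads; h++) {
            int offset = h * headDim; // start of this head in the flat vector

            // Reduction step: sum of squares over this head only.
            float sumSq = 0f;
            for (int i = 0; i < headDim; i++) {
                float v = x[offset + i];
                sumSq += v * v;
            }

            // Normalisation step: scale by 1 / sqrt(mean square + eps), then apply the weight.
            float scale = (float) (1.0 / Math.sqrt(sumSq / headDim + eps));
            for (int i = 0; i < headDim; i++) {
                x[offset + i] = weight[i] * (x[offset + i] * scale);
            }
        }
    }
}
```

The same routine would be applied once to the query vector (with the number of query heads) and once to the key vector (with the number of KV heads) before RoPE. A scalar whole-vector RMSNorm cannot be substituted, because each head needs its own sum of squares.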

State changes

State.qDim and State.kvDim are now model-agnostic (they previously carried Llama assumptions), so batch-prefill activations can be sized correctly for Qwen3's GQA layout. LlamaState and LlamaConfiguration
references in the batch-prefill/batch-decode activations are replaced with the base State and Configuration types.
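
To make the sizing concrete, the sketch below derives qDim, kvDim, and kvMul from the per-head fields named above and computes flat batch buffer sizes. The AttentionDims class and its methods are hypothetical and only illustrate the arithmetic. It also assumes equal K and V head sizes, which this PR does not state.

```java
/** Hypothetical helper that derives attention dimensions from per-head sizes. */
final class AttentionDims {
    final int qDim;  // total width of the query vector for one token
    final int kvDim; // total width of the key (or value) vector for one token
    final int kvMul; // number of query heads sharing one KV head (GQA factor)

    AttentionDims(int nHeads, int nHeadKv, int nEmbdHeadK, int nEmbdHeadV) {
        if (nHeadKv <= 0 || nHeads % nHeadKv != 0) {
            throw new IllegalArgumentException("nHeads must be a positive multiple of nHeadKv");
        }
        if (nEmbdHeadK != nEmbdHeadV) {
            // Assumption of this sketch only: K and V heads have the same size.
            throw new IllegalArgumentException("this sketch assumes equal K and V head sizes");
        }
        this.qDim = nHeads * nEmbdHeadK;
        this.kvDim = nHeadKv * nEmbdHeadV;
        this.kvMul = nHeads / nHeadKv;
    }

    /** Number of floats needed for the Q activations of a whole prefill batch. */
    int batchQSize(int batchSize) {
        return batchSize * qDim;
    }

    /** Number of floats needed for the K (or V) activations of a whole prefill batch. */
    int batchKvSize(int batchSize) {
        return batchSize * kvDim;
    }
}
```

For example, with 16 query heads, 8 KV heads, and a head size of 128 (values I believe correspond to Qwen3-0.6B, whose embedding width is 1024), qDim is 2048 and kvDim is 1024. A Llama-style dim / nHeads head size would therefore give the wrong buffer sizes.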

Verification

export MODEL=~/LLMModels/Qwen3-0.6B-Q8_0.gguf      # or Qwen3-4B-f16.gguf
export PROMPT="Explain Newton's second law"

# Single-token (unchanged baseline)
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256

# Prefill-decode
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode

# Batch-prefill-decode
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode --batch-prefill-size 32

# Batch-prefill-decode + CUDA graphs (PTX only)
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

All existing Llama FP16 and Q8_0 paths (single-token, prefill-decode, batch-prefill-decode, CUDA graphs) are unaffected.

CI

Added CI steps for all four configurations × two quantizations, mirroring the existing Llama coverage:
- FP16 / Q8_0 — prefill-decode and batch-prefill-decode (both backends)
- PTX — prefill-decode-cuda-graphs and batch-prefill-decode-cuda-graphs

Comment on lines +198 to +199
WorkerGrid rmsWorker = WorkerGridFactory.genericWorker(batchSize, 1);
WorkerGrid rmsApplyWorker = WorkerGridFactory.genericWorker(batchSize * dim, 256);

does it pass code formatting?

@stratika stratika left a comment

LGTM, just minor comments/questions. I tested it on macOS and Linux (with PTX).

@orionpapadakis orionpapadakis merged commit 00bab9e into main Jun 15, 2026
9 checks passed
Labels

enhancement (New feature or request), prefill-decode
