Add prefill-decode and batch-prefill-decode for Qwen3 (FP16 and Q8_0) by orionpapadakis · Pull Request #122 · beehive-lab/GPULlama3.java

orionpapadakis · 2026-06-11T13:33:32Z

Summary

This PR extends the prefill-decode and batch-prefill-decode GPU inference paths to Qwen3 models for both FP16 and Q8_0 quantizations. It also adds corresponding coverage in CI.

The core challenge is that Qwen3 diverges from Llama in two ways that affect every attention kernel:

- Per-head QK RMSNorm: Qwen3 applies a separate RMSNorm to each query and key head before RoPE. This requires dedicated parallel reduction kernels (Qwen3Kernels.rmsnormReductionWithParallelOffset / rmsnormNormalisationWithParallelOffset) and cannot reuse the scalar Llama path.
- GQA layout mismatch: Qwen3Configuration.headSize(), kvDim(), and kvMul() throw for Qwen3 (dimensions are stored differently). All new Qwen3 layers derive head dimensions from nEmbdHeadK / nEmbdHeadV / nHeadKv directly.
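
For readers unfamiliar with per-head QK RMSNorm, the following is a minimal CPU reference sketch of the operation, not the GPU kernels added in this PR. It assumes the heads are stored contiguously in one flat array and that a single weight vector of length headDim is shared across all heads; the class name PerHeadRmsNorm, the method rmsNormPerHead, and the eps parameter are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Illustrative CPU reference only; names here are assumptions, not part of the PR.
public final class PerHeadRmsNorm {

    /**
     * Normalises each head of x in place.
     * x holds nHeads contiguous heads of headDim floats each;
     * weight holds headDim scale factors (assumed shared across heads).
     */
    public static void rmsNormPerHead(float[] x, float[] weight, int nHeads, int headDim, float eps) {
        for (int h = 0; h < nHeads; h++) {
            int offset = h * headDim;

            // Reduction step: sum of squares over this head only.
            float sumSquares = 0f;
            for (int i = 0; i < headDim; i++) {
                float v = x[offset + i];
                sumSquares += v * v;
            }

            // Normalisation step: scale by 1 / sqrt(mean square + eps).
            float scale = (float) (1.0 / Math.sqrt(sumSquares / headDim + eps));
            for (int i = 0; i < headDim; i++) {
                x[offset + i] = weight[i] * (x[offset + i] * scale);
            }
        }
    }

    public static void main(String[] args) {
        // Two heads of size 4, normalised independently of each other.
        float[] q = {3f, 4f, 0f, 0f, 1f, 1f, 1f, 1f};
        float[] w = {1f, 1f, 1f, 1f};
        rmsNormPerHead(q, w, 2, 4, 0f);
        System.out.println(Arrays.toString(q));
    }
}
```

Because each head has its own reduction, a GPU version needs one reduction per (token, head) pair at a distinct offset, which is presumably why the kernel names above carry the "WithParallelOffset" suffix.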

State changes

State.qDim and State.kvDim are made model-agnostic (previously carried Llama assumptions) so batch-prefill activations can be correctly sized for Qwen3's GQA layout. LlamaState and LlamaConfiguration
references in batch-prefill/batch-decode activations are replaced with the base State and Configuration types.
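
As a rough illustration of why the sizes cannot be taken from Llama-style accessors, the sketch below derives the query and key/value widths from explicit per-head dimensions. The class Qwen3Dims, the record AttentionDims, and the method derive are hypothetical; the formulas (qDim = nHeads × nEmbdHeadK, kvDim = nHeadKv × nEmbdHeadV, kvMul = nHeads / nHeadKv) are an assumption about how such a derivation could look, not a copy of the PR's code, and the numbers in main are illustrative values only.

```java
// Illustrative sketch; class, record, and method names are assumptions, not the PR's API.
public final class Qwen3Dims {

    /** Sizes needed to allocate per-token Q and K/V activations. */
    public record AttentionDims(int qDim, int kvDim, int kvMul) {}

    public static AttentionDims derive(int nHeads, int nHeadKv, int nEmbdHeadK, int nEmbdHeadV) {
        if (nHeadKv <= 0 || nHeads % nHeadKv != 0) {
            throw new IllegalArgumentException("nHeads must be a positive multiple of nHeadKv");
        }
        int qDim = nHeads * nEmbdHeadK;   // total query width; may differ from the embedding dim
        int kvDim = nHeadKv * nEmbdHeadV; // total key/value width (assumes K and V head sizes match)
        int kvMul = nHeads / nHeadKv;     // query heads sharing each KV head (GQA factor)
        return new AttentionDims(qDim, kvDim, kvMul);
    }

    public static void main(String[] args) {
        // Illustrative values: 16 query heads, 8 KV heads, head size 128.
        AttentionDims d = derive(16, 8, 128, 128);
        int batchSize = 32;
        // A batch-prefill Q buffer would then hold batchSize * qDim elements.
        System.out.println(d + " -> Q buffer elements: " + (batchSize * d.qDim()));
    }
}
```

With these illustrative values the query width is 2048, so any buffer sized as embedding dimension × batch size under a "head size = dim / nHeads" assumption would be wrong whenever the head size is stored independently of the embedding dimension.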

Verification

export MODEL=~/LLMModels/Qwen3-0.6B-Q8_0.gguf      # or Qwen3-4B-f16.gguf
export PROMPT="Explain Newton's second law"

# Single-token (unchanged baseline)
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256

# Prefill-decode
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode

# Batch-prefill-decode
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode --batch-prefill-size 32

# Batch-prefill-decode + CUDA graphs (PTX only)
./llama-tornado --gpu --ptx --model $MODEL --prompt "$PROMPT" --max-tokens 256 \
  --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

All existing Llama FP16 and Q8_0 paths (single-token, prefill-decode, batch-prefill-decode, CUDA graphs) are unaffected.

CI

Added CI steps for all four configurations × two quantizations, mirroring the existing Llama coverage:
- FP16 / Q8_0 — prefill-decode and batch-prefill-decode (both backends)
- PTX — prefill-decode-cuda-graphs and batch-prefill-decode-cuda-graphs

stratika · 2026-06-12T11:29:37Z

+        WorkerGrid rmsWorker        = WorkerGridFactory.genericWorker(batchSize, 1);
+        WorkerGrid rmsApplyWorker   = WorkerGridFactory.genericWorker(batchSize * dim, 256);


Does it pass code formatting?

stratika

LGTM, just minor comments/questions. I tested it on macOS and Linux (with PTX).

orionpapadakis added 5 commits June 11, 2026 16:19

[prf/dec] Replace LlamaState and LlamaConfiguration with generic `State` and `Configuration` in batch-prefill and batch-decode activations

9ee34a1

[prf/dec] Make State's qDim and kvDim model-agnostic for batch-prefill activations

d23d10e

[prf/dec] Add initial impl of prefill-decode and batch-prefill-decode for Qwen3 models and FP16 and Q8_0 quantizations

474255c

[prf/dec][ci] Add prefill-decode variants ci steps for Qwen3

f5d002a

[prf/dec][ci] Use Qwen3-0.6B instead of Qwen3-4B in CI workflows

e0cddbe

orionpapadakis requested review from mikepapadim and stratika June 11, 2026 15:45

orionpapadakis added the enhancement (New feature or request) and prefill-decode labels Jun 11, 2026

stratika reviewed Jun 12, 2026

stratika approved these changes Jun 12, 2026

mikepapadim approved these changes Jun 12, 2026

orionpapadakis merged commit 00bab9e into main Jun 15, 2026
9 checks passed

orionpapadakis merged 5 commits into main from feat/qwen3-prefill-decode
