perf(native-cpu): Q6_K NEON matmul kernel #768

Merged
michalharakal merged 2 commits into develop from perf/q6k-neon-kernel
Jun 27, 2026

@michalharakal

What

Adds a Q6_K NEON matmul kernel to skainet-backend-native-cpu. Q6_K was the last quant format that still fell back to scalar on the Kotlin/Native board path: Q4_K / Q5_K / Q8_0 / Q4_0 were already bound, which left TinyLlama's 21 Q6_K tensors as the remaining scalar fallback.

How

  • native/src/q6k_matmul.c: canonical 256-element / 210-byte Q6_K super-block (128-byte ql low nibbles + 64-byte qh high-2-bit plane + 16 int8 scales + FP16 d), dequantized as d * scale * (code - 32). The 6-bit bit-assembly stays scalar: it is an exact transcription of ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K, left to the compiler to auto-vectorize under -O3. It feeds a NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON, with a scalar dot on x64.
  • skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
  • NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K() route.
  • NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K() route.
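
For readers unfamiliar with the format, the super-block decode described in the first bullet can be sketched roughly as follows. This is a minimal illustration modelled on ggml's dequantize_row_q6_K, not the contents of q6k_matmul.c. The names q6k_dequant_block and fp16_to_fp32 are hypothetical, and the byte offsets assume the ql | qh | scales | d ordering with a little-endian FP16 d.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: portable IEEE 754 half -> float conversion. */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                          /* signed zero */
        } else {
            exp = 113;                            /* renormalize a subnormal */
            while ((man & 0x400u) == 0) { man <<= 1; exp--; }
            bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);  /* inf / NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize one 210-byte Q6_K super-block into 256 floats.
 * Assumed layout: ql[128] | qh[64] | scales[16] (int8) | d (FP16, LE). */
static void q6k_dequant_block(const uint8_t *blk, float *y) {
    const uint8_t *ql = blk;
    const uint8_t *qh = blk + 128;
    const int8_t  *sc = (const int8_t *)(blk + 192);
    const float d = fp16_to_fp32((uint16_t)(blk[208] | (blk[209] << 8)));

    /* Two halves of 128 elements; each qh byte carries the top 2 bits
     * of four different 6-bit codes. */
    for (int half = 0; half < 2; ++half) {
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16;  /* which 16-element scale group */
            const int q1 = ((ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int q2 = ((ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int q3 = ((ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int q4 = ((ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4)) - 32;
            y[l]      = d * (float)sc[is]     * (float)q1;
            y[l + 32] = d * (float)sc[is + 2] * (float)q2;
            y[l + 64] = d * (float)sc[is + 4] * (float)q3;
            y[l + 96] = d * (float)sc[is + 6] * (float)q4;
        }
        y += 128; ql += 64; qh += 32; sc += 8;
    }
}
```

The inner loop has no cross-iteration dependency, which is what makes a plain scalar transcription a reasonable candidate for compiler auto-vectorization.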

Validation

  • NativeQ6KMatmulKernelParityTest (jvmTest, vs PanamaVectorQ6_KMatmulKernel): 5/5 pass. On Apple Silicon the bundled dylib is compiled with -march=armv8.2-a+fp16+dotprod, so this exercises the actual NEON code path on the host.
  • NativeKnQ6KMatmulKernelParityTest (linuxX64Test, vs ScalarQ6_KMatmulKernel): compiles cleanly; runs on Linux/CI.
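
As a rough picture of what such a parity check compares, the sketch below pairs a guarded NEON dot product (vfmaq_f32 + vaddvq_f32) with a scalar reference and a relative-tolerance comparison. The function names and the tolerance are assumptions for illustration only; the actual tests and their thresholds are not shown in this PR description.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference dot product. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

/* Hypothetical kernel-side dot: NEON on AArch64, scalar elsewhere. */
static float dot_neon_or_scalar(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* acc += a[i..i+3] * b[i..i+3], fused multiply-add per lane */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    sum = vaddvq_f32(acc);  /* horizontal add of the four lanes */
#endif
    for (; i < n; ++i) sum += a[i] * b[i];  /* tail, or everything on x64 */
    return sum;
}

/* Returns 1 when got is within rel_tol of ref; near zero the bound
 * falls back to an absolute rel_tol. The tolerance is an assumption. */
static int within_tolerance(float got, float ref, float rel_tol) {
    const float diff = got > ref ? got - ref : ref - got;
    const float mag  = ref < 0.0f ? -ref : ref;
    return diff <= rel_tol * (mag > 1.0f ? mag : 1.0f);
}
```

A tolerance is needed rather than exact equality because the NEON path accumulates in four lanes with fused multiply-adds, so its rounding order differs from the scalar loop.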

Follow-up (not in this PR)

  • Cross-build the aarch64 archive (-PcrossArm64=true) and measure tok/s on the board (SL2619, Cortex-A55). Both need the cross toolchain and the board hardware.

🤖 Generated with Claude Code

michalharakal and others added 2 commits June 26, 2026 22:16
…back format

TinyLlama's 21 Q6_K tensors fell back to scalar on the Kotlin/Native board
path (only Q4_K/Q5_K/Q8_0/Q4_0 were bound). This adds the Q6_K kernel so the
board (linuxArm64) NEON-accelerates them too.

- native/src/q6k_matmul.c: canonical 256-elem/210-byte Q6_K super-block
  (ql + qh high-2-bit plane + 16 int8 scales + FP16 d), dequant
  d * scale * (code - 32). Scalar 6-bit bit-assembly (exact transcription of
  ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K, auto-vectorized) feeding a
  NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON; scalar dot on x64.
- skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
- NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K route.
- NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K route.
- Parity tests: NativeQ6KMatmulKernelParityTest (jvmTest vs Panama) and
  NativeKnQ6KMatmulKernelParityTest (linuxX64Test vs scalar). The JVM test runs
  on Apple Silicon with the NEON path compiled in (macOS arm64 → -march=armv8.2-a)
  — 5/5 pass, validating the NEON code path on the host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kotlinWasmStoreYarnLock fails build-job on a stale wasm lock (ws npm
transitive drift); regenerate via kotlinWasmUpgradeYarnLock so the
aggregate build-job goes green. Pre-existing failure on develop HEAD,
unrelated to the Q6_K kernel — fixing on this branch to unblock #768.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 27138a5 into develop Jun 27, 2026
6 of 7 checks passed
@michalharakal michalharakal deleted the perf/q6k-neon-kernel branch June 27, 2026 12:48