perf(native-cpu): Q6_K NEON matmul kernel by michalharakal · Pull Request #768 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-26T20:18:28Z

What

Adds a Q6_K NEON matmul kernel to skainet-backend-native-cpu, binding the last quant format that fell back to scalar on the Kotlin/Native board path. Q4_K / Q5_K / Q8_0 / Q4_0 were already bound; TinyLlama's 21 Q6_K tensors were the remaining scalar fallback.

How

native/src/q6k_matmul.c — canonical 256-element / 210-byte Q6_K super-block (ql low nibbles + qh high-2-bit plane + 16 int8 scales + FP16 d), dequant d * scale * (code - 32). The 6-bit bit-assembly is an exact transcription of ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K (auto-vectorized under -O3), feeding a NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON; scalar dot on x64.
skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K() route.
NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K() route.
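For readers unfamiliar with the layout, the per-super-block dequant described above can be sketched in C as follows. This is a minimal illustration modeled on ggml's dequantize_row_q6_K, not the contents of q6k_matmul.c: the function name q6k_dequant_block is hypothetical, and d is assumed to be already converted from FP16 to float so the sketch stays short.

```c
#include <stdint.h>

#define QK_K 256 /* elements per Q6_K super-block */

/* Hypothetical helper (not the PR's code): dequantize one Q6_K super-block.
 *   ql     : 128 bytes, low 4 bits of each 6-bit code (two codes per byte)
 *   qh     :  64 bytes, high 2 bits of each code (four codes per byte)
 *   scales :  16 int8 scales, one per 16 elements
 *   d      : super-block scale, assumed already converted from FP16
 * Stored size is 128 + 64 + 16 + 2 (FP16 d) = 210 bytes. */
void q6k_dequant_block(const uint8_t *ql, const uint8_t *qh,
                       const int8_t *scales, float d, float *out)
{
    /* The super-block is processed as two halves of 128 elements. */
    for (int n = 0; n < QK_K; n += 128) {
        for (int l = 0; l < 32; ++l) {
            const int is = l / 16; /* which 16-element scale group */

            /* Assemble four 6-bit codes: low nibble from ql, 2 high bits from qh. */
            const int q1 = ((ql[l]      & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
            const int q2 = ((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
            const int q3 = ((ql[l]      >> 4)  | (((qh[l] >> 4) & 3) << 4)) - 32;
            const int q4 = ((ql[l + 32] >> 4)  | (((qh[l] >> 6) & 3) << 4)) - 32;

            /* value = d * scale * (code - 32) */
            out[l]      = d * (float)scales[is + 0] * (float)q1;
            out[l + 32] = d * (float)scales[is + 2] * (float)q2;
            out[l + 64] = d * (float)scales[is + 4] * (float)q3;
            out[l + 96] = d * (float)scales[is + 6] * (float)q4;
        }
        out    += 128;
        ql     += 64;
        qh     += 32;
        scales += 8;
    }
}
```

A matmul row would then take the dot product of the 256 dequantized values with the matching slice of the activation vector, once per super-block.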

Validation

NativeQ6KMatmulKernelParityTest (jvmTest, vs PanamaVectorQ6_KMatmulKernel) — 5/5 pass. On Apple Silicon the bundled dylib is compiled -march=armv8.2-a+fp16+dotprod, so this exercises the actual NEON code path on the host.
NativeKnQ6KMatmulKernelParityTest (linuxX64Test, vs ScalarQ6_KMatmulKernel) — compiles clean; runs on Linux/CI.
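The idea behind the parity tests can be sketched in C: an accelerated dot product (vfmaq_f32 + vaddvq_f32 when NEON is available, scalar otherwise) is compared against a plain scalar reference under a tolerance. The names dot_ref, dot_fast and within_tolerance are hypothetical, as is the tolerance rule; the actual tests are the Kotlin parity tests listed above. The extra __aarch64__ check is an assumption of this sketch, added because vaddvq_f32 is an AArch64 intrinsic.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference dot product. */
float dot_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* NEON path when available, scalar fallback otherwise (e.g. on x64). */
float dot_fast(const float *a, const float *b, size_t n)
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                 /* 4 lanes per iteration */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vaddvq_f32(acc);               /* horizontal add of the 4 lanes */
    for (; i < n; ++i)                         /* leftover tail elements */
        sum += a[i] * b[i];
    return sum;
#else
    return dot_ref(a, b, n);
#endif
}

/* Hypothetical parity rule: relative tolerance with an absolute floor of 1. */
int within_tolerance(float got, float want, float rel_tol)
{
    float diff  = got > want ? got - want : want - got;
    float scale = want < 0.0f ? -want : want;
    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_tol * scale;
}
```

A tolerance is used instead of exact equality because the vector path accumulates in a different order than the scalar loop, so results can differ in the last bits.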

Follow-up (not in this PR)

Cross-build the aarch64 archive (-PcrossArm64=true) + on-board (SL2619 Cortex-A55) tok/s measurement — needs the cross toolchain + board hardware.

🤖 Generated with Claude Code

…back format

TinyLlama's 21 Q6_K tensors fell back to scalar on the Kotlin/Native board path (only Q4_K/Q5_K/Q8_0/Q4_0 were bound). This adds the Q6_K kernel so the board (linuxArm64) NEON-accelerates them too.

- native/src/q6k_matmul.c: canonical 256-elem/210-byte Q6_K super-block (ql + qh high-2-bit plane + 16 int8 scales + FP16 d), dequant d * scale * (code - 32). Scalar 6-bit bit-assembly (exact transcription of ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K, auto-vectorized) feeding a NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON; scalar dot on x64.
- skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
- NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K route.
- NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K route.
- Parity tests: NativeQ6KMatmulKernelParityTest (jvmTest vs Panama) and NativeKnQ6KMatmulKernelParityTest (linuxX64Test vs scalar). The JVM test runs on Apple Silicon with the NEON path compiled in (macOS arm64 → -march=armv8.2-a); 5/5 pass, validating the NEON code path on the host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kotlinWasmStoreYarnLock fails build-job on a stale wasm lock (ws npm transitive drift); regenerate via kotlinWasmUpgradeYarnLock so the aggregate build-job goes green. Pre-existing failure on develop HEAD, unrelated to the Q6_K kernel; fixing on this branch to unblock #768.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

michalharakal and others added 2 commits June 26, 2026 22:16

michalharakal merged commit 27138a5 into develop Jun 27, 2026
6 of 7 checks passed

michalharakal deleted the perf/q6k-neon-kernel branch June 27, 2026 12:48

michalharakal mentioned this pull request Jun 29, 2026

release: 0.33.0 — GRU, upsample2d Bilinear export, autodiff coverage fix #775

Merged

michalharakal merged 2 commits into develop from perf/q6k-neon-kernel