perf(native-cpu): Q6_K NEON matmul kernel#768
Merged
…back format

TinyLlama's 21 Q6_K tensors fell back to scalar on the Kotlin/Native board path (only Q4_K/Q5_K/Q8_0/Q4_0 were bound). This adds the Q6_K kernel so the board (linuxArm64) NEON-accelerates them too.

- native/src/q6k_matmul.c: canonical 256-elem/210-byte Q6_K super-block (ql + qh high-2-bit plane + 16 int8 scales + FP16 d), dequant d * scale * (code - 32). Scalar 6-bit bit-assembly (exact transcription of ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K, auto-vectorized) feeding a NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON; scalar dot on x64.
- skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
- NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K route.
- NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K route.
- Parity tests: NativeQ6KMatmulKernelParityTest (jvmTest vs Panama) and NativeKnQ6KMatmulKernelParityTest (linuxX64Test vs scalar).

The JVM test runs on Apple Silicon with the NEON path compiled in (macOS arm64 → -march=armv8.2-a): 5/5 pass, validating the NEON code path on the host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
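The super-block layout and the dequant rule d * scale * (code - 32) named in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the contents of native/src/q6k_matmul.c: the helper name q6k_dequant_block and the constant Q6K_BLOCK_ELEMS are hypothetical, the element ordering follows ggml's reference dequantize_row_q6_K as best understood, and the FP16 super-block scale d is assumed to be decoded to float by the caller.

```c
#include <stdint.h>

#define Q6K_BLOCK_ELEMS 256  /* elements per super-block (assumed name) */

/*
 * Hypothetical helper, not the PR's skainet_q6k_matmul.
 * ql:     128 bytes, low 4 bits of each 6-bit code (two codes per byte)
 * qh:      64 bytes, high 2 bits of each code (four codes per byte)
 * scales:  16 int8 sub-block scales (one per 16 elements)
 * d:       super-block scale, already decoded from FP16 by the caller
 * y:       output buffer of 256 floats
 * On disk: 128 + 64 + 16 + 2 (FP16 d) = 210 bytes per super-block.
 */
static void q6k_dequant_block(const uint8_t *ql, const uint8_t *qh,
                              const int8_t *scales, float d, float *y) {
    /* Each pass handles 128 elements: 64 ql bytes, 32 qh bytes, 8 scales. */
    for (int n = 0; n < Q6K_BLOCK_ELEMS; n += 128) {
        for (int l = 0; l < 32; l++) {
            const int is = l / 16;  /* which 16-element sub-block scale */
            /* Assemble 6-bit codes: 4 low bits from ql, 2 high bits from qh. */
            const int c0 = (ql[l]      & 0x0F) | (((qh[l] >> 0) & 3) << 4);
            const int c1 = (ql[l + 32] & 0x0F) | (((qh[l] >> 2) & 3) << 4);
            const int c2 = (ql[l]      >> 4)   | (((qh[l] >> 4) & 3) << 4);
            const int c3 = (ql[l + 32] >> 4)   | (((qh[l] >> 6) & 3) << 4);
            /* Dequantize: d * scale * (code - 32). */
            y[l]      = d * (float)scales[is + 0] * (float)(c0 - 32);
            y[l + 32] = d * (float)scales[is + 2] * (float)(c1 - 32);
            y[l + 64] = d * (float)scales[is + 4] * (float)(c2 - 32);
            y[l + 96] = d * (float)scales[is + 6] * (float)(c3 - 32);
        }
        y += 128;
        ql += 64;
        qh += 32;
        scales += 8;
    }
}
```

In a matmul row, each dequantized super-block would then be dotted against the matching 256 activations; whether the real kernel materializes the full 256-float buffer or fuses the two steps is not stated in the commit message.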
kotlinWasmStoreYarnLock fails build-job on a stale wasm lock (ws npm transitive drift); regenerate via kotlinWasmUpgradeYarnLock so the aggregate build-job goes green. Pre-existing failure on develop HEAD, unrelated to the Q6_K kernel; fixing on this branch to unblock #768.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What

Adds a Q6_K NEON matmul kernel to skainet-backend-native-cpu, binding the last quant format that fell back to scalar on the Kotlin/Native board path. Q4_K / Q5_K / Q8_0 / Q4_0 were already bound; TinyLlama's 21 Q6_K tensors were the remaining scalar fallback.

How
- native/src/q6k_matmul.c: canonical 256-element / 210-byte Q6_K super-block (ql low nibbles + qh high-2-bit plane + 16 int8 scales + FP16 d), dequant d * scale * (code - 32). The 6-bit bit-assembly is an exact transcription of ScalarQ6_KMatmulKernel / ggml dequantize_row_q6_K (auto-vectorized under -O3), feeding a NEON dot product (vfmaq_f32 + vaddvq_f32) behind __ARM_NEON; scalar dot on x64.
- skainet_kernels.h + CMakeLists.txt: declare/compile skainet_q6k_matmul.
- NativeKnQ6KMatmulKernel (K/N cinterop) + NativeKnKernelProvider.matmulQ6K() route.
- NativeQ6KMatmulKernel (JVM FFM) + NativeKernelProvider.matmulQ6K() route.

Validation

- NativeQ6KMatmulKernelParityTest (jvmTest, vs PanamaVectorQ6_KMatmulKernel): 5/5 pass. On Apple Silicon the bundled dylib is compiled with -march=armv8.2-a+fp16+dotprod, so this exercises the actual NEON code path on the host.
- NativeKnQ6KMatmulKernelParityTest (linuxX64Test, vs ScalarQ6_KMatmulKernel): compiles clean; runs on Linux/CI.

Follow-up (not in this PR)

-PcrossArm64=true) + on-board (SL2619 Cortex-A55) tok/s measurement; needs the cross toolchain + board hardware.

🤖 Generated with Claude Code
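For readers unfamiliar with the pattern named under How, a NEON dot product built from vfmaq_f32 and vaddvq_f32 with a scalar fallback might look like the sketch below. The function names dot_f32 and dot_f32_scalar are hypothetical and are not taken from the PR. The extra __aarch64__ check is an assumption added here because vaddvq_f32 is only available on AArch64; the PR text itself mentions only __ARM_NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar reference, also the fallback on non-NEON targets such as x64. */
static float dot_f32_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Hypothetical dot product: NEON when compiled for AArch64, scalar otherwise. */
static float dot_f32(const float *a, const float *b, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    /* Fused multiply-add over four lanes at a time. */
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    /* Horizontal add of the four lanes, then a scalar tail. */
    float sum = vaddvq_f32(acc);
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
#else
    return dot_f32_scalar(a, b, n);
#endif
}
```

A parity check in the spirit of the tests listed under Validation would compare dot_f32 against dot_f32_scalar on the same inputs. Because the NEON path sums in a different order, a small tolerance is generally needed for arbitrary floating-point data.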