diff --git a/CHANGELOG.md b/CHANGELOG.md index 5d79cc3d..d688dc81 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,110 @@ version line is kept in lock-step with the underlying SKaiNET engine The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [0.30.0] — 2026-06-14 + +Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers +tracked the engine internally across that window (the in-progress Q5_K kernel +shipped as a local `0.29.1`) without a tagged release. The headline is +**Q5_K stays packed in the eager Gemma runtime** and the **Gemma +`NATIVE_OPTIMIZED` packed-weight path is now Kotlin/Native–ready** — the board +binary can keep K-quant weights packed without the JVM's `java.lang.foreign` +MemSeg path. + +### Added + +- **Q5_K packed in-kernel dequant in the eager Gemma runtime.** FunctionGemma-270M + ships as `Q5_K_M`, but `GemmaMemSegConverter` previously dequantized Q5_K + weights to FP32 on load ("no native matmul kernel yet for Q5_K"), giving up + both the memory saving and the in-kernel dequant. SKaiNET 0.30.0 provides a + first-class Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`: + scalar / Panama / native), so the converter now relayouts the GGUF bytes to + block-major and wraps them as `Q5_KBlockTensorData` (176 B/block). Dispatch and + the lazy transpose reach the kernel through `DefaultCpuOps`. 
Verified by + `GemmaQ5KPackedParityTest` (`-PincludeIntegration`): the Q5_K packed path + decodes FunctionGemma byte-identically to the FP32 baseline — + `[262146, 236769, 3255, 718, 498, 1373, 262152, 106]` → + `(state="on")` for *"Turn the light on."* +- **Kotlin/Native–ready Gemma packed-weight path.** The `NATIVE_OPTIMIZED` + packed conversion was `jvmMain`-only (it built `MemSeg`/`Arena`-backed tensors + via `java.lang.foreign`), so the Kotlin/Native board binary couldn't keep + K-quant weights packed. The platform-neutral pieces now live in `commonMain`: + - **`GemmaQuantLayout.kt`** (`commonMain`) — `logicalShapeFor`, + `relayoutKSeriesRowMajorToBlockMajor` (KMP-safe `copyInto`), and + `packGemmaKQuant()`, which builds heap-packed Q4_K/Q5_K/Q6_K + `BlockTensorData` directly with no `MemSeg`/`Arena`. + - **`GemmaPackedWeights.kt`** (`commonMain`) — `convertGemmaWeightsPacked` + packs Q4/Q5/Q6_K matmul weights to heap `Q*_KBlockTensorData`, dequants + `token_embd`/`output` to FP32 (gathered, no transpose) and any other quant + type to FP32 `[out, in]`. `extractRawBytes` reads the loader's bytes back + across both backings (JVM `IntArrayTensorData` / native `Byte`-typed). + - **`GemmaNetworkLoader.load()`** now runs `convertGemmaWeightsPacked` before + `applyWeightsToNetwork` under `NATIVE_OPTIMIZED`, so `load(NATIVE_OPTIMIZED)` + yields a runnable network on the board *and* the JVM (previously it could not + be built from raw-byte weights at all). `GemmaMemSegConverter` (`jvmMain`) + now shares the `commonMain` helpers; only the `MemSeg`/FFM conversion and the + FP32 fallbacks stay JVM-only. + Verified on JVM and `linuxX64` (`GemmaQuantLayoutTest`): relayout, packing, and + the native byte-extraction round-trip run on every target, and + `GemmaQ5KPackedParityTest` confirms all three paths (FP32 baseline, `jvmMain` + MemSeg-packed, `load()` packed) produce the identical token sequence. 
+ +### Changed + +- **`gradle/libs.versions.toml` `skainet` pin: 0.28.1 → 0.30.0.** Picks up the + released Q5_K packed matmul, the NEON native kernels, and the Kotlin/Native + cinterop. Downstream consumers get the upstream SKaiNET BOM transparently via + `:llm-bom`, so no per-consumer migration is needed. +- **`gradle.properties` `VERSION_NAME=0.30.0`.** Lock-step with the engine. +- **`settings.gradle.kts` reverts the `mavenLocal()`-first dev shim.** The + ordering added while consuming the in-progress local SKaiNET `0.29.1` is no + longer needed now that 0.30.0 is on Maven Central; the release resolves the + engine purely from Central. The opt-in `-PuseLocalSkainet` composite build is + unchanged for local engine work. + +### Fixed + +- **`fix(gemma): dequant kernel-less quant types in `NATIVE_OPTIMIZED` instead of + leaving raw bytes`.** Loading a Gemma GGUF whose attention/FFN weights used a + quant type with no packed SIMD kernel (e.g. Q5_1) under + `QuantPolicy.NATIVE_OPTIMIZED` crashed at the first decode step + (`Transpose requires at least 2 dimensions` in `MultiHeadAttention` → + `linearProject`): `GemmaMemSegConverter.convertOne` left every unhandled quant + type as raw 1-D bytes. Kernel-less types now dequantize to a correct FP32 + `[out, in]` weight via a new `dequantPackedToFp32` helper (mirroring the proven + `Gemma4WeightLoader.createTensor` column-major → row-major transpose). The + supported packed types (Q4_0/Q8_0/Q4_K/Q6_K) keep their fast SIMD form; only + kernel-less types pay the FP32 dequant. +- **`fix(llama): dequantize Q4_1 (and all non-packed quant types) in + `DecoderGgufMemSegConverter``.** The converter handled only Q4_0/Q8_0 (packed) + and Q4_K/Q5_K/Q6_K (dequant); every other quant type fell through an `else` + branch that logged a warning and passed the raw quant bytes through unchanged, + crashing deep inside matmul (e.g. `unsupported quant type Q4_1 for + blk.0.ffn_down.weight` on Q4_1 Qwen3 models). 
The `else` branch now routes + through `DequantOps.dequantFromBytes` to FP32, covering Q4_1, Q5_0, Q5_1, Q8_1, + IQ4_NL/XS, TQ1/2_0, etc.; genuinely unknown types now fail explicitly at load + time instead of crashing later inside matmul. Closes + [#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654). + +### Tests / CI + +- **`GemmaQ5KPackedParityTest`** — byte-identical decode parity across the FP32 + baseline, the `jvmMain` MemSeg-packed path, and the `load(NATIVE_OPTIMIZED)` + `commonMain` packed path. +- **`GemmaQuantLayoutTest`** (`commonTest`) — block-transpose relayout, packing, + and the byte-extraction round-trip; runs on JVM and `linuxX64`. +- **`DecoderGgufMemSegConverterTest`** — regression that a Q4_1 weight is + dequantized to its logical 2-D FP32 shape rather than passed through as 1-D + bytes. +- **`fix(gemma): macosArm64 target for `gemma-iree``** and CI parity fixes: + MLIR-dump tests write to a portable build dir instead of a hardcoded local + path; browser Mocha gets a 60 s timeout (parity with the engine repo). +- **`test(gemma): repoint stale FunctionGemma GGUF path`** — six real-model + integration tests now point at the in-repo + `sl2610-function-calling/models/` location, matching + `GemmaQ5KPackedParityTest`; all pass against the published SKaiNET 0.30.0 + (`-PincludeIntegration`). + ## [0.28.1] — 2026-06-06 Version-aligned with **SKaiNET 0.28.1**. Skips 0.26.x / 0.27.x — @@ -385,6 +489,8 @@ Version-aligned with **SKaiNET 0.21.0**. Last published transformers release before the engine-aligned version line. See `git log v0.16.0..0.18.0` for details. 
+[0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0 +[0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1 [0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1 [0.21.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.1 [0.21.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.0 diff --git a/README.md b/README.md index f5901dd0..a2d7681d 100644 --- a/README.md +++ b/README.md @@ -103,22 +103,21 @@ Honest status — see the project-status note at the top of this README. ## Current release -The current release is **0.28.1** — version-aligned with **SKaiNET 0.28.1**. -Skips 0.26.x / 0.27.x: SKaiNET-transformers tracked the engine internally across -that window without a tagged release. The headline is that the engine's -**Kotlin DSL → StableHLO → IREE export path is now complete** — a full gemma3 -graph traces and lowers to StableHLO that `iree-compile`s to a `vmfb` -(`GemmaMlirDumpTest` / `GemmaTraceTest` are green against 0.28.1). SKaiNET -0.28.0/0.28.1 fixed the remaining export bugs: result-type inference for -`reshape`/`matmul`/`concatenate` ([#673](https://github.com/SKaiNET-developers/SKaiNET/issues/673)) -and `conv1d`/`gather`/pooling/`flatten` shapes plus the `reduce_window` emission -form ([#675](https://github.com/SKaiNET-developers/SKaiNET/issues/675)). +The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**. +Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that +window without a tagged release. The headline is that **Q5_K weights now stay +packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K +packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now +**Kotlin/Native–ready** — the board binary can keep K-quant weights packed +without the JVM's `java.lang.foreign` MemSeg path. 
FunctionGemma-270M (`Q5_K_M`) +decodes byte-identically across the FP32 baseline and both packed paths +(`GemmaQ5KPackedParityTest`). The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place. ```kotlin dependencies { - implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1")) + implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0")) // Versions resolved from the BOM: implementation("sk.ainet.transformers:skainet-transformers-core") @@ -195,6 +194,27 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference. +## What's new in 0.30.0 + +- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to + dequantize Q5_K weights to FP32 on load; SKaiNET 0.30.0 provides a first-class + Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`), so the converter + now relayouts the GGUF bytes to block-major and keeps them packed (176 B/block). + FunctionGemma-270M (`Q5_K_M`) decodes byte-identically to the FP32 baseline + (`GemmaQ5KPackedParityTest`). +- **Gemma `NATIVE_OPTIMIZED` path is Kotlin/Native–ready.** The reusable layout + + packing helpers (`GemmaQuantLayout.kt`, `GemmaPackedWeights.kt`) moved to + `commonMain`, and `GemmaNetworkLoader.load()` now runs `convertGemmaWeightsPacked` + under `NATIVE_OPTIMIZED` — so the board binary keeps K-quant weights packed with + no `java.lang.foreign` MemSeg dependency. Verified on JVM and `linuxX64`. +- **Engine pin `skainet 0.28.1 → 0.30.0`** — released Q5_K packed matmul, NEON + native kernels, and Kotlin/Native cinterop. 
The `mavenLocal()`-first dev shim is + reverted; the release resolves the engine from Maven Central. +- **Fixes.** Kernel-less quant types under `NATIVE_OPTIMIZED` now dequant to FP32 + `[out, in]` instead of crashing on a rank-1 transpose; `DecoderGgufMemSegConverter` + dequantizes Q4_1 and every other non-packed quant type instead of passing raw + bytes through to a matmul crash ([#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654)). + ## What's new in 0.28.1 - **Engine pin `skainet 0.27.0 → 0.28.1`.** Picks up the completed Kotlin DSL → diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc index d5e51c88..87548dcf 100644 --- a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc +++ b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc @@ -25,7 +25,7 @@ In your `build.gradle.kts`: [source,kotlin] ---- dependencies { - implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1")) + implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0")) implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama") implementation("sk.ainet.transformers:skainet-transformers-agent") @@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts): <groupId>sk.ainet.transformers</groupId> <artifactId>skainet-transformers-bom</artifactId> - <version>0.28.1</version> + <version>0.30.0</version> <type>pom</type> <scope>import</scope> diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc index 710da06b..07f123c7 100644 --- a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc +++ b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc @@ -52,7 +52,7 @@ The pieces you need live in three modules: [source,kotlin] ---- dependencies { - implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1")) + implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0")) 
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama") implementation("sk.ainet.transformers:skainet-transformers-agent") diff --git a/gradle.properties b/gradle.properties index 7efd6ccd..1987d82c 100644 --- a/gradle.properties +++ b/gradle.properties @@ -1,5 +1,5 @@ GROUP=sk.ainet.transformers -VERSION_NAME=0.28.1 +VERSION_NAME=0.30.0 POM_DESCRIPTION=SKaiNET-transformers diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml index 66e7fb68..5aa078ed 100644 --- a/gradle/libs.versions.toml +++ b/gradle/libs.versions.toml @@ -1,5 +1,5 @@ [versions] -skainet = "0.28.1" +skainet = "0.30.0" agp = "9.2.1" jacksonDatabind = "2.22.0" jsonSchemaValidator = "3.0.3" diff --git a/llm-agent/api/jvm/llm-agent.api b/llm-agent/api/jvm/llm-agent.api index edde6a76..54b8610a 100644 --- a/llm-agent/api/jvm/llm-agent.api +++ b/llm-agent/api/jvm/llm-agent.api @@ -1,6 +1,6 @@ public final class sk/ainet/apps/kllama/agent/GenerateExtensionsKt { - public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;)Lsk/ainet/apps/kllama/agent/GenerateResult; - public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult; + public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;)Lsk/ainet/apps/kllama/agent/GenerateResult; + public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult; public static final fun 
sampleFromLogits (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;)I public static synthetic fun sampleFromLogits$default (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;ILjava/lang/Object;)I } @@ -45,6 +45,7 @@ public final class sk/ainet/apps/kllama/chat/AgentConfig { public abstract interface class sk/ainet/apps/kllama/chat/AgentListener { public fun onAssistantMessage (Ljava/lang/String;)V public fun onComplete (Ljava/lang/String;)V + public fun onPrefillProgress (II)V public fun onThinking (Ljava/lang/String;)V public fun onToken (Ljava/lang/String;)V public fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V @@ -55,6 +56,7 @@ public abstract interface class sk/ainet/apps/kllama/chat/AgentListener { public final class sk/ainet/apps/kllama/chat/AgentListener$DefaultImpls { public static fun onAssistantMessage (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V public static fun onComplete (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V + public static fun onPrefillProgress (Lsk/ainet/apps/kllama/chat/AgentListener;II)V public static fun onThinking (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V public static fun onToken (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V public static fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/AgentListener;Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V diff --git a/llm-core/api/jvm/llm-core.api b/llm-core/api/jvm/llm-core.api index 5d72b5a3..aecfb28d 100644 --- a/llm-core/api/jvm/llm-core.api +++ b/llm-core/api/jvm/llm-core.api @@ -543,8 +543,8 @@ public final class sk/ainet/lang/nn/dsl/ATTENTION$DefaultImpls { } public final class sk/ainet/lang/nn/dsl/AttentionImpl : sk/ainet/lang/nn/dsl/ATTENTION { - public fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;)V - public synthetic fun 
<init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;ILkotlin/jvm/internal/DefaultConstructorMarker;)V + public fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;Lkotlin/reflect/KClass;)V + public synthetic fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public final fun create ()Lsk/ainet/lang/nn/transformer/MultiHeadAttention; public fun getExecutionContext ()Lsk/ainet/context/ExecutionContext; public fun kvCache (III)V @@ -653,8 +653,8 @@ public abstract interface class sk/ainet/lang/nn/normalization/FusedRmsNormOps { } public final class sk/ainet/lang/nn/normalization/RMSNormalization : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters { - public fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;Z)V - public synthetic fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZILkotlin/jvm/internal/DefaultConstructorMarker;)V + public fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZLkotlin/reflect/KClass;)V + public synthetic fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZLkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public fun forward (Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/lang/tensor/Tensor; public fun getModules ()Ljava/util/List; public fun getName ()Ljava/lang/String; @@ -670,8 +670,8 @@ public final class sk/ainet/lang/nn/transformer/AppendKVCache : sk/ainet/lang/nn } public final class sk/ainet/lang/nn/transformer/GeGLUFFN : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters { - public fun <init> (IILjava/lang/String;)V - public synthetic fun <init> (IILjava/lang/String;ILkotlin/jvm/internal/DefaultConstructorMarker;)V + public fun <init> (IILjava/lang/String;Lkotlin/reflect/KClass;)V + public synthetic fun 
<init> (IILjava/lang/String;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public final fun getDim ()I public final fun getHiddenDim ()I public fun getModules ()Ljava/util/List; @@ -695,8 +695,8 @@ public abstract class sk/ainet/lang/nn/transformer/KVCache : sk/ainet/lang/nn/Mo public final class sk/ainet/lang/nn/transformer/LayerScalarMul : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters { public fun <init> ()V - public fun <init> (Ljava/lang/String;)V - public synthetic fun <init> (Ljava/lang/String;ILkotlin/jvm/internal/DefaultConstructorMarker;)V + public fun <init> (Ljava/lang/String;Lkotlin/reflect/KClass;)V + public synthetic fun <init> (Ljava/lang/String;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public fun getModules ()Ljava/util/List; public fun getName ()Ljava/lang/String; public fun getParams ()Ljava/util/List; @@ -707,8 +707,8 @@ public final class sk/ainet/lang/nn/transformer/LinearProjectionKt { } public final class sk/ainet/lang/nn/transformer/MultiHeadAttention : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters { - public fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;)V - public synthetic fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;ILkotlin/jvm/internal/DefaultConstructorMarker;)V + public fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;Lkotlin/reflect/KClass;)V + public synthetic fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public final fun forward 
(Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/lang/tensor/Tensor; public final fun getAttentionScale ()Ljava/lang/Float; public final fun getBias ()Z @@ -847,7 +847,8 @@ public final class sk/ainet/lang/nn/transformer/SwiGLUFFN : sk/ainet/lang/nn/Mod } public final class sk/ainet/lang/nn/transformer/VoidDense : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters { - public fun <init> (Ljava/lang/String;II)V + public fun <init> (Ljava/lang/String;IILkotlin/reflect/KClass;)V + public synthetic fun <init> (Ljava/lang/String;IILkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V public final fun getInDim ()I public fun getModules ()Ljava/util/List; public fun getName ()Ljava/lang/String; diff --git a/llm-inference/gemma/api/jvm/gemma.api b/llm-inference/gemma/api/jvm/gemma.api index 57fcd67f..4483f8cd 100644 --- a/llm-inference/gemma/api/jvm/gemma.api +++ b/llm-inference/gemma/api/jvm/gemma.api @@ -865,6 +865,10 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt { public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module; } +public final class sk/ainet/models/gemma/GemmaPackedWeightsKt { + public static final fun convertGemmaWeightsPacked (Lsk/ainet/models/gemma/Gemma4Weights;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/models/gemma/Gemma4Weights; +} + public final class sk/ainet/models/gemma/GemmaPerLayerTokenEmbedTensorData : sk/ainet/lang/tensor/data/TensorData, sk/ainet/models/gemma/RowDequantSource { public fun <init> (Lsk/ainet/lang/tensor/Shape;Lsk/ainet/io/gguf/GGMLQuantizationType;[B)V public fun copyToFloatArray ()[F diff --git a/llm-inference/gemma/build.gradle.kts b/llm-inference/gemma/build.gradle.kts index 24ea30d7..f541c944 100644 --- a/llm-inference/gemma/build.gradle.kts +++ b/llm-inference/gemma/build.gradle.kts @@ -88,9 +88,16 @@ kotlin { } } +// 
Real-model (FunctionGemma-270M) integration tests (run with -PincludeIntegration) +// dequantize ~270M params to FP32, and GemmaQ5KPackedParityTest holds the FP32 +// baseline plus both packed decode networks at once; the bake-to-irpa test holds +// weights + serialized bytes simultaneously. 8g OOMs once the real model is +// present, so default to 12g — override via -PgemmaTestMaxHeap (CI without the +// model file self-skips these and never needs the headroom). tasks.withType<Test>().configureEach { jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector") - maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "6g" + maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "12g" + (findProperty("seqLen") as? String)?.let { systemProperty("seqLen", it) } } // Kotlin/JS + Kotlin/WASM browser test runners have two separate problems on @@ -109,11 +116,3 @@ tasks.matching { it.name == "jsBrowserTest" || it.name == "wasmJsBrowserTest" }. ?.failOnNoDiscoveredTests = false enabled = includeBrowserTests } - -// Real-model (FunctionGemma-270M) tests dequantize ~270M params to FP32 and the -// bake-to-irpa test holds weights + serialized bytes simultaneously; allow an -// override via -PgemmaTestMaxHeap (default 8g). -tasks.withType<Test>().configureEach { - maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "8g" - (findProperty("seqLen") as? 
String)?.let { systemProperty("seqLen", it) } -} diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt index f73b3ac8..abc8c3e3 100644 --- a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt +++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt @@ -122,7 +122,7 @@ public class GemmaNetworkLoader @PublishedApi internal constructor( public suspend inline fun load( ctx: ExecutionContext ): Module { - val weights: Gemma4Weights = when (val wp = weightsProvider) { + val rawWeights: Gemma4Weights = when (val wp = weightsProvider) { is WeightsProvider.GgufSource -> { val loader = Gemma4WeightLoader(wp.sourceProvider, quantPolicy = wp.quantPolicy) loader.loadToMap(ctx) @@ -142,6 +142,24 @@ public class GemmaNetworkLoader @PublishedApi internal constructor( } } + // NATIVE_OPTIMIZED yields raw-byte quant tensors the network mapper can't + // consume directly. Pack them (heap Q4/5/6_K + FP32 fallback) here — this + // is commonMain so it works on Kotlin/Native (the board) as well as the + // JVM, and replaces the JVM-only `convertGemmaWeightsToMemSeg` for the + // `load()` entry point. 
+ val ggufPolicy = when (val wp = weightsProvider) { + is WeightsProvider.GgufSource -> wp.quantPolicy + is WeightsProvider.GgufRandomAccess -> wp.quantPolicy + else -> null + } + val weights: Gemma4Weights = + if (ggufPolicy == QuantPolicy.NATIVE_OPTIMIZED) { + @Suppress("UNCHECKED_CAST") + convertGemmaWeightsPacked(rawWeights, ctx) as Gemma4Weights + } else { + rawWeights + } + return applyWeightsToNetwork(ctx, weights) } diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt new file mode 100644 index 00000000..ec52eb4c --- /dev/null +++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt @@ -0,0 +1,125 @@ +package sk.ainet.models.gemma + +import sk.ainet.context.ExecutionContext +import sk.ainet.io.gguf.GGMLQuantizationType +import sk.ainet.io.gguf.dequant.DequantOps +import sk.ainet.lang.tensor.Shape +import sk.ainet.lang.tensor.Tensor +import sk.ainet.lang.tensor.data.IntArrayTensorData +import sk.ainet.lang.tensor.data.TensorData +import sk.ainet.lang.types.DType +import sk.ainet.lang.types.FP32 + +/** + * commonMain (Kotlin/Native-capable) analogue of the jvmMain + * `convertGemmaWeightsToMemSeg`. Converts the raw-byte quantized tensors a + * `NATIVE_OPTIMIZED` load produces into the forms the DSL matmul path consumes: + * + * - **Q4_K / Q5_K / Q6_K matmul weights** → heap-packed `Q{4,5,6}_KBlockTensorData` + * (via [packGemmaKQuant], with the row-major→block-major relayout). These keep + * the GGUF footprint and run the in-kernel dequant matmul (NEON on the board). + * - **token_embd / output** → FP32 dequant in canonical `[vocab, embed]` order + * (the embedding is gathered, not matmul'd, so no transpose). + * - **everything else quantized** → FP32 dequant transposed to `[out, in]` + * row-major so `linearProject` (`x @ W.t()`) is correct. 
+ * + * Unlike the MemSeg converter this uses no `java.lang.foreign` — it runs on the + * SL2610 board binary (Kotlin/Native) as well as the JVM. The JVM still prefers + * the MemSeg path (lazy transpose + Q4/Q8 MemSeg); this is the board path. + */ +public fun convertGemmaWeightsPacked( + weights: Gemma4Weights<*, *>, + ctx: ExecutionContext, +): Gemma4Weights<*, *> { + @Suppress("UNCHECKED_CAST") + val typed = weights as Gemma4Weights + val quantTypes = typed.quantTypes + if (quantTypes.isEmpty()) return weights + + val logicalShapes = typed.logicalShapes + val newTensors = linkedMapOf>() + for ((name, tensor) in typed.tensors) { + val qt = quantTypes[name] + newTensors[name] = when { + qt == null -> tensor // not quantized + else -> { + val shape = logicalShapes[name] ?: logicalShapeFor(name, typed.metadata) + if (shape == null) { + tensor // unknown 2-D layout — leave as-is + } else { + val bytes = extractRawBytes(tensor.data) + val isEmbed = name == Gemma4TensorNames.TOKEN_EMBEDDINGS || + name == Gemma4TensorNames.OUTPUT_WEIGHT + val packed = if (!isEmbed) packGemmaKQuant(bytes, qt, shape) else null + when { + packed != null -> { + @Suppress("UNCHECKED_CAST") + ctx.fromData(packed as TensorData, FP32::class) as Tensor + } + isEmbed -> dequantNoTranspose(bytes, qt, shape, ctx) + else -> dequantTransposed(bytes, qt, shape, ctx) + } + } + } + } + } + @Suppress("UNCHECKED_CAST") + return Gemma4Weights(typed.metadata, newTensors, typed.quantTypes, typed.logicalShapes) as Gemma4Weights<*, *> +} + +/** Dequant to FP32 in natural `[rows, cols]` order (embeddings — gathered, not matmul'd). */ +@Suppress("UNCHECKED_CAST") +private fun dequantNoTranspose( + bytes: ByteArray, + qt: GGMLQuantizationType, + shape: Shape, + ctx: ExecutionContext, +): Tensor { + val floats = DequantOps.dequantFromBytes(bytes, qt, shape.volume) + return ctx.fromFloatArray(shape, FP32::class, floats) as Tensor +} + +/** + * Dequant to a canonical FP32 `[out, in]` row-major weight. 
GGUF stores K/legacy + * blocks column-major within a row, so the dequantized floats are transposed + * column-major → row-major to match what `linearProject` (`x @ W.t()`) expects. + */ +@Suppress("UNCHECKED_CAST") +private fun dequantTransposed( + bytes: ByteArray, + qt: GGMLQuantizationType, + shape: Shape, + ctx: ExecutionContext, +): Tensor { + val floats = DequantOps.dequantFromBytes(bytes, qt, shape.volume) + val out = shape[0] + val inDim = shape[1] + val rowMajor = DequantOps.transposeColumnMajorToRowMajor(floats, inDim, out) + return ctx.fromFloatArray(shape, FP32::class, rowMajor) as Tensor +} + +/** + * Read the raw packed bytes back from a `NATIVE_OPTIMIZED` quant tensor. The + * backing differs by platform/factory — JVM stores `IntArrayTensorData` (byte + * values widened to Int); Kotlin/Native stores a Byte-typed tensor — so handle + * both element types. + */ +internal fun extractRawBytes(data: TensorData<*, *>): ByteArray { + if (data is IntArrayTensorData<*>) { + val buf = data.buffer + return ByteArray(buf.size) { buf[it].toByte() } + } + val n = data.shape.volume + @Suppress("UNCHECKED_CAST") + val d = data as TensorData<*, Any?> + return ByteArray(n) { + when (val v = d[it]) { + is Byte -> v + is Int -> v.toByte() + else -> error( + "convertGemmaWeightsPacked: cannot read bytes from ${data::class.simpleName} " + + "(element ${v?.let { e -> e::class.simpleName }})", + ) + } + } +} diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt new file mode 100644 index 00000000..7f4e7b9f --- /dev/null +++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt @@ -0,0 +1,121 @@ +package sk.ainet.models.gemma + +import sk.ainet.io.gguf.GGMLQuantizationType +import sk.ainet.lang.tensor.Shape +import sk.ainet.lang.tensor.data.Q4_KBlockTensorData +import sk.ainet.lang.tensor.data.Q5_KBlockTensorData 
+import sk.ainet.lang.tensor.data.Q6_KBlockTensorData
+import sk.ainet.lang.tensor.data.TensorData
+import sk.ainet.lang.types.DType
+
+/**
+ * Platform-neutral (commonMain) layout helpers for Gemma 4 quantized weights.
+ *
+ * These were previously JVM-only (inside `GemmaMemSegConverter`), but the
+ * Kotlin/Native board path needs the same logic: on K/N there is no
+ * `java.lang.foreign` MemSeg conversion, so the eager runtime keeps K-quant
+ * weights as heap-packed `Q{4,5,6}_KBlockTensorData` produced here. The JVM
+ * MemSeg converter reuses the same relayout + shape recovery.
+ */
+
+/**
+ * Recover the logical 2-D shape of a Gemma 4 weight tensor from its GGUF name
+ * and model metadata. `Gemma4WeightLoader` with `NATIVE_OPTIMIZED` stores
+ * quantized tensors as 1-D byte arrays, so converters need the original
+ * `[rows, cols]` shape to re-layout blocks. Returns `null` for tensors without
+ * a 2-D matmul layout (norms, embeddings the converter dequantizes anyway).
+ */
+internal fun logicalShapeFor(name: String, metadata: Gemma4ModelMetadata): Shape? {
+    val embed = metadata.embeddingLength
+    val vocab = metadata.vocabSize
+    return when {
+        name == Gemma4TensorNames.TOKEN_EMBEDDINGS -> Shape(vocab, embed)
+        name == Gemma4TensorNames.OUTPUT_WEIGHT -> Shape(vocab, embed)
+        name.startsWith("blk.") -> {
+            val rest = name.substringAfter("blk.")
+            val layer = rest.substringBefore('.').toIntOrNull() ?: return null
+            val headDim = metadata.getHeadDim(layer)
+            val qDim = metadata.headCount * headDim
+            val kvDim = metadata.kvHeadCount * headDim
+            val ffn = metadata.intermediateSize
+            when {
+                name.endsWith(".attn_q.weight") -> Shape(qDim, embed)
+                name.endsWith(".attn_k.weight") -> Shape(kvDim, embed)
+                name.endsWith(".attn_v.weight") -> Shape(kvDim, embed)
+                name.endsWith(".attn_output.weight") -> Shape(embed, qDim)
+                name.endsWith(".ffn_gate.weight") -> Shape(ffn, embed)
+                name.endsWith(".ffn_up.weight") -> Shape(ffn, embed)
+                name.endsWith(".ffn_down.weight") -> Shape(embed, ffn)
+                else -> null
+            }
+        }
+        else -> null
+    }
+}
+
+/**
+ * Re-layout GGUF K-series bytes from row-major block order
+ * (`(r * blocksPerRow + b) * bytesPerBlock`) to the input-block-major order the
+ * `matmulQ{K}` kernels expect (`(b * outDim + r) * bytesPerBlock`). For a
+ * `[outDim, inDim]` weight with `inDim % 256 == 0`, this is a block-level 2-D
+ * transpose; bytes inside a block are untouched.
+ *
+ * @param bytesPerBlock 144 (Q4_K), 176 (Q5_K), 210 (Q6_K).
+ */
+internal fun relayoutKSeriesRowMajorToBlockMajor(
+    bytes: ByteArray,
+    shape: Shape,
+    bytesPerBlock: Int,
+): ByteArray {
+    val blockSize = 256
+    require(shape.rank == 2) { "K-series weight must be 2D, got rank ${shape.rank}" }
+    val outDim = shape[0]
+    val inDim = shape[1]
+    require(inDim % blockSize == 0) { "K-series weight inDim ($inDim) must be a multiple of $blockSize" }
+    val blocksPerRow = inDim / blockSize
+    val expected = outDim.toLong() * blocksPerRow.toLong() * bytesPerBlock.toLong()
+    require(bytes.size.toLong() >= expected) {
+        "K-series byte buffer ${bytes.size} < expected $expected for [$outDim, $inDim] @ ${bytesPerBlock}B/block"
+    }
+    val out = ByteArray(bytes.size)
+    for (r in 0 until outDim) {
+        for (b in 0 until blocksPerRow) {
+            val srcOff = (r * blocksPerRow + b) * bytesPerBlock
+            val dstOff = (b * outDim + r) * bytesPerBlock
+            bytes.copyInto(out, dstOff, srcOff, srcOff + bytesPerBlock)
+        }
+    }
+    return out
+}
+
+/** Bytes per ggml block for the K-quant types this packer handles. */
+private fun kQuantBytesPerBlock(qt: GGMLQuantizationType): Int? = when (qt) {
+    GGMLQuantizationType.Q4_K -> 144
+    GGMLQuantizationType.Q5_K -> 176
+    GGMLQuantizationType.Q6_K -> 210
+    else -> null
+}
+
+/**
+ * Pack raw GGUF K-quant `bytes` of logical `[out, in]` shape into the
+ * heap-packed block tensor data the matmul kernels read directly (Q4_K / Q5_K /
+ * Q6_K). Performs the row-major → block-major relayout. Returns `null` for
+ * non-K-quant types (caller dequantizes those to FP32).
+ *
+ * commonMain → works on JVM and Kotlin/Native alike (no MemSeg / Arena).
+ */
+internal fun packGemmaKQuant(
+    bytes: ByteArray,
+    qt: GGMLQuantizationType,
+    shape: Shape,
+): TensorData? {
+    val bpb = kQuantBytesPerBlock(qt) ?: return null
+    val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, bpb)
+    @Suppress("UNCHECKED_CAST")
+    return when (qt) {
+        GGMLQuantizationType.Q4_K -> Q4_KBlockTensorData(shape, relaid) as TensorData
+        GGMLQuantizationType.Q5_K -> Q5_KBlockTensorData(shape, relaid) as TensorData
+        GGMLQuantizationType.Q6_K -> Q6_KBlockTensorData(shape, relaid) as TensorData
+        else -> null
+    }
+}
diff --git a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
new file mode 100644
index 00000000..52a1cdd1
--- /dev/null
+++ b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
@@ -0,0 +1,73 @@
+package sk.ainet.models.gemma
+
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNull
+import kotlin.test.assertTrue
+import sk.ainet.context.DirectCpuExecutionContext
+import sk.ainet.io.gguf.GGMLQuantizationType
+import sk.ainet.lang.tensor.Shape
+import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.types.FP32
+import sk.ainet.lang.types.Int8
+
+/**
+ * Unit tests for the commonMain (board-shareable) Gemma quant layout helpers.
+ * These run on every target (JVM + Kotlin/Native), proving the K/N board path's
+ * relayout + packing logic without needing the full model.
+ */
+class GemmaQuantLayoutTest {
+
+    @Test
+    fun relayout_is_block_level_transpose() {
+        // [outDim=2, inDim=512] -> blocksPerRow=2, 4 Q5_K blocks of 176 B.
+        val bpb = 176
+        val outDim = 2
+        val inDim = 512
+        val blocksPerRow = inDim / 256
+        val bytes = ByteArray(outDim * blocksPerRow * bpb)
+        // Tag each source block with its row-major index in its first byte.
+        for (i in 0 until outDim * blocksPerRow) bytes[i * bpb] = i.toByte()
+
+        val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, Shape(outDim, inDim), bpb)
+
+        // dst block (b*outDim + r) must hold src block (r*blocksPerRow + b).
+        for (r in 0 until outDim) {
+            for (b in 0 until blocksPerRow) {
+                val srcIdx = r * blocksPerRow + b
+                val dstIdx = b * outDim + r
+                assertEquals(srcIdx.toByte(), relaid[dstIdx * bpb], "block ($r,$b) misplaced")
+            }
+        }
+    }
+
+    @Test
+    fun pack_q5k_produces_block_tensor_with_relaid_bytes() {
+        val shape = Shape(2, 512)
+        val bytes = ByteArray(2 * 2 * 176)
+        for (i in 0 until 4) bytes[i * 176] = (i + 1).toByte()
+
+        val td = packGemmaKQuant(bytes, GGMLQuantizationType.Q5_K, shape)
+        assertTrue(td is Q5_KBlockTensorData, "Q5_K should pack to Q5_KBlockTensorData")
+        // packedData is the block-major relayout of the input.
+        val expected = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 176)
+        assertTrue(expected.contentEquals(td.packedData))
+    }
+
+    @Test
+    fun pack_non_kquant_returns_null() {
+        assertNull(packGemmaKQuant(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
+    }
+
+    @Test
+    fun extract_raw_bytes_roundtrips_on_every_platform() {
+        // The NATIVE_OPTIMIZED loader wraps quant bytes via ctx.fromByteArray;
+        // extractRawBytes must read them back regardless of the platform backing
+        // (JVM IntArrayTensorData vs native Byte-typed). Runs on jvm + linuxX64.
+        val ctx = DirectCpuExecutionContext.create()
+        val bytes = ByteArray(176 * 3) { ((it * 31 + 7) and 0xFF).toByte() }
+        val t = ctx.fromByteArray(Shape(bytes.size), Int8::class, bytes)
+        val got = extractRawBytes(t.data)
+        assertTrue(bytes.contentEquals(got), "extractRawBytes round-trip mismatch")
+    }
+}
diff --git a/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt b/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
index d3a4502f..191f2510 100644
--- a/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
+++ b/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q6_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
@@ -15,44 +16,9 @@ import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.DType
 import sk.ainet.lang.types.FP32
-/**
- * Recover the logical 2-D shape of a Gemma 4 weight tensor from its GGUF
- * name and the model metadata. `Gemma4WeightLoader` with
- * `NATIVE_OPTIMIZED` stores quantized tensors as 1-D byte arrays so the
- * tensor-data factory accepts them; the converter needs the original
- * shape to re-layout blocks and construct `Q4_KBlockTensorData` /
- * `Q4/Q8MemorySegmentTensorData`.
- *
- * Returns `null` for tensors that don't have a 2-D matmul layout (norms,
- * embeddings the converter wants to dequant anyway).
- */
-internal fun logicalShapeFor(name: String, metadata: Gemma4ModelMetadata): Shape? {
-    val embed = metadata.embeddingLength
-    val vocab = metadata.vocabSize
-    return when {
-        name == Gemma4TensorNames.TOKEN_EMBEDDINGS -> Shape(vocab, embed)
-        name == Gemma4TensorNames.OUTPUT_WEIGHT -> Shape(vocab, embed)
-        name.startsWith("blk.") -> {
-            val rest = name.substringAfter("blk.")
-            val layer = rest.substringBefore('.').toIntOrNull() ?: return null
-            val headDim = metadata.getHeadDim(layer)
-            val qDim = metadata.headCount * headDim
-            val kvDim = metadata.kvHeadCount * headDim
-            val ffn = metadata.intermediateSize
-            when {
-                name.endsWith(".attn_q.weight") -> Shape(qDim, embed)
-                name.endsWith(".attn_k.weight") -> Shape(kvDim, embed)
-                name.endsWith(".attn_v.weight") -> Shape(kvDim, embed)
-                name.endsWith(".attn_output.weight") -> Shape(embed, qDim)
-                name.endsWith(".ffn_gate.weight") -> Shape(ffn, embed)
-                name.endsWith(".ffn_up.weight") -> Shape(ffn, embed)
-                name.endsWith(".ffn_down.weight") -> Shape(embed, ffn)
-                else -> null
-            }
-        }
-        else -> null
-    }
-}
+// logicalShapeFor + relayoutKSeriesRowMajorToBlockMajor moved to commonMain
+// (GemmaQuantLayout.kt) so the Kotlin/Native board path shares them. This
+// JVM-only file keeps the MemSeg (FFM) conversion + the FP32 dequant fallbacks.
 /**
  * Convert raw-byte quantized tensors in a [Gemma4Weights] map (produced by
@@ -197,8 +163,14 @@ private fun convertOne(
             ctx.fromData(data as TensorData, advertisedDtype) as Tensor
         }
         GGMLQuantizationType.Q5_K -> {
-            // No native matmul kernel yet for Q5_K. Fall back to a correct FP32 dequant.
-            dequantPackedToFp32(bytes, qt, shape, ctx)
+            // Same packed-path treatment as Q4_K/Q6_K, enabled by the Q5_K
+            // matmul kernel (scalar/Panama/native) + the lazy Q5_K transpose
+            // in DefaultCpuOps. FunctionGemma-270M Q5_K_M ships most attn/FFN
+            // weights as Q5_K, so keeping them packed (176 B/block) avoids the
+            // FP32 inflation and runs the in-kernel dequant matmul.
+            val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 176)
+            val data = Q5_KBlockTensorData.fromRawBytes(shape, relaid)
+            ctx.fromData(data as TensorData, advertisedDtype) as Tensor
         }
         else -> {
             // Any other quant type without a packed SIMD kernel (Q5_0/Q5_1/Q4_1/Q2_K/…)
@@ -280,53 +252,9 @@ private fun dequantToFloat(
 }
 /**
- * Re-layout GGUF K-series bytes from row-major block order (block at row r,
- * block index b within row → byte offset `(r * blocksPerRow + b) * bytesPerBlock`)
- * to the input-block-major layout the `matmulQ{K}_Vec` kernels expect
- * (block at blockIdx bI for output row r → byte offset
- * `(bI * outDim + r) * bytesPerBlock`).
- *
- * For a weight of shape `[outDim, inDim]` with `inDim % 256 == 0` (the
- * K-series block size), this is just a 2D block-level transpose of the
- * `[outDim, inDim/256]` array of `bytesPerBlock`-byte blocks. Bytes
- * inside a block are untouched.
- *
- * @param bytes packed weight bytes in row-major [outDim, blocksPerRow] order
- * @param shape logical `[outDim, inDim]` shape
- * @param bytesPerBlock 144 for Q4_K, 210 for Q6_K (ggml block sizes)
- */
-internal fun relayoutKSeriesRowMajorToBlockMajor(
-    bytes: ByteArray,
-    shape: sk.ainet.lang.tensor.Shape,
-    bytesPerBlock: Int
-): ByteArray {
-    val blockSize = 256
-    require(shape.rank == 2) { "K-series weight must be 2D, got rank ${shape.rank}" }
-    val outDim = shape[0]
-    val inDim = shape[1]
-    require(inDim % blockSize == 0) {
-        "K-series weight inDim ($inDim) must be a multiple of $blockSize"
-    }
-    val blocksPerRow = inDim / blockSize
-    val expected = outDim.toLong() * blocksPerRow.toLong() * bytesPerBlock.toLong()
-    require(bytes.size.toLong() >= expected) {
-        "K-series byte buffer size ${bytes.size} < expected $expected for shape [$outDim, $inDim] @ ${bytesPerBlock}B/block"
-    }
-    val out = ByteArray(bytes.size)
-    for (r in 0 until outDim) {
-        for (b in 0 until blocksPerRow) {
-            val srcOff = (r * blocksPerRow + b) * bytesPerBlock
-            val dstOff = (b * outDim + r) * bytesPerBlock
-            System.arraycopy(bytes, srcOff, out, dstOff, bytesPerBlock)
-        }
-    }
-    return out
-}
-
-/**
- * Back-compat shim that delegates to [relayoutKSeriesRowMajorToBlockMajor]
- * at Q4_K's 144-byte block size. Kept for any callers outside this file
- * pinned to the old name.
+ * Back-compat shim that delegates to the commonMain
+ * [relayoutKSeriesRowMajorToBlockMajor] at Q4_K's 144-byte block size. Kept for
+ * any callers outside this file pinned to the old name.
  */
 internal fun relayoutQ4_KRowMajorToBlockMajor(bytes: ByteArray, shape: sk.ainet.lang.tensor.Shape): ByteArray =
     relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 144)
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
index 406197c6..3f938609 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
@@ -31,7 +31,7 @@ import kotlin.test.assertEquals
  */
 @Tag("integration")
 class GemmaBehavioralAbTest {
-    private val gguf = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+    private val gguf = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
     private fun argmax(a: FloatArray): Int {
         var bi = 0; var bv = a[0]
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt
new file mode 100644
index 00000000..1d4a7ad4
--- /dev/null
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt
@@ -0,0 +1,142 @@
+package sk.ainet.models.gemma
+
+import java.io.File
+import java.lang.foreign.Arena
+import kotlinx.coroutines.runBlocking
+import kotlinx.io.buffered
+import kotlinx.io.files.Path
+import kotlinx.io.files.SystemFileSystem
+import org.junit.jupiter.api.Assumptions
+import org.junit.jupiter.api.Tag
+import sk.ainet.apps.llm.OptimizedLLMMode
+import sk.ainet.apps.llm.OptimizedLLMRuntime
+import sk.ainet.apps.llm.tokenizer.GGUFTokenizer
+import sk.ainet.context.DirectCpuExecutionContext
+import sk.ainet.io.JvmRandomAccessSource
+import sk.ainet.io.model.QuantPolicy
+import sk.ainet.lang.types.FP32
+import kotlin.test.Test
+import kotlin.test.assertEquals
+
+/**
+ * End-to-end check that the NEW Q5_K packed in-kernel dequant path (upstream
+ * SKaiNET `Q5_KBlockTensorData` + `Q5KMatmulKernel`, wired here via
+ * [convertGemmaWeightsToMemSeg]) decodes FunctionGemma-270M (`Q5_K_M`)
+ * identically to the FP32-dequant baseline, and reports tokens/sec.
+ *
+ * Before this, the converter dequantized Q5_K weights to FP32 on load ("no
+ * native matmul kernel yet for Q5_K"). Now Q5_K stays packed (176 B/block)
+ * and runs the in-kernel dequant matmul. Both paths decode the same weights,
+ * so greedy argmax token sequences must match.
+ *
+ * Skips when the GGUF isn't present (CI without the checkpoint).
+ */
+@Tag("integration")
+class GemmaQ5KPackedParityTest {
+
+    private val gguf =
+        "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+
+    private fun argmax(a: FloatArray): Int {
+        var bi = 0; var bv = a[0]
+        for (i in 1 until a.size) if (a[i] > bv) { bv = a[i]; bi = i }
+        return bi
+    }
+
+    private fun buildPrompt(u: String) =
+        "user\n$u\nmodel\n"
+
+    private fun decode(
+        runtime: OptimizedLLMRuntime,
+        promptTokens: List<Int>,
+        maxNew: Int,
+        eos: Int,
+        eot: Int,
+    ): List<Int> {
+        runtime.reset()
+        var logits = FloatArray(0)
+        for (t in promptTokens) logits = runtime.forward(t).data.copyToFloatArray()
+        val gen = mutableListOf<Int>()
+        while (gen.size < maxNew) {
+            val next = argmax(logits)
+            gen.add(next)
+            if (next == eos || next == eot) break
+            logits = runtime.forward(next).data.copyToFloatArray()
+        }
+        return gen
+    }
+
+    @Test
+    fun q5kPackedMatchesFp32() = runBlocking {
+        Assumptions.assumeTrue(File(gguf).exists(), "FunctionGemma GGUF not present — skipping")
+
+        val ctx = DirectCpuExecutionContext.create()
+        val tokenizer = GGUFTokenizer.fromSource(SystemFileSystem.source(Path(gguf)).buffered())
+        val eot = tokenizer.encode("").single()
+        val eos = tokenizer.eosTokenId
+        val promptTokens =
+            listOf(tokenizer.bosTokenId) + tokenizer.encode(buildPrompt("Turn the light on.")).toList()
+        val maxNew = 12
+
+        // --- FP32 dequant-on-load baseline ---
+        val wFp32 = Gemma4WeightLoader(
+            randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+            quantPolicy = QuantPolicy.DEQUANTIZE_TO_FP32,
+        ).loadToMapStreaming(ctx, FP32::class)
+        val mFp32 = GemmaNetworkLoader.fromWeights(ctx, wFp32, FP32::class)
+        val rtFp32 = OptimizedLLMRuntime(
+            model = mFp32, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+            dtype = FP32::class, bos = tokenizer.bosTokenId,
+        )
+        val genFp32 = decode(rtFp32, promptTokens, maxNew, eos, eot)
+
+        // --- Q5_K packed in-kernel dequant path (NATIVE_OPTIMIZED + convert) ---
+        Arena.ofConfined().use { arena ->
+            val wNat = Gemma4WeightLoader(
+                randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+                quantPolicy = QuantPolicy.NATIVE_OPTIMIZED,
+            ).loadToMapStreaming(ctx, FP32::class)
+            val wConv = convertGemmaWeightsToMemSeg(wNat, ctx, arena)
+            @Suppress("UNCHECKED_CAST")
+            val mNat = GemmaNetworkLoader.fromWeights(
+                ctx, wConv as Gemma4Weights, FP32::class,
+            )
+            val rtNat = OptimizedLLMRuntime(
+                model = mNat, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+                dtype = FP32::class, bos = tokenizer.bosTokenId,
+            )
+
+            // Warmup one decode (JIT + kernel-provider resolution), then time.
+            decode(rtNat, promptTokens, 2, eos, eot)
+            val t0 = System.nanoTime()
+            val genNat = decode(rtNat, promptTokens, maxNew, eos, eot)
+            val ms = (System.nanoTime() - t0) / 1e6
+            val toks = genNat.size + promptTokens.size
+
+            println("Q5K-packed gen=$genNat")
+            println("FP32-base gen=$genFp32")
+            println("Q5K decoded='${tokenizer.decode(genNat.toIntArray()).replace("\n", "\\n")}'")
+            println(
+                "Q5K-packed throughput: $toks tok in ${"%.0f".format(ms)} ms " +
+                    "(${"%.2f".format(toks * 1000.0 / ms)} tok/s incl. prefill)",
+            )
+
+            assertEquals(genFp32, genNat, "Q5_K packed decode diverged from FP32 baseline")
+        }
+
+        // The wired path: GemmaNetworkLoader.load(NATIVE_OPTIMIZED) applies the
+        // commonMain convertGemmaWeightsPacked (the board path) — no MemSeg, no
+        // Arena. Must decode identically to the FP32 baseline too.
+        val mLoad = GemmaNetworkLoader.fromGguf(
+            randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+            quantPolicy = QuantPolicy.NATIVE_OPTIMIZED,
+        ).load(ctx)
+        val rtLoad = OptimizedLLMRuntime(
+            model = mLoad, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+            dtype = FP32::class, bos = tokenizer.bosTokenId,
+        )
+        val genLoad = decode(rtLoad, promptTokens, maxNew, eos, eot)
+        println("load(NATIVE_OPTIMIZED) gen=$genLoad")
+        assertEquals(genFp32, genLoad, "load(NATIVE_OPTIMIZED) packed decode diverged from FP32 baseline")
+    }
+}
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
index 227fb351..59ddc216 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
@@ -35,7 +35,7 @@ import kotlin.test.Test
 class RealGemmaBakeIrpaTest {
     @Test
     fun bakeRealGemmaToIrpa() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
index cbd6ebf8..af3c5e01 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
@@ -17,7 +17,7 @@ import kotlin.test.Test
 class RealGemmaDequantDumpTest {
     @Test
     fun dumpDequant() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
index 3bfccce1..f0037477 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
@@ -24,7 +24,7 @@ import kotlin.test.Test
 class RealGemmaEagerAbTest {
     @Test
     fun eagerLogits() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
index f90bda23..019dcd86 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
@@ -32,7 +32,7 @@ import kotlin.test.Test
 class RealGemmaExternalParamTest {
     @Test
     fun externalizeRealGemmaWeights() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
index 28952531..2905da6c 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
@@ -21,7 +21,7 @@ import kotlin.test.Test
 class RealGemmaLoadTest {
     @Test
     fun loadFunctionGemmaWeights() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val loader = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },