diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5d79cc3d..d688dc81 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,110 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.30.0] — 2026-06-14
+
+Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
+tracked the engine internally across that window (the in-progress Q5_K kernel
+shipped as a local `0.29.1`) without a tagged release. The headline is
+**Q5_K stays packed in the eager Gemma runtime** and the **Gemma
+`NATIVE_OPTIMIZED` packed-weight path is now Kotlin/Native–ready** — the board
+binary can keep K-quant weights packed without the JVM's `java.lang.foreign`
+MemSeg path.
+
+### Added
+
+- **Q5_K packed in-kernel dequant in the eager Gemma runtime.** FunctionGemma-270M
+  ships as `Q5_K_M`, but `GemmaMemSegConverter` previously dequantized Q5_K
+  weights to FP32 on load ("no native matmul kernel yet for Q5_K"), giving up
+  both the memory saving and the in-kernel dequant. SKaiNET 0.30.0 provides a
+  first-class Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`:
+  scalar / Panama / native), so the converter now relayouts the GGUF bytes to
+  block-major and wraps them as `Q5_KBlockTensorData` (176 B/block). Dispatch and
+  the lazy transpose reach the kernel through `DefaultCpuOps`. Verified by
+  `GemmaQ5KPackedParityTest` (`-PincludeIntegration`): the Q5_K packed path
+  decodes FunctionGemma byte-identically to the FP32 baseline —
+  `[262146, 236769, 3255, 718, 498, 1373, 262152, 106]` →
+  `<tool_0>(state="on")<end>` for *"Turn the light on."*
+- **Kotlin/Native–ready Gemma packed-weight path.** The `NATIVE_OPTIMIZED`
+  packed conversion was `jvmMain`-only (it built `MemSeg`/`Arena`-backed tensors
+  via `java.lang.foreign`), so the Kotlin/Native board binary couldn't keep
+  K-quant weights packed. The platform-neutral pieces now live in `commonMain`:
+  - **`GemmaQuantLayout.kt`** (`commonMain`) — `logicalShapeFor`,
+    `relayoutKSeriesRowMajorToBlockMajor` (KMP-safe `copyInto`), and
+    `packGemmaKQuant<T>()`, which builds heap-packed Q4_K/Q5_K/Q6_K
+    `BlockTensorData` directly with no `MemSeg`/`Arena`.
+  - **`GemmaPackedWeights.kt`** (`commonMain`) — `convertGemmaWeightsPacked`
+    packs Q4/Q5/Q6_K matmul weights to heap `Q*_KBlockTensorData`, dequants
+    `token_embd`/`output` to FP32 (gathered, no transpose) and any other quant
+    type to FP32 `[out, in]`. `extractRawBytes` reads the loader's bytes back
+    across both backings (JVM `IntArrayTensorData` / native `Byte`-typed).
+  - **`GemmaNetworkLoader.load()`** now runs `convertGemmaWeightsPacked` before
+    `applyWeightsToNetwork` under `NATIVE_OPTIMIZED`, so `load(NATIVE_OPTIMIZED)`
+    yields a runnable network on the board *and* the JVM (previously it could not
+    be built from raw-byte weights at all). `GemmaMemSegConverter` (`jvmMain`)
+    now shares the `commonMain` helpers; only the `MemSeg`/FFM conversion and the
+    FP32 fallbacks stay JVM-only.
+  Verified on JVM and `linuxX64` (`GemmaQuantLayoutTest`): relayout, packing, and
+  the native byte-extraction round-trip run on every target, and
+  `GemmaQ5KPackedParityTest` confirms all three paths (FP32 baseline, `jvmMain`
+  MemSeg-packed, `load()` packed) produce the identical token sequence.
+
+### Changed
+
+- **`gradle/libs.versions.toml` `skainet` pin: 0.28.1 → 0.30.0.** Picks up the
+  released Q5_K packed matmul, the NEON native kernels, and the Kotlin/Native
+  cinterop. Downstream consumers get the upstream SKaiNET BOM transparently via
+  `:llm-bom`, so no per-consumer migration is needed.
+- **`gradle.properties` `VERSION_NAME=0.30.0`.** Lock-step with the engine.
+- **`settings.gradle.kts` reverts the `mavenLocal()`-first dev shim.** The
+  ordering added while consuming the in-progress local SKaiNET `0.29.1` is no
+  longer needed now that 0.30.0 is on Maven Central; the release resolves the
+  engine purely from Central. The opt-in `-PuseLocalSkainet` composite build is
+  unchanged for local engine work.
+
+### Fixed
+
+- **``fix(gemma): dequant kernel-less quant types in `NATIVE_OPTIMIZED` instead of
+  leaving raw bytes``.** Loading a Gemma GGUF whose attention/FFN weights used a
+  quant type with no packed SIMD kernel (e.g. Q5_1) under
+  `QuantPolicy.NATIVE_OPTIMIZED` crashed at the first decode step
+  (`Transpose requires at least 2 dimensions` in `MultiHeadAttention` →
+  `linearProject`): `GemmaMemSegConverter.convertOne` left every unhandled quant
+  type as raw 1-D bytes. Kernel-less types now dequantize to a correct FP32
+  `[out, in]` weight via a new `dequantPackedToFp32` helper (mirroring the proven
+  `Gemma4WeightLoader.createTensor` column-major → row-major transpose). The
+  supported packed types (Q4_0/Q8_0/Q4_K/Q6_K) keep their fast SIMD form; only
+  kernel-less types pay the FP32 dequant.
+- **`` fix(llama): dequantize Q4_1 (and all non-packed quant types) in
+  `DecoderGgufMemSegConverter` ``.** The converter handled only Q4_0/Q8_0 (packed)
+  and Q4_K/Q5_K/Q6_K (dequant); every other quant type fell through an `else`
+  branch that logged a warning and passed the raw quant bytes through unchanged,
+  crashing deep inside matmul (e.g. `unsupported quant type Q4_1 for
+  blk.0.ffn_down.weight` on Q4_1 Qwen3 models). The `else` branch now routes
+  through `DequantOps.dequantFromBytes` to FP32, covering Q4_1, Q5_0, Q5_1, Q8_1,
+  IQ4_NL/XS, TQ1/2_0, etc.; genuinely unknown types now fail explicitly at load
+  time instead of crashing later inside matmul. Closes
+  [#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654).
+
+### Tests / CI
+
+- **`GemmaQ5KPackedParityTest`** — byte-identical decode parity across the FP32
+  baseline, the `jvmMain` MemSeg-packed path, and the `load(NATIVE_OPTIMIZED)`
+  `commonMain` packed path.
+- **`GemmaQuantLayoutTest`** (`commonTest`) — block-transpose relayout, packing,
+  and the byte-extraction round-trip; runs on JVM and `linuxX64`.
+- **`DecoderGgufMemSegConverterTest`** — regression that a Q4_1 weight is
+  dequantized to its logical 2-D FP32 shape rather than passed through as 1-D
+  bytes.
+- **`` fix(gemma): macosArm64 target for `gemma-iree` ``** and CI parity fixes:
+  MLIR-dump tests write to a portable build dir instead of a hardcoded local
+  path; browser Mocha gets a 60 s timeout (parity with the engine repo).
+- **`test(gemma): repoint stale FunctionGemma GGUF path`** — six real-model
+  integration tests now point at the in-repo
+  `sl2610-function-calling/models/` location, matching
+  `GemmaQ5KPackedParityTest`; all pass against the published SKaiNET 0.30.0
+  (`-PincludeIntegration`).
+
 ## [0.28.1] — 2026-06-06
 
 Version-aligned with **SKaiNET 0.28.1**. Skips 0.26.x / 0.27.x —
@@ -385,6 +489,8 @@ Version-aligned with **SKaiNET 0.21.0**.
 Last published transformers release before the engine-aligned version line.
 See `git log v0.16.0..0.18.0` for details.
 
+[0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
+[0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
 [0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
 [0.21.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.1
 [0.21.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.21.0
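
Reviewer note on the 0.30.0 entry above: the "relayout to block-major" step can be pictured as a transpose over whole quant blocks. The sketch below is only an illustration under stated assumptions. `relayoutRowMajorToBlockMajorSketch` is a hypothetical name, not the `relayoutKSeriesRowMajorToBlockMajor` helper from `GemmaQuantLayout.kt`, and the target order (every row's block 0, then every row's block 1, and so on) is an assumed reading of "block-major". The only figure taken from the changelog is the 176-byte Q5_K block size.

```kotlin
// Hypothetical illustration; not the SKaiNET-transformers implementation.
const val Q5_K_BLOCK_BYTES = 176 // per the changelog: 176 B/block

/**
 * Assumed layouts (both are assumptions made for this sketch):
 *  - source: row-major,   block (row, blk) starts at (row * blocksPerRow + blk) * blockBytes
 *  - target: block-major, block (row, blk) starts at (blk * rows + row) * blockBytes
 */
fun relayoutRowMajorToBlockMajorSketch(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int = Q5_K_BLOCK_BYTES,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) {
        "expected ${rows * blocksPerRow * blockBytes} bytes, got ${src.size}"
    }
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (blk in 0 until blocksPerRow) {
            val from = (row * blocksPerRow + blk) * blockBytes
            val to = (blk * rows + row) * blockBytes
            // copyInto is part of the common Kotlin stdlib, so it runs on JVM and Native alike.
            src.copyInto(dst, destinationOffset = to, startIndex = from, endIndex = from + blockBytes)
        }
    }
    return dst
}
```

Each block is moved as an opaque unit, so the bytes inside a block are never reinterpreted; only the order of blocks changes. Applying the same function with `rows` and `blocksPerRow` swapped undoes the relayout under these assumed layouts.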
diff --git a/README.md b/README.md
index f5901dd0..a2d7681d 100644
--- a/README.md
+++ b/README.md
@@ -103,22 +103,21 @@ Honest status — see the project-status note at the top of this README.
 
 ## Current release
 
-The current release is **0.28.1** — version-aligned with **SKaiNET 0.28.1**.
-Skips 0.26.x / 0.27.x: SKaiNET-transformers tracked the engine internally across
-that window without a tagged release. The headline is that the engine's
-**Kotlin DSL → StableHLO → IREE export path is now complete** — a full gemma3
-graph traces and lowers to StableHLO that `iree-compile`s to a `vmfb`
-(`GemmaMlirDumpTest` / `GemmaTraceTest` are green against 0.28.1). SKaiNET
-0.28.0/0.28.1 fixed the remaining export bugs: result-type inference for
-`reshape`/`matmul`/`concatenate` ([#673](https://github.com/SKaiNET-developers/SKaiNET/issues/673))
-and `conv1d`/`gather`/pooling/`flatten` shapes plus the `reduce_window` emission
-form ([#675](https://github.com/SKaiNET-developers/SKaiNET/issues/675)).
+The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
+Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
+window without a tagged release. The headline is that **Q5_K weights now stay
+packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
+packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
+**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
+without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
+decodes byte-identically across the FP32 baseline and both packed paths
+(`GemmaQ5KPackedParityTest`).
 
 The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
 
 ```kotlin
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
 
     // Versions resolved from the BOM:
     implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -195,6 +194,27 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
 
 See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
 
+## What's new in 0.30.0
+
+- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to
+  dequantize Q5_K weights to FP32 on load; SKaiNET 0.30.0 provides a first-class
+  Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`), so the converter
+  now relayouts the GGUF bytes to block-major and keeps them packed (176 B/block).
+  FunctionGemma-270M (`Q5_K_M`) decodes byte-identically to the FP32 baseline
+  (`GemmaQ5KPackedParityTest`).
+- **Gemma `NATIVE_OPTIMIZED` path is Kotlin/Native–ready.** The reusable layout +
+  packing helpers (`GemmaQuantLayout.kt`, `GemmaPackedWeights.kt`) moved to
+  `commonMain`, and `GemmaNetworkLoader.load()` now runs `convertGemmaWeightsPacked`
+  under `NATIVE_OPTIMIZED` — so the board binary keeps K-quant weights packed with
+  no `java.lang.foreign` MemSeg dependency. Verified on JVM and `linuxX64`.
+- **Engine pin `skainet 0.28.1 → 0.30.0`** — released Q5_K packed matmul, NEON
+  native kernels, and Kotlin/Native cinterop. The `mavenLocal()`-first dev shim is
+  reverted; the release resolves the engine from Maven Central.
+- **Fixes.** Kernel-less quant types under `NATIVE_OPTIMIZED` now dequant to FP32
+  `[out, in]` instead of crashing on a rank-1 transpose; `DecoderGgufMemSegConverter`
+  dequantizes Q4_1 and every other non-packed quant type instead of passing raw
+  bytes through to a matmul crash ([#654](https://github.com/SKaiNET-developers/SKaiNET-transformers/issues/654)).
+
 ## What's new in 0.28.1
 
 - **Engine pin `skainet 0.27.0 → 0.28.1`.** Picks up the completed Kotlin DSL →
diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
index d5e51c88..87548dcf 100644
--- a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
+++ b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
     <dependency>
       <groupId>sk.ainet.transformers</groupId>
       <artifactId>skainet-transformers-bom</artifactId>
-      <version>0.28.1</version>
+      <version>0.30.0</version>
       <type>pom</type>
       <scope>import</scope>
     </dependency>
diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
index 710da06b..07f123c7 100644
--- a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
+++ b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.28.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
diff --git a/gradle.properties b/gradle.properties
index 7efd6ccd..1987d82c 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.transformers
-VERSION_NAME=0.28.1
+VERSION_NAME=0.30.0
 
 POM_DESCRIPTION=SKaiNET-transformers
 
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
index 66e7fb68..5aa078ed 100644
--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
@@ -1,5 +1,5 @@
 [versions]
-skainet = "0.28.1"
+skainet = "0.30.0"
 agp = "9.2.1"
 jacksonDatabind = "2.22.0"
 jsonSchemaValidator = "3.0.3"
diff --git a/llm-agent/api/jvm/llm-agent.api b/llm-agent/api/jvm/llm-agent.api
index edde6a76..54b8610a 100644
--- a/llm-agent/api/jvm/llm-agent.api
+++ b/llm-agent/api/jvm/llm-agent.api
@@ -1,6 +1,6 @@
 public final class sk/ainet/apps/kllama/agent/GenerateExtensionsKt {
-	public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;)Lsk/ainet/apps/kllama/agent/GenerateResult;
-	public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult;
+	public static final fun generateUntilStop (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;)Lsk/ainet/apps/kllama/agent/GenerateResult;
+	public static synthetic fun generateUntilStop$default (Lsk/ainet/apps/llm/InferenceRuntime;[IIIFLkotlin/random/Random;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function1;Lkotlin/jvm/functions/Function2;ILjava/lang/Object;)Lsk/ainet/apps/kllama/agent/GenerateResult;
 	public static final fun sampleFromLogits (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;)I
 	public static synthetic fun sampleFromLogits$default (Lsk/ainet/lang/tensor/Tensor;FLkotlin/random/Random;ILjava/lang/Object;)I
 }
@@ -45,6 +45,7 @@ public final class sk/ainet/apps/kllama/chat/AgentConfig {
 public abstract interface class sk/ainet/apps/kllama/chat/AgentListener {
 	public fun onAssistantMessage (Ljava/lang/String;)V
 	public fun onComplete (Ljava/lang/String;)V
+	public fun onPrefillProgress (II)V
 	public fun onThinking (Ljava/lang/String;)V
 	public fun onToken (Ljava/lang/String;)V
 	public fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V
@@ -55,6 +56,7 @@ public abstract interface class sk/ainet/apps/kllama/chat/AgentListener {
 public final class sk/ainet/apps/kllama/chat/AgentListener$DefaultImpls {
 	public static fun onAssistantMessage (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
 	public static fun onComplete (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
+	public static fun onPrefillProgress (Lsk/ainet/apps/kllama/chat/AgentListener;II)V
 	public static fun onThinking (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
 	public static fun onToken (Lsk/ainet/apps/kllama/chat/AgentListener;Ljava/lang/String;)V
 	public static fun onToolCallValidationFailed (Lsk/ainet/apps/kllama/chat/AgentListener;Lsk/ainet/apps/kllama/chat/ToolCall;Ljava/lang/String;)V
diff --git a/llm-core/api/jvm/llm-core.api b/llm-core/api/jvm/llm-core.api
index 5d72b5a3..aecfb28d 100644
--- a/llm-core/api/jvm/llm-core.api
+++ b/llm-core/api/jvm/llm-core.api
@@ -543,8 +543,8 @@ public final class sk/ainet/lang/nn/dsl/ATTENTION$DefaultImpls {
 }
 
 public final class sk/ainet/lang/nn/dsl/AttentionImpl : sk/ainet/lang/nn/dsl/ATTENTION {
-	public fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;)V
-	public synthetic fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
+	public fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;Lkotlin/reflect/KClass;)V
+	public synthetic fun <init> (Lsk/ainet/context/ExecutionContext;IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Ljava/lang/Integer;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public final fun create ()Lsk/ainet/lang/nn/transformer/MultiHeadAttention;
 	public fun getExecutionContext ()Lsk/ainet/context/ExecutionContext;
 	public fun kvCache (III)V
@@ -653,8 +653,8 @@ public abstract interface class sk/ainet/lang/nn/normalization/FusedRmsNormOps {
 }
 
 public final class sk/ainet/lang/nn/normalization/RMSNormalization : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters {
-	public fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;Z)V
-	public synthetic fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZILkotlin/jvm/internal/DefaultConstructorMarker;)V
+	public fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZLkotlin/reflect/KClass;)V
+	public synthetic fun <init> ([IDLjava/lang/String;Lsk/ainet/lang/tensor/Tensor;ZLkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public fun forward (Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/lang/tensor/Tensor;
 	public fun getModules ()Ljava/util/List;
 	public fun getName ()Ljava/lang/String;
@@ -670,8 +670,8 @@ public final class sk/ainet/lang/nn/transformer/AppendKVCache : sk/ainet/lang/nn
 }
 
 public final class sk/ainet/lang/nn/transformer/GeGLUFFN : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters {
-	public fun <init> (IILjava/lang/String;)V
-	public synthetic fun <init> (IILjava/lang/String;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
+	public fun <init> (IILjava/lang/String;Lkotlin/reflect/KClass;)V
+	public synthetic fun <init> (IILjava/lang/String;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public final fun getDim ()I
 	public final fun getHiddenDim ()I
 	public fun getModules ()Ljava/util/List;
@@ -695,8 +695,8 @@ public abstract class sk/ainet/lang/nn/transformer/KVCache : sk/ainet/lang/nn/Mo
 
 public final class sk/ainet/lang/nn/transformer/LayerScalarMul : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters {
 	public fun <init> ()V
-	public fun <init> (Ljava/lang/String;)V
-	public synthetic fun <init> (Ljava/lang/String;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
+	public fun <init> (Ljava/lang/String;Lkotlin/reflect/KClass;)V
+	public synthetic fun <init> (Ljava/lang/String;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public fun getModules ()Ljava/util/List;
 	public fun getName ()Ljava/lang/String;
 	public fun getParams ()Ljava/util/List;
@@ -707,8 +707,8 @@ public final class sk/ainet/lang/nn/transformer/LinearProjectionKt {
 }
 
 public final class sk/ainet/lang/nn/transformer/MultiHeadAttention : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters {
-	public fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;)V
-	public synthetic fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
+	public fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;Lkotlin/reflect/KClass;)V
+	public synthetic fun <init> (IIIZZZDLjava/lang/Float;ZZLjava/lang/String;Lsk/ainet/lang/nn/transformer/RoPE;Lsk/ainet/lang/nn/transformer/KVCache;Ljava/lang/Integer;Ljava/lang/Integer;Lkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public final fun forward (Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/lang/tensor/Tensor;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/lang/tensor/Tensor;
 	public final fun getAttentionScale ()Ljava/lang/Float;
 	public final fun getBias ()Z
@@ -847,7 +847,8 @@ public final class sk/ainet/lang/nn/transformer/SwiGLUFFN : sk/ainet/lang/nn/Mod
 }
 
 public final class sk/ainet/lang/nn/transformer/VoidDense : sk/ainet/lang/nn/Module, sk/ainet/lang/nn/topology/ModuleParameters {
-	public fun <init> (Ljava/lang/String;II)V
+	public fun <init> (Ljava/lang/String;IILkotlin/reflect/KClass;)V
+	public synthetic fun <init> (Ljava/lang/String;IILkotlin/reflect/KClass;ILkotlin/jvm/internal/DefaultConstructorMarker;)V
 	public final fun getInDim ()I
 	public fun getModules ()Ljava/util/List;
 	public fun getName ()Ljava/lang/String;
diff --git a/llm-inference/gemma/api/jvm/gemma.api b/llm-inference/gemma/api/jvm/gemma.api
index 57fcd67f..4483f8cd 100644
--- a/llm-inference/gemma/api/jvm/gemma.api
+++ b/llm-inference/gemma/api/jvm/gemma.api
@@ -865,6 +865,10 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
 	public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
 }
 
+public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {
+	public static final fun convertGemmaWeightsPacked (Lsk/ainet/models/gemma/Gemma4Weights;Lsk/ainet/context/ExecutionContext;)Lsk/ainet/models/gemma/Gemma4Weights;
+}
+
 public final class sk/ainet/models/gemma/GemmaPerLayerTokenEmbedTensorData : sk/ainet/lang/tensor/data/TensorData, sk/ainet/models/gemma/RowDequantSource {
 	public fun <init> (Lsk/ainet/lang/tensor/Shape;Lsk/ainet/io/gguf/GGMLQuantizationType;[B)V
 	public fun copyToFloatArray ()[F
diff --git a/llm-inference/gemma/build.gradle.kts b/llm-inference/gemma/build.gradle.kts
index 24ea30d7..f541c944 100644
--- a/llm-inference/gemma/build.gradle.kts
+++ b/llm-inference/gemma/build.gradle.kts
@@ -88,9 +88,16 @@ kotlin {
     }
 }
 
+// Real-model (FunctionGemma-270M) integration tests (run with -PincludeIntegration)
+// dequantize ~270M params to FP32, and GemmaQ5KPackedParityTest holds the FP32
+// baseline plus both packed decode networks at once; the bake-to-irpa test holds
+// weights + serialized bytes simultaneously. 8g OOMs once the real model is
+// present, so default to 12g — override via -PgemmaTestMaxHeap (CI without the
+// model file self-skips these and never needs the headroom).
 tasks.withType<Test>().configureEach {
     jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector")
-    maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "6g"
+    maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "12g"
+    (findProperty("seqLen") as? String)?.let { systemProperty("seqLen", it) }
 }
 
 // Kotlin/JS + Kotlin/WASM browser test runners have two separate problems on
@@ -109,11 +116,3 @@ tasks.matching { it.name == "jsBrowserTest" || it.name == "wasmJsBrowserTest" }.
         ?.failOnNoDiscoveredTests = false
     enabled = includeBrowserTests
 }
-
-// Real-model (FunctionGemma-270M) tests dequantize ~270M params to FP32 and the
-// bake-to-irpa test holds weights + serialized bytes simultaneously; allow an
-// override via -PgemmaTestMaxHeap (default 8g).
-tasks.withType<Test>().configureEach {
-    maxHeapSize = (findProperty("gemmaTestMaxHeap") as? String) ?: "8g"
-    (findProperty("seqLen") as? String)?.let { systemProperty("seqLen", it) }
-}
diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt
index f73b3ac8..abc8c3e3 100644
--- a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt
+++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaNetworkLoader.kt
@@ -122,7 +122,7 @@ public class GemmaNetworkLoader @PublishedApi internal constructor(
     public suspend inline fun <reified T : DType, V> load(
         ctx: ExecutionContext
     ): Module<T, V> {
-        val weights: Gemma4Weights<T, V> = when (val wp = weightsProvider) {
+        val rawWeights: Gemma4Weights<T, V> = when (val wp = weightsProvider) {
             is WeightsProvider.GgufSource -> {
                 val loader = Gemma4WeightLoader(wp.sourceProvider, quantPolicy = wp.quantPolicy)
                 loader.loadToMap<T, V>(ctx)
@@ -142,6 +142,24 @@ public class GemmaNetworkLoader @PublishedApi internal constructor(
             }
         }
 
+        // NATIVE_OPTIMIZED yields raw-byte quant tensors the network mapper can't
+        // consume directly. Pack them (heap Q4/5/6_K + FP32 fallback) here — this
+        // is commonMain so it works on Kotlin/Native (the board) as well as the
+        // JVM, and replaces the JVM-only `convertGemmaWeightsToMemSeg` for the
+        // `load()` entry point.
+        val ggufPolicy = when (val wp = weightsProvider) {
+            is WeightsProvider.GgufSource -> wp.quantPolicy
+            is WeightsProvider.GgufRandomAccess -> wp.quantPolicy
+            else -> null
+        }
+        val weights: Gemma4Weights<T, V> =
+            if (ggufPolicy == QuantPolicy.NATIVE_OPTIMIZED) {
+                @Suppress("UNCHECKED_CAST")
+                convertGemmaWeightsPacked(rawWeights, ctx) as Gemma4Weights<T, V>
+            } else {
+                rawWeights
+            }
+
         return applyWeightsToNetwork(ctx, weights)
     }
 
diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt
new file mode 100644
index 00000000..ec52eb4c
--- /dev/null
+++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaPackedWeights.kt
@@ -0,0 +1,125 @@
+package sk.ainet.models.gemma
+
+import sk.ainet.context.ExecutionContext
+import sk.ainet.io.gguf.GGMLQuantizationType
+import sk.ainet.io.gguf.dequant.DequantOps
+import sk.ainet.lang.tensor.Shape
+import sk.ainet.lang.tensor.Tensor
+import sk.ainet.lang.tensor.data.IntArrayTensorData
+import sk.ainet.lang.tensor.data.TensorData
+import sk.ainet.lang.types.DType
+import sk.ainet.lang.types.FP32
+
+/**
+ * commonMain (Kotlin/Native-capable) analogue of the jvmMain
+ * `convertGemmaWeightsToMemSeg`. Converts the raw-byte quantized tensors a
+ * `NATIVE_OPTIMIZED` load produces into the forms the DSL matmul path consumes:
+ *
+ * - **Q4_K / Q5_K / Q6_K matmul weights** → heap-packed `Q{4,5,6}_KBlockTensorData`
+ *   (via [packGemmaKQuant], with the row-major→block-major relayout). These keep
+ *   the GGUF footprint and run the in-kernel dequant matmul (NEON on the board).
+ * - **token_embd / output** → FP32 dequant in canonical `[vocab, embed]` order
+ *   (the embedding is gathered, not matmul'd, so no transpose).
+ * - **everything else quantized** → FP32 dequant transposed to `[out, in]`
+ *   row-major so `linearProject` (`x @ W.t()`) is correct.
+ *
+ * Unlike the MemSeg converter this uses no `java.lang.foreign` — it runs on the
+ * SL2610 board binary (Kotlin/Native) as well as the JVM. The JVM still prefers
+ * the MemSeg path (lazy transpose + Q4/Q8 MemSeg); this is the board path.
+ */
+public fun convertGemmaWeightsPacked(
+    weights: Gemma4Weights<*, *>,
+    ctx: ExecutionContext,
+): Gemma4Weights<*, *> {
+    @Suppress("UNCHECKED_CAST")
+    val typed = weights as Gemma4Weights<DType, Any>
+    val quantTypes = typed.quantTypes
+    if (quantTypes.isEmpty()) return weights
+
+    val logicalShapes = typed.logicalShapes
+    val newTensors = linkedMapOf<String, Tensor<DType, Any>>()
+    for ((name, tensor) in typed.tensors) {
+        val qt = quantTypes[name]
+        newTensors[name] = when {
+            qt == null -> tensor // not quantized
+            else -> {
+                val shape = logicalShapes[name] ?: logicalShapeFor(name, typed.metadata)
+                if (shape == null) {
+                    tensor // unknown 2-D layout — leave as-is
+                } else {
+                    val bytes = extractRawBytes(tensor.data)
+                    val isEmbed = name == Gemma4TensorNames.TOKEN_EMBEDDINGS ||
+                        name == Gemma4TensorNames.OUTPUT_WEIGHT
+                    val packed = if (!isEmbed) packGemmaKQuant<FP32>(bytes, qt, shape) else null
+                    when {
+                        packed != null -> {
+                            @Suppress("UNCHECKED_CAST")
+                            ctx.fromData(packed as TensorData<FP32, Float>, FP32::class) as Tensor<DType, Any>
+                        }
+                        isEmbed -> dequantNoTranspose(bytes, qt, shape, ctx)
+                        else -> dequantTransposed(bytes, qt, shape, ctx)
+                    }
+                }
+            }
+        }
+    }
+    @Suppress("UNCHECKED_CAST")
+    return Gemma4Weights(typed.metadata, newTensors, typed.quantTypes, typed.logicalShapes) as Gemma4Weights<*, *>
+}
+
+/** Dequant to FP32 in natural `[rows, cols]` order (embeddings — gathered, not matmul'd). */
+@Suppress("UNCHECKED_CAST")
+private fun dequantNoTranspose(
+    bytes: ByteArray,
+    qt: GGMLQuantizationType,
+    shape: Shape,
+    ctx: ExecutionContext,
+): Tensor<DType, Any> {
+    val floats = DequantOps.dequantFromBytes(bytes, qt, shape.volume)
+    return ctx.fromFloatArray<FP32, Float>(shape, FP32::class, floats) as Tensor<DType, Any>
+}
+
+/**
+ * Dequant to a canonical FP32 `[out, in]` row-major weight. GGUF stores K/legacy
+ * blocks column-major within a row, so the dequantized floats are transposed
+ * column-major → row-major to match what `linearProject` (`x @ W.t()`) expects.
+ */
+@Suppress("UNCHECKED_CAST")
+private fun dequantTransposed(
+    bytes: ByteArray,
+    qt: GGMLQuantizationType,
+    shape: Shape,
+    ctx: ExecutionContext,
+): Tensor<DType, Any> {
+    val floats = DequantOps.dequantFromBytes(bytes, qt, shape.volume)
+    val out = shape[0]
+    val inDim = shape[1]
+    val rowMajor = DequantOps.transposeColumnMajorToRowMajor(floats, inDim, out)
+    return ctx.fromFloatArray<FP32, Float>(shape, FP32::class, rowMajor) as Tensor<DType, Any>
+}
+
+/**
+ * Read the raw packed bytes back from a `NATIVE_OPTIMIZED` quant tensor. The
+ * backing differs by platform/factory — JVM stores `IntArrayTensorData` (byte
+ * values widened to Int); Kotlin/Native stores a Byte-typed tensor — so handle
+ * both element types.
+ */
+internal fun extractRawBytes(data: TensorData<*, *>): ByteArray {
+    if (data is IntArrayTensorData<*>) {
+        val buf = data.buffer
+        return ByteArray(buf.size) { buf[it].toByte() }
+    }
+    val n = data.shape.volume
+    @Suppress("UNCHECKED_CAST")
+    val d = data as TensorData<*, Any?>
+    return ByteArray(n) {
+        when (val v = d[it]) {
+            is Byte -> v
+            is Int -> v.toByte()
+            else -> error(
+                "extractRawBytes: cannot read bytes from ${data::class.simpleName} " +
+                    "(element ${v?.let { e -> e::class.simpleName }})",
+            )
+        }
+    }
+}
diff --git a/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt
new file mode 100644
index 00000000..7f4e7b9f
--- /dev/null
+++ b/llm-inference/gemma/src/commonMain/kotlin/sk/ainet/models/gemma/GemmaQuantLayout.kt
@@ -0,0 +1,121 @@
+package sk.ainet.models.gemma
+
+import sk.ainet.io.gguf.GGMLQuantizationType
+import sk.ainet.lang.tensor.Shape
+import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q6_KBlockTensorData
+import sk.ainet.lang.tensor.data.TensorData
+import sk.ainet.lang.types.DType
+
+/**
+ * Platform-neutral (commonMain) layout helpers for Gemma 4 quantized weights.
+ *
+ * These were previously JVM-only (inside `GemmaMemSegConverter`), but the
+ * Kotlin/Native board path needs the same logic: on K/N there is no
+ * `java.lang.foreign` MemSeg conversion, so the eager runtime keeps K-quant
+ * weights as heap-packed `Q{4,5,6}_KBlockTensorData` produced here. The JVM
+ * MemSeg converter reuses the same relayout + shape recovery.
+ */
+
+/**
+ * Recover the logical 2-D shape of a Gemma 4 weight tensor from its GGUF name
+ * and model metadata. `Gemma4WeightLoader` with `NATIVE_OPTIMIZED` stores
+ * quantized tensors as 1-D byte arrays, so converters need the original
+ * `[rows, cols]` shape to re-layout blocks. Returns `null` for any tensor name
+ * not matched below (norms and other tensors without a 2-D matmul layout).
+ */
+internal fun logicalShapeFor(name: String, metadata: Gemma4ModelMetadata): Shape? {
+    val embed = metadata.embeddingLength
+    val vocab = metadata.vocabSize
+    return when {
+        name == Gemma4TensorNames.TOKEN_EMBEDDINGS -> Shape(vocab, embed)
+        name == Gemma4TensorNames.OUTPUT_WEIGHT -> Shape(vocab, embed)
+        name.startsWith("blk.") -> {
+            val rest = name.substringAfter("blk.")
+            val layer = rest.substringBefore('.').toIntOrNull() ?: return null
+            val headDim = metadata.getHeadDim(layer)
+            val qDim = metadata.headCount * headDim
+            val kvDim = metadata.kvHeadCount * headDim
+            val ffn = metadata.intermediateSize
+            when {
+                name.endsWith(".attn_q.weight") -> Shape(qDim, embed)
+                name.endsWith(".attn_k.weight") -> Shape(kvDim, embed)
+                name.endsWith(".attn_v.weight") -> Shape(kvDim, embed)
+                name.endsWith(".attn_output.weight") -> Shape(embed, qDim)
+                name.endsWith(".ffn_gate.weight") -> Shape(ffn, embed)
+                name.endsWith(".ffn_up.weight") -> Shape(ffn, embed)
+                name.endsWith(".ffn_down.weight") -> Shape(embed, ffn)
+                else -> null
+            }
+        }
+        else -> null
+    }
+}
+
+/**
+ * Re-layout GGUF K-series bytes from row-major block order
+ * (`(r * blocksPerRow + b) * bytesPerBlock`) to the input-block-major order the
+ * `matmulQ{K}` kernels expect (`(b * outDim + r) * bytesPerBlock`). For a
+ * `[outDim, inDim]` weight with `inDim % 256 == 0`, this is a block-level 2-D
+ * transpose; bytes inside a block are untouched.
+ *
+ * @param bytesPerBlock 144 (Q4_K), 176 (Q5_K), 210 (Q6_K).
+ */
+internal fun relayoutKSeriesRowMajorToBlockMajor(
+    bytes: ByteArray,
+    shape: Shape,
+    bytesPerBlock: Int,
+): ByteArray {
+    val blockSize = 256
+    require(shape.rank == 2) { "K-series weight must be 2D, got rank ${shape.rank}" }
+    val outDim = shape[0]
+    val inDim = shape[1]
+    require(inDim % blockSize == 0) { "K-series weight inDim ($inDim) must be a multiple of $blockSize" }
+    val blocksPerRow = inDim / blockSize
+    val expected = outDim.toLong() * blocksPerRow.toLong() * bytesPerBlock.toLong()
+    require(bytes.size.toLong() >= expected) {
+        "K-series byte buffer ${bytes.size} < expected $expected for [$outDim, $inDim] @ ${bytesPerBlock}B/block"
+    }
+    val out = ByteArray(bytes.size)
+    for (r in 0 until outDim) {
+        for (b in 0 until blocksPerRow) {
+            val srcOff = (r * blocksPerRow + b) * bytesPerBlock
+            val dstOff = (b * outDim + r) * bytesPerBlock
+            bytes.copyInto(out, dstOff, srcOff, srcOff + bytesPerBlock)
+        }
+    }
+    return out
+}
+
+/** Bytes per ggml block for the K-quant types this packer handles. */
+private fun kQuantBytesPerBlock(qt: GGMLQuantizationType): Int? = when (qt) {
+    GGMLQuantizationType.Q4_K -> 144
+    GGMLQuantizationType.Q5_K -> 176
+    GGMLQuantizationType.Q6_K -> 210
+    else -> null
+}
+
+/**
+ * Pack raw GGUF K-quant `bytes` of logical `[out, in]` shape into the
+ * heap-packed block tensor data the matmul kernels read directly (Q4_K / Q5_K /
+ * Q6_K). Performs the row-major → block-major relayout. Returns `null` for
+ * non-K-quant types (caller dequantizes those to FP32).
+ *
+ * commonMain → works on JVM and Kotlin/Native alike (no MemSeg / Arena).
+ */
+internal fun <T : DType> packGemmaKQuant(
+    bytes: ByteArray,
+    qt: GGMLQuantizationType,
+    shape: Shape,
+): TensorData<T, *>? {
+    val bpb = kQuantBytesPerBlock(qt) ?: return null
+    val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, bpb)
+    @Suppress("UNCHECKED_CAST")
+    return when (qt) {
+        GGMLQuantizationType.Q4_K -> Q4_KBlockTensorData(shape, relaid) as TensorData<T, *>
+        GGMLQuantizationType.Q5_K -> Q5_KBlockTensorData(shape, relaid) as TensorData<T, *>
+        GGMLQuantizationType.Q6_K -> Q6_KBlockTensorData(shape, relaid) as TensorData<T, *>
+        else -> null
+    }
+}
diff --git a/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
new file mode 100644
index 00000000..52a1cdd1
--- /dev/null
+++ b/llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt
@@ -0,0 +1,73 @@
+package sk.ainet.models.gemma
+
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNull
+import kotlin.test.assertTrue
+import sk.ainet.context.DirectCpuExecutionContext
+import sk.ainet.io.gguf.GGMLQuantizationType
+import sk.ainet.lang.tensor.Shape
+import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
+import sk.ainet.lang.types.FP32
+import sk.ainet.lang.types.Int8
+
+/**
+ * Unit tests for the commonMain (board-shareable) Gemma quant layout helpers.
+ * These run on every target (JVM + Kotlin/Native), proving the K/N board path's
+ * relayout + packing logic without needing the full model.
+ */
+class GemmaQuantLayoutTest {
+
+    @Test
+    fun relayout_is_block_level_transpose() {
+        // [outDim=2, inDim=512] -> blocksPerRow=2, 4 Q5_K blocks of 176 B.
+        val bpb = 176
+        val outDim = 2
+        val inDim = 512
+        val blocksPerRow = inDim / 256
+        val bytes = ByteArray(outDim * blocksPerRow * bpb)
+        // Tag each source block with its row-major index in its first byte.
+        for (i in 0 until outDim * blocksPerRow) bytes[i * bpb] = i.toByte()
+
+        val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, Shape(outDim, inDim), bpb)
+
+        // dst block (b*outDim + r) must hold src block (r*blocksPerRow + b).
+        for (r in 0 until outDim) {
+            for (b in 0 until blocksPerRow) {
+                val srcIdx = r * blocksPerRow + b
+                val dstIdx = b * outDim + r
+                assertEquals(srcIdx.toByte(), relaid[dstIdx * bpb], "block ($r,$b) misplaced")
+            }
+        }
+    }
+
+    @Test
+    fun pack_q5k_produces_block_tensor_with_relaid_bytes() {
+        val shape = Shape(2, 512)
+        val bytes = ByteArray(2 * 2 * 176)
+        for (i in 0 until 4) bytes[i * 176] = (i + 1).toByte()
+
+        val td = packGemmaKQuant<FP32>(bytes, GGMLQuantizationType.Q5_K, shape)
+        assertTrue(td is Q5_KBlockTensorData, "Q5_K should pack to Q5_KBlockTensorData")
+        // packedData is the block-major relayout of the input.
+        val expected = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 176)
+        assertTrue(expected.contentEquals(td.packedData))
+    }
+
+    @Test
+    fun pack_non_kquant_returns_null() {
+        assertNull(packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
+    }
+
+    @Test
+    fun extract_raw_bytes_roundtrips_on_every_platform() {
+        // The NATIVE_OPTIMIZED loader wraps quant bytes via ctx.fromByteArray<Int8,Byte>;
+        // extractRawBytes must read them back regardless of the platform backing
+        // (JVM IntArrayTensorData vs native Byte-typed). Runs on jvm + linuxX64.
+        val ctx = DirectCpuExecutionContext.create()
+        val bytes = ByteArray(176 * 3) { ((it * 31 + 7) and 0xFF).toByte() }
+        val t = ctx.fromByteArray<Int8, Byte>(Shape(bytes.size), Int8::class, bytes)
+        val got = extractRawBytes(t.data)
+        assertTrue(bytes.contentEquals(got), "extractRawBytes round-trip mismatch")
+    }
+}
diff --git a/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt b/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
index d3a4502f..191f2510 100644
--- a/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
+++ b/llm-inference/gemma/src/jvmMain/kotlin/sk/ainet/models/gemma/GemmaMemSegConverter.kt
@@ -8,6 +8,7 @@ import sk.ainet.lang.tensor.Shape
 import sk.ainet.lang.tensor.Tensor
 import sk.ainet.lang.tensor.data.IntArrayTensorData
 import sk.ainet.lang.tensor.data.Q4_KBlockTensorData
+import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q6_KBlockTensorData
 import sk.ainet.lang.tensor.data.Q4MemorySegmentTensorData
 import sk.ainet.lang.tensor.data.Q8MemorySegmentTensorData
@@ -15,44 +16,9 @@ import sk.ainet.lang.tensor.data.TensorData
 import sk.ainet.lang.types.DType
 import sk.ainet.lang.types.FP32
 
-/**
- * Recover the logical 2-D shape of a Gemma 4 weight tensor from its GGUF
- * name and the model metadata. `Gemma4WeightLoader` with
- * `NATIVE_OPTIMIZED` stores quantized tensors as 1-D byte arrays so the
- * tensor-data factory accepts them; the converter needs the original
- * shape to re-layout blocks and construct `Q4_KBlockTensorData` /
- * `Q4/Q8MemorySegmentTensorData`.
- *
- * Returns `null` for tensors that don't have a 2-D matmul layout (norms,
- * embeddings the converter wants to dequant anyway).
- */
-internal fun logicalShapeFor(name: String, metadata: Gemma4ModelMetadata): Shape? {
-    val embed = metadata.embeddingLength
-    val vocab = metadata.vocabSize
-    return when {
-        name == Gemma4TensorNames.TOKEN_EMBEDDINGS -> Shape(vocab, embed)
-        name == Gemma4TensorNames.OUTPUT_WEIGHT -> Shape(vocab, embed)
-        name.startsWith("blk.") -> {
-            val rest = name.substringAfter("blk.")
-            val layer = rest.substringBefore('.').toIntOrNull() ?: return null
-            val headDim = metadata.getHeadDim(layer)
-            val qDim = metadata.headCount * headDim
-            val kvDim = metadata.kvHeadCount * headDim
-            val ffn = metadata.intermediateSize
-            when {
-                name.endsWith(".attn_q.weight") -> Shape(qDim, embed)
-                name.endsWith(".attn_k.weight") -> Shape(kvDim, embed)
-                name.endsWith(".attn_v.weight") -> Shape(kvDim, embed)
-                name.endsWith(".attn_output.weight") -> Shape(embed, qDim)
-                name.endsWith(".ffn_gate.weight") -> Shape(ffn, embed)
-                name.endsWith(".ffn_up.weight") -> Shape(ffn, embed)
-                name.endsWith(".ffn_down.weight") -> Shape(embed, ffn)
-                else -> null
-            }
-        }
-        else -> null
-    }
-}
+// logicalShapeFor + relayoutKSeriesRowMajorToBlockMajor moved to commonMain
+// (GemmaQuantLayout.kt) so the Kotlin/Native board path shares them. This
+// JVM-only file keeps the MemSeg (FFM) conversion + the FP32 dequant fallbacks.
 
 /**
  * Convert raw-byte quantized tensors in a [Gemma4Weights] map (produced by
@@ -197,8 +163,14 @@ private fun <T : DType, V> convertOne(
             ctx.fromData(data as TensorData<FP32, Float>, advertisedDtype) as Tensor<T, V>
         }
         GGMLQuantizationType.Q5_K -> {
-            // No native matmul kernel yet for Q5_K. Fall back to a correct FP32 dequant.
-            dequantPackedToFp32<T, V>(bytes, qt, shape, ctx)
+            // Same packed-path treatment as Q4_K/Q6_K, enabled by the Q5_K
+            // matmul kernel (scalar/Panama/native) + the lazy Q5_K transpose
+            // in DefaultCpuOps. FunctionGemma-270M Q5_K_M ships most attn/FFN
+            // weights as Q5_K, so keeping them packed (176 B/block) avoids the
+            // FP32 inflation and runs the in-kernel dequant matmul.
+            val relaid = relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 176)
+            val data = Q5_KBlockTensorData.fromRawBytes(shape, relaid)
+            ctx.fromData(data as TensorData<FP32, Float>, advertisedDtype) as Tensor<T, V>
         }
         else -> {
             // Any other quant type without a packed SIMD kernel (Q5_0/Q5_1/Q4_1/Q2_K/…)
@@ -280,53 +252,9 @@ private fun <T : DType, V> dequantToFloat(
 }
 
 /**
- * Re-layout GGUF K-series bytes from row-major block order (block at row r,
- * block index b within row → byte offset `(r * blocksPerRow + b) * bytesPerBlock`)
- * to the input-block-major layout the `matmulQ{K}_Vec` kernels expect
- * (block at blockIdx bI for output row r → byte offset
- * `(bI * outDim + r) * bytesPerBlock`).
- *
- * For a weight of shape `[outDim, inDim]` with `inDim % 256 == 0` (the
- * K-series block size), this is just a 2D block-level transpose of the
- * `[outDim, inDim/256]` array of `bytesPerBlock`-byte blocks. Bytes
- * inside a block are untouched.
- *
- * @param bytes packed weight bytes in row-major [outDim, blocksPerRow] order
- * @param shape logical `[outDim, inDim]` shape
- * @param bytesPerBlock 144 for Q4_K, 210 for Q6_K (ggml block sizes)
- */
-internal fun relayoutKSeriesRowMajorToBlockMajor(
-    bytes: ByteArray,
-    shape: sk.ainet.lang.tensor.Shape,
-    bytesPerBlock: Int
-): ByteArray {
-    val blockSize = 256
-    require(shape.rank == 2) { "K-series weight must be 2D, got rank ${shape.rank}" }
-    val outDim = shape[0]
-    val inDim = shape[1]
-    require(inDim % blockSize == 0) {
-        "K-series weight inDim ($inDim) must be a multiple of $blockSize"
-    }
-    val blocksPerRow = inDim / blockSize
-    val expected = outDim.toLong() * blocksPerRow.toLong() * bytesPerBlock.toLong()
-    require(bytes.size.toLong() >= expected) {
-        "K-series byte buffer size ${bytes.size} < expected $expected for shape [$outDim, $inDim] @ ${bytesPerBlock}B/block"
-    }
-    val out = ByteArray(bytes.size)
-    for (r in 0 until outDim) {
-        for (b in 0 until blocksPerRow) {
-            val srcOff = (r * blocksPerRow + b) * bytesPerBlock
-            val dstOff = (b * outDim + r) * bytesPerBlock
-            System.arraycopy(bytes, srcOff, out, dstOff, bytesPerBlock)
-        }
-    }
-    return out
-}
-
-/**
- * Back-compat shim that delegates to [relayoutKSeriesRowMajorToBlockMajor]
- * at Q4_K's 144-byte block size. Kept for any callers outside this file
- * pinned to the old name.
+ * Back-compat shim that delegates to the commonMain
+ * [relayoutKSeriesRowMajorToBlockMajor] at Q4_K's 144-byte block size. Kept for
+ * any callers outside this file pinned to the old name.
  */
 internal fun relayoutQ4_KRowMajorToBlockMajor(bytes: ByteArray, shape: sk.ainet.lang.tensor.Shape): ByteArray =
     relayoutKSeriesRowMajorToBlockMajor(bytes, shape, 144)
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
index 406197c6..3f938609 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaBehavioralAbTest.kt
@@ -31,7 +31,7 @@ import kotlin.test.assertEquals
  */
 @Tag("integration")
 class GemmaBehavioralAbTest {
-    private val gguf = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+    private val gguf = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
 
     private fun argmax(a: FloatArray): Int {
         var bi = 0; var bv = a[0]
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt
new file mode 100644
index 00000000..1d4a7ad4
--- /dev/null
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/GemmaQ5KPackedParityTest.kt
@@ -0,0 +1,142 @@
+package sk.ainet.models.gemma
+
+import java.io.File
+import java.lang.foreign.Arena
+import kotlinx.coroutines.runBlocking
+import kotlinx.io.buffered
+import kotlinx.io.files.Path
+import kotlinx.io.files.SystemFileSystem
+import org.junit.jupiter.api.Assumptions
+import org.junit.jupiter.api.Tag
+import sk.ainet.apps.llm.OptimizedLLMMode
+import sk.ainet.apps.llm.OptimizedLLMRuntime
+import sk.ainet.apps.llm.tokenizer.GGUFTokenizer
+import sk.ainet.context.DirectCpuExecutionContext
+import sk.ainet.io.JvmRandomAccessSource
+import sk.ainet.io.model.QuantPolicy
+import sk.ainet.lang.types.FP32
+import kotlin.test.Test
+import kotlin.test.assertEquals
+
+/**
+ * End-to-end check that the NEW Q5_K packed in-kernel dequant path (upstream
+ * SKaiNET `Q5_KBlockTensorData` + `Q5KMatmulKernel`, wired here via
+ * [convertGemmaWeightsToMemSeg]) decodes FunctionGemma-270M (`Q5_K_M`)
+ * identically to the FP32-dequant baseline, and reports tokens/sec.
+ *
+ * Before this, the converter dequantized Q5_K weights to FP32 on load ("no
+ * native matmul kernel yet for Q5_K"). Now Q5_K stays packed (176 B/block)
+ * and runs the in-kernel dequant matmul. Both paths decode the same weights,
+ * so greedy argmax token sequences must match.
+ *
+ * Skips when the GGUF isn't present (CI without the checkpoint).
+ */
+@Tag("integration")
+class GemmaQ5KPackedParityTest {
+
+    private val gguf =
+        "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+
+    private fun argmax(a: FloatArray): Int {
+        var bi = 0; var bv = a[0]
+        for (i in 1 until a.size) if (a[i] > bv) { bv = a[i]; bi = i }
+        return bi
+    }
+
+    private fun buildPrompt(u: String) =
+        "<start_of_turn>user\n$u<end_of_turn>\n<start_of_turn>model\n"
+
+    private fun decode(
+        runtime: OptimizedLLMRuntime<FP32>,
+        promptTokens: List<Int>,
+        maxNew: Int,
+        eos: Int,
+        eot: Int,
+    ): List<Int> {
+        runtime.reset()
+        var logits = FloatArray(0)
+        for (t in promptTokens) logits = runtime.forward(t).data.copyToFloatArray()
+        val gen = mutableListOf<Int>()
+        while (gen.size < maxNew) {
+            val next = argmax(logits)
+            gen.add(next)
+            if (next == eos || next == eot) break
+            logits = runtime.forward(next).data.copyToFloatArray()
+        }
+        return gen
+    }
+
+    @Test
+    fun q5kPackedMatchesFp32() = runBlocking {
+        Assumptions.assumeTrue(File(gguf).exists(), "FunctionGemma GGUF not present — skipping")
+
+        val ctx = DirectCpuExecutionContext.create()
+        val tokenizer = GGUFTokenizer.fromSource(SystemFileSystem.source(Path(gguf)).buffered())
+        val eot = tokenizer.encode("<end_of_turn>").single()
+        val eos = tokenizer.eosTokenId
+        val promptTokens =
+            listOf(tokenizer.bosTokenId) + tokenizer.encode(buildPrompt("Turn the light on.")).toList()
+        val maxNew = 12
+
+        // --- FP32 dequant-on-load baseline ---
+        val wFp32 = Gemma4WeightLoader(
+            randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+            quantPolicy = QuantPolicy.DEQUANTIZE_TO_FP32,
+        ).loadToMapStreaming<FP32, Float>(ctx, FP32::class)
+        val mFp32 = GemmaNetworkLoader.fromWeights(ctx, wFp32, FP32::class)
+        val rtFp32 = OptimizedLLMRuntime(
+            model = mFp32, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+            dtype = FP32::class, bos = tokenizer.bosTokenId,
+        )
+        val genFp32 = decode(rtFp32, promptTokens, maxNew, eos, eot)
+
+        // --- Q5_K packed in-kernel dequant path (NATIVE_OPTIMIZED + convert) ---
+        Arena.ofConfined().use { arena ->
+            val wNat = Gemma4WeightLoader(
+                randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+                quantPolicy = QuantPolicy.NATIVE_OPTIMIZED,
+            ).loadToMapStreaming<FP32, Float>(ctx, FP32::class)
+            val wConv = convertGemmaWeightsToMemSeg(wNat, ctx, arena)
+            @Suppress("UNCHECKED_CAST")
+            val mNat = GemmaNetworkLoader.fromWeights(
+                ctx, wConv as Gemma4Weights<FP32, Float>, FP32::class,
+            )
+            val rtNat = OptimizedLLMRuntime(
+                model = mNat, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+                dtype = FP32::class, bos = tokenizer.bosTokenId,
+            )
+
+            // Warm up with one decode (JIT + kernel-provider resolution), then time.
+            decode(rtNat, promptTokens, 2, eos, eot)
+            val t0 = System.nanoTime()
+            val genNat = decode(rtNat, promptTokens, maxNew, eos, eot)
+            val ms = (System.nanoTime() - t0) / 1e6
+            val toks = genNat.size + promptTokens.size
+
+            println("Q5K-packed gen=$genNat")
+            println("FP32-base  gen=$genFp32")
+            println("Q5K decoded='${tokenizer.decode(genNat.toIntArray()).replace("\n", "\\n")}'")
+            println(
+                "Q5K-packed throughput: $toks tok in ${"%.0f".format(ms)} ms " +
+                    "(${"%.2f".format(toks * 1000.0 / ms)} tok/s incl. prefill)",
+            )
+
+            assertEquals(genFp32, genNat, "Q5_K packed decode diverged from FP32 baseline")
+        }
+
+        // The wired path: GemmaNetworkLoader.load(NATIVE_OPTIMIZED) applies the
+        // commonMain convertGemmaWeightsPacked (the board path) — no MemSeg, no
+        // Arena. Must decode identically to the FP32 baseline too.
+        val mLoad = GemmaNetworkLoader.fromGguf(
+            randomAccessProvider = { JvmRandomAccessSource.open(gguf) },
+            quantPolicy = QuantPolicy.NATIVE_OPTIMIZED,
+        ).load<FP32, Float>(ctx)
+        val rtLoad = OptimizedLLMRuntime(
+            model = mLoad, ctx = ctx, mode = OptimizedLLMMode.DIRECT,
+            dtype = FP32::class, bos = tokenizer.bosTokenId,
+        )
+        val genLoad = decode(rtLoad, promptTokens, maxNew, eos, eot)
+        println("load(NATIVE_OPTIMIZED) gen=$genLoad")
+        assertEquals(genFp32, genLoad, "load(NATIVE_OPTIMIZED) packed decode diverged from FP32 baseline")
+    }
+}
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
index 227fb351..59ddc216 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaBakeIrpaTest.kt
@@ -35,7 +35,7 @@ import kotlin.test.Test
 class RealGemmaBakeIrpaTest {
     @Test
     fun bakeRealGemmaToIrpa() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
index cbd6ebf8..af3c5e01 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaDequantDumpTest.kt
@@ -17,7 +17,7 @@ import kotlin.test.Test
 class RealGemmaDequantDumpTest {
     @Test
     fun dumpDequant() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
index 3bfccce1..f0037477 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaEagerAbTest.kt
@@ -24,7 +24,7 @@ import kotlin.test.Test
 class RealGemmaEagerAbTest {
     @Test
     fun eagerLogits() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
index f90bda23..019dcd86 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaExternalParamTest.kt
@@ -32,7 +32,7 @@ import kotlin.test.Test
 class RealGemmaExternalParamTest {
     @Test
     fun externalizeRealGemmaWeights() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val weights = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },
diff --git a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
index 28952531..2905da6c 100644
--- a/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
+++ b/llm-inference/gemma/src/jvmTest/kotlin/sk/ainet/models/gemma/RealGemmaLoadTest.kt
@@ -21,7 +21,7 @@ import kotlin.test.Test
 class RealGemmaLoadTest {
     @Test
     fun loadFunctionGemmaWeights() = runBlocking {
-        val path = "/home/miso/projects/coral/sl2610-voice-cc-kt/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
+        val path = "/home/miso/projects/coral/SKaiNET-embedded/sl2610-function-calling/models/functiongemma-physical-ai-v10-Q5_K_M.gguf"
         val ctx = DirectCpuExecutionContext.create()
         val loader = Gemma4WeightLoader(
             randomAccessProvider = { JvmRandomAccessSource.open(path) },