Merged
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,39 @@ version line is kept in lock-step with the underlying SKaiNET engine
The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.32.0] — 2026-06-25

Brings the real-GGUF **Llama** eager path up to the Gemma standard (packed
`NATIVE_OPTIMIZED`) and **unblocks StableHLO/IREE export for Llama-family models**
(traceable interleaved RoPE). Ships against engine **0.32.0**.

### Added

- **Eager `NATIVE_OPTIMIZED` packed path for Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime` — new `LlamaQuantLayout`
+ `LlamaPackedWeights.convertLlamaWeightsPacked`, mirroring `convertGemmaWeightsPacked`. Coherent
output matching llama.cpp. This is the low-footprint path that real-GGUF Llama inference on
constrained ARM was missing. (ccbd87e)

### Changed

- **Fused decode-attention fast path.** `MultiHeadAttention`'s decode step (`seqQ == 1`) now computes
scores → softmax → GQA-weighted-V directly from the cached K/V, bypassing the `repeatKVHeads` concat
and the `unsqueeze → SDPA → squeeze → permute` chain — ~1.5× decode throughput, bit-identical output.
Prefill (`seqLen > 1`) keeps the general SDPA path. (3791f88)
- **Engine pin `skainet 0.31.0 → 0.32.0`.**
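
The fused decode step described above can be illustrated with a minimal sketch. It is not the project's `MultiHeadAttention` code: the class name, method name, and plain-array layout are hypothetical, chosen only to show how a single query token can attend over cached K/V with grouped-query attention by indexing the shared KV head directly, rather than materialising repeated KV heads first.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the SKaiNET implementation.
final class FusedDecodeAttentionSketch {

    /**
     * Single-token (decode) attention with grouped-query attention.
     * q:      [nHeads][headDim]            query for the one new token
     * kCache: [nKvHeads][seqLen][headDim]  cached keys
     * vCache: [nKvHeads][seqLen][headDim]  cached values
     * Returns [nHeads][headDim].
     */
    static float[][] decodeAttention(float[][] q, float[][][] kCache, float[][][] vCache) {
        int nHeads = q.length;
        int nKvHeads = kCache.length;
        int headDim = q[0].length;
        if (nHeads % nKvHeads != 0) {
            throw new IllegalArgumentException("nHeads must be a multiple of nKvHeads");
        }
        int group = nHeads / nKvHeads; // query heads sharing one KV head
        float scale = (float) (1.0 / Math.sqrt(headDim));
        float[][] out = new float[nHeads][headDim];

        for (int h = 0; h < nHeads; h++) {
            // Index the shared KV head instead of copying it once per query head.
            float[][] keys = kCache[h / group];
            float[][] values = vCache[h / group];
            int seqLen = keys.length;

            // Scaled dot-product scores against every cached position.
            float[] scores = new float[seqLen];
            float max = Float.NEGATIVE_INFINITY;
            for (int t = 0; t < seqLen; t++) {
                float dot = 0f;
                for (int d = 0; d < headDim; d++) {
                    dot += q[h][d] * keys[t][d];
                }
                scores[t] = dot * scale;
                max = Math.max(max, scores[t]);
            }

            // Numerically stable softmax.
            float sum = 0f;
            for (int t = 0; t < seqLen; t++) {
                scores[t] = (float) Math.exp(scores[t] - max);
                sum += scores[t];
            }

            // Weighted sum of the cached values.
            for (int t = 0; t < seqLen; t++) {
                float w = scores[t] / sum;
                for (int d = 0; d < headDim; d++) {
                    out[h][d] += w * values[t][d];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] q = {{0f, 0f}, {0f, 0f}};       // 2 query heads
        float[][][] k = {{{1f, 0f}, {0f, 1f}}};   // 1 KV head, 2 cached positions
        float[][][] v = {{{1f, 2f}, {3f, 4f}}};
        System.out.println(Arrays.toString(decodeAttention(q, k, v)[0]));
    }
}
```

With an all-zero query the softmax weights are uniform, so each head returns the mean of the cached values, which makes the sketch easy to check by hand.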

### Fixed

- **Packed token-embedding gather for Llama** — `fromGguf(NATIVE_OPTIMIZED)` no longer fails with
`gather: unsupported input rank 1`; the packed embedding is wired through the canonical loader. (ccbd87e)
- **Interleaved RoPE is now traceable.** In `INTERLEAVED` mode (Llama / Mistral / most GGUF) the rotation
used a raw float-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, baked the
rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
`iree-compile` (null-deref in constant folding) on the exported graph. `RoPE` now records the rotation
as tensor ops when running under the tracing wrapper; eager execution keeps the byte-identical raw-array
fast path. Unblocks Llama/Mistral/GGUF StableHLO/IREE export. (019b049)
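
For readers unfamiliar with the interleaved layout, the rotation itself can be sketched as follows. This shows the standard interleaved RoPE formulation on a plain `float[]`, under the assumption that adjacent pairs `(x[2i], x[2i+1])` are rotated by `position * base^(-2i/d)`; the class and method names are hypothetical, and the sketch shows only the math, not the project's `RoPE` class or its tracing wrapper.

```java
// Hypothetical illustration only; not the SKaiNET RoPE implementation.
final class InterleavedRopeSketch {

    /**
     * Applies interleaved rotary position embedding to one head vector.
     * Each adjacent pair (x[2i], x[2i+1]) is rotated by the angle
     * position * base^(-2i / d), where d is the vector length.
     */
    static float[] rotate(float[] x, int position, double base) {
        int d = x.length;
        if (d % 2 != 0) {
            throw new IllegalArgumentException("head dimension must be even");
        }
        float[] out = new float[d];
        for (int i = 0; i < d / 2; i++) {
            double theta = position * Math.pow(base, -2.0 * i / d);
            double cos = Math.cos(theta);
            double sin = Math.sin(theta);
            float a = x[2 * i];
            float b = x[2 * i + 1];
            // 2-D rotation of the pair (a, b) by theta.
            out[2 * i] = (float) (a * cos - b * sin);
            out[2 * i + 1] = (float) (a * sin + b * cos);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] rotated = rotate(new float[] {1f, 0f, 1f, 0f}, 3, 10000.0);
        System.out.println(java.util.Arrays.toString(rotated));
    }
}
```

A raw-array loop like this is opaque to a graph tracer, which is consistent with the failure mode described above: the tracer sees only the resulting values, not the operations that produced them from the projection output.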

## [0.31.1] — 2026-06-17

Adds **`transformer-core`** — the framework NN primitives (attention, the KV-cache family, embedding,
48 changes: 34 additions & 14 deletions README.md
@@ -103,25 +103,27 @@ Honest status — see the project-status note at the top of this README.

## Current release

The current release is **0.31.1** (against **SKaiNET 0.31.0**). It adds
**`transformer-core`** — the framework NN primitives (attention, KV-cache family,
embedding, norms, RoPE, FFNs, linear projection) extracted out of `llm-core` so they
build on the **full target matrix including `androidNative`** (32-bit + 64-bit ARM);
`llm-core` re-exports it, so nothing changes for existing consumers, and ARM-native
downstreams (e.g. on-device whisper) can reuse the primitives instead of reimplementing
them. The 0.31.0 highlights still apply: the eager `NATIVE_OPTIMIZED` Gemma path keeps the
**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
`maxInferenceLen` to cap the KV cache for constrained devices — together
dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
The current release is **0.32.0** (against **SKaiNET 0.32.0**). It brings the
real-GGUF **Llama** eager path up to the Gemma standard and **unblocks StableHLO/IREE
export for Llama-family models**:

- The eager **`NATIVE_OPTIMIZED` path now works for Llama** (`Q4_K`/`Q6_K`): weights stay
packed and `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED) + OptimizedLLMRuntime` decodes
coherently, matching llama.cpp — fixing the packed token-embedding
`gather: unsupported input rank 1`.
- **Fused decode-attention** (`seqQ == 1`) skips the `repeatKVHeads` concat + SDPA plumbing
for a faster decode loop (~1.5×), bit-identical output.
- **Interleaved RoPE is now traceable**, so Llama/Mistral/GGUF graphs export to StableHLO
(and `iree-compile` to a `vmfb`) instead of baking a disconnected constant.

The earlier `transformer-core` extraction (0.31.1) and the Gemma `NATIVE_OPTIMIZED`
footprint work (0.31.0) still apply.

The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.

```kotlin
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))

// Versions resolved from the BOM:
implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -199,6 +201,24 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n

See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.

## What's new in 0.32.0

- **Eager `NATIVE_OPTIMIZED` for real-GGUF Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
now keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime`, mirroring the
Gemma path (new `LlamaQuantLayout` + `LlamaPackedWeights.convertLlamaWeightsPacked`). Output is
coherent and matches llama.cpp; fixes the packed token-embedding `gather: unsupported input rank 1`.
This is the low-footprint path real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)
- **Fused decode-attention fast path.** For the decode step (`seqQ == 1`), `MultiHeadAttention` runs
scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the `repeatKVHeads` concat
and the `unsqueeze → SDPA → squeeze → permute` chain. ~1.5× decode throughput on the JVM eager path;
bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)
- **Traceable interleaved RoPE (graph export).** `RoPE` in `INTERLEAVED` mode (Llama / Mistral / most
GGUF) used a raw-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, recorded
the rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
`iree-compile` downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
wrapper; eager keeps the fast raw-array path byte-identical). Unblocks TinyLlama → StableHLO → IREE. (019b049)
- **Engine pin `skainet 0.31.0 → 0.32.0`.**

## What's new in 0.31.1

- **`transformer-core` module — NN primitives reusable on all targets incl. `androidNative`.** The
4 changes: 2 additions & 2 deletions docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))

implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
<dependency>
<groupId>sk.ainet.transformers</groupId>
<artifactId>skainet-transformers-bom</artifactId>
<version>0.31.1</version>
<version>0.32.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
2 changes: 1 addition & 1 deletion docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
[source,kotlin]
----
dependencies {
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))

implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
implementation("sk.ainet.transformers:skainet-transformers-agent")
2 changes: 1 addition & 1 deletion gradle.properties
@@ -1,5 +1,5 @@
GROUP=sk.ainet.transformers
VERSION_NAME=0.31.1
VERSION_NAME=0.32.0

POM_DESCRIPTION=SKaiNET-transformers

2 changes: 1 addition & 1 deletion gradle/libs.versions.toml
@@ -1,5 +1,5 @@
[versions]
skainet = "0.31.0"
skainet = "0.32.0"
agp = "9.2.1"
jacksonDatabind = "2.22.0"
jsonSchemaValidator = "3.0.4"