For architecture details see ARCHITECTURE.md.
SKaiNET is a Kotlin Multiplatform AI framework. New here? Choose the path that matches what you want to try first.
| Goal | Start here | Time |
|---|---|---|
| Run tensor operations | Quickstart (below) | 2–5 min |
| Build and train a neural net | Hello Neural Net (below) | 5 min |
| Run a local GGUF model | SKaiNET Transformers starter | 5 min after model setup |
| Export a secure MCU bundle | Minerva getting started | 10 min without firmware flashing |
Working in Java? SKaiNET ships first-class Java support — see the Java getting-started guide.
Use the version shown in this README as the source of truth for first-run snippets. If another page shows a different version, please open an issue or PR.
Add the core dependencies (Gradle Kotlin DSL):
dependencies {
    // Recommended: import the umbrella BOM and drop versions on the engine modules.
    implementation(platform("sk.ainet:skainet-bom:0.33.0"))
    implementation("sk.ainet.core:skainet-lang-core")
    implementation("sk.ainet.core:skainet-backend-cpu")
}

val model = nn {
    input(28 * 28)
    dense(out = 128)
    relu()
    dense(out = 10)
}

val a = tensor(shape(2, 2)) { float(1f, 2f, 3f, 4f) }
val b = tensor(shape(2, 2)) { float(5f, 6f, 7f, 8f) }
val c = a matMul b
val d = c.relu()

// Recommended: streaming reader (memory-efficient, supports quantized types)
val source = JvmRandomAccessSource.open("model.gguf")
StreamingGGUFReader.open(source).use { reader ->
    println("Tensors: ${reader.tensorCount}")
    // Load specific tensor on demand (no whole-file loading)
    val bytes = reader.loadTensor("token_embd.weight")
    // Or get a TensorStorage descriptor with encoding/placement metadata
    val storage = reader.loadTensorStorage("token_embd.weight")
}

More examples: SKaiNET-examples | SKaiNET-notebook
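To make the tensor snippet above concrete (a matMul b, then relu()), here is a plain-Kotlin reference that computes the same 2×2 product by hand. It is only a sketch: it uses no SKaiNET types, the helper names matMul and relu are local to this example, and it assumes the four values fill each tensor in row-major order.

```kotlin
// Plain-Kotlin reference for the 2x2 example; no SKaiNET types are used here.
// Assumption: the values fill each matrix in row-major order.

/** Multiplies two matrices stored as lists of rows. */
fun matMul(a: List<List<Float>>, b: List<List<Float>>): List<List<Float>> {
    require(a.isNotEmpty() && b.isNotEmpty()) { "matrices must be non-empty" }
    require(a[0].size == b.size) { "inner dimensions must match" }
    return a.map { row ->
        b[0].indices.map { col ->
            // Dot product of one row of a with one column of b.
            row.indices.fold(0f) { acc, k -> acc + row[k] * b[k][col] }
        }
    }
}

/** Element-wise ReLU: negative entries become zero. */
fun relu(m: List<List<Float>>): List<List<Float>> =
    m.map { row -> row.map { maxOf(0f, it) } }

fun main() {
    val a = listOf(listOf(1f, 2f), listOf(3f, 4f))
    val b = listOf(listOf(5f, 6f), listOf(7f, 8f))
    val c = matMul(a, b)
    val d = relu(c)
    println(c) // [[19.0, 22.0], [43.0, 50.0]]
    println(d) // same values, because every entry is already positive
}
```

Under the row-major assumption, ReLU leaves the product unchanged here because none of its entries is negative.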
SKaiNET is a modular ecosystem. While this repository contains the core engine, specialized high-level libraries are maintained in standalone repositories:
| Project | Description |
|---|---|
| SKaiNET-transformers | Pre-built transformer architectures and layers |
| SKaiNET-examples | Sample projects and integration demos |
| Goal | Start here |
|---|---|
| Examples and sample projects | SKaiNET-examples |
| Interactive notebooks | SKaiNET-notebook |
| Eager backends & kernels (what runs where) | Backends & kernels mindmap |
| Design proposals and long-lived API decisions | SKEEP proposals |
Small fixes can go straight through the normal contribution flow described in CONTRIBUTING.md and GITFLOW.adoc.
Use a SKEEP when a change affects public APIs, DSL syntax, tensor semantics,
compiler/runtime integration, storage behavior, compatibility policy, or other
decisions that need a durable design record. SKEEP files live under
docs/modules/skeep/pages/ and use three-digit numbering, starting with
001.
SKaiNET ships an official Phoronix-Test-Suite-compatible benchmark
program for the compute engine. See the
methodology and replay docs,
the release manifest, and the
CI workflow. Smoke runs fire
on every PR via ubuntu-latest; full publishable runs fire on a
self-hosted Linux x86 runner on release.
Quick local replay:
./gradlew :skainet-backends:benchmarks:jvm-cpu-publish:shadowJar
./scripts/run_engine_smoke.sh

SKaiNET is built around one path: a model is defined once in the Kotlin DSL, then either compiled or executed eagerly, without rewriting it.
- Define the model with the DSL (nn { } / dag { }).
- Capture it as a tape (traced execution) or a DAG (explicit graph): a ComputeGraph.
- Run it one of two ways:
  - Compile: lower the captured ComputeGraph through one of several sibling code-generation backends, each emitting code for a different target from the same graph:
    - StableHLO / MLIR (HloGenerator) → IREE-compilable, for native / edge / accelerator targets and the wider MLIR ecosystem.
    - Arduino / C99 → standalone, statically-allocated C for microcontrollers.
    - Minerva → a secure-MCU bundle (weights + firmware skeleton + fingerprinted manifest).
  - Eager: execute directly on an available backend. On the JVM this is the primary, go-to path.
StableHLO/MLIR is therefore one code-generation backend among siblings — the IREE/native path next to the C99/Arduino and Minerva MCU paths — not a separate pipeline.
flowchart LR
DSL["Model — Kotlin DSL"] --> Graph["Tape / DAG (ComputeGraph)"]
Graph --> Eager["Eager backend (JVM, …)"]
Graph -->|code generation| HLO["StableHLO / MLIR"]
Graph -->|code generation| C99["Arduino / C99"]
Graph -->|code generation| Minerva["Minerva"]
HLO --> Native["IREE → native / edge / accelerator"]
C99 --> MCU["Microcontroller"]
Minerva --> SecMCU["Secure-MCU bundle"]
The same DSL model feeds every path: eager execution for development and JVM deployment, and the code-generation backends — StableHLO/MLIR (→ IREE), Arduino/C99, and Minerva — as sibling alternatives for native, edge, and secure-MCU targets.
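The "define once, then run eagerly or generate code" idea can be illustrated with a toy expression graph. This is a hedged sketch, not SKaiNET's implementation: Node, evalEager, and emitC are hypothetical names invented for this example, and the real ComputeGraph and code-generation backends are far richer.

```kotlin
// Toy illustration only: these types are hypothetical and are not SKaiNET's ComputeGraph API.

/** A tiny expression graph: the single definition that both paths consume. */
sealed interface Node
data class Input(val name: String) : Node
data class Const(val value: Double) : Node
data class Add(val left: Node, val right: Node) : Node
data class Mul(val left: Node, val right: Node) : Node
data class Relu(val arg: Node) : Node

/** Eager path: evaluate the graph directly against concrete inputs. */
fun evalEager(node: Node, inputs: Map<String, Double>): Double = when (node) {
    is Input -> inputs.getValue(node.name)
    is Const -> node.value
    is Add -> evalEager(node.left, inputs) + evalEager(node.right, inputs)
    is Mul -> evalEager(node.left, inputs) * evalEager(node.right, inputs)
    is Relu -> maxOf(0.0, evalEager(node.arg, inputs))
}

/** Code-generation path: emit a C-like expression from the same graph. */
fun emitC(node: Node): String = when (node) {
    is Input -> node.name
    is Const -> node.value.toString()
    is Add -> "(${emitC(node.left)} + ${emitC(node.right)})"
    is Mul -> "(${emitC(node.left)} * ${emitC(node.right)})"
    is Relu -> "fmax(0.0, ${emitC(node.arg)})"
}

fun main() {
    // y = relu(2 * x - 3), defined once and consumed by both paths.
    val model = Relu(Add(Mul(Const(2.0), Input("x")), Const(-3.0)))
    println(evalEager(model, mapOf("x" to 4.0)))
    println(emitC(model))
}
```

Both functions walk the same graph value, so the model definition never has to be rewritten when switching between direct execution and emitted code.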
SKaiNET now includes a Minerva export backend for secure MCU deployment. It is a sibling to StableHLO and Arduino/C99 export: it starts from a supported ComputeGraph, lowers static MLPs to a Minerva compiler input, invokes libminerva when configured, and packages generated weights, host fixtures, firmware skeletons, and a fingerprinted manifest.json.
Start here:
- Minerva getting started — run the maintained tiny MLP dry sample, then the real libminerva runtime profile.
- Minerva export how-to — configure compiler paths, keys, calibration, CMake/CTest host verification, and troubleshooting.
- How Minerva secure MCU export fits — understand why Minerva is not an Arduino replacement and when to choose StableHLO instead.
Runnable examples:
./gradlew :skainet-compile:skainet-compile-minerva:runMinervaSecureMcuExamples
./gradlew :skainet-compile:skainet-compile-minerva:runMinervaSecureMcuExamples \
    -Pminerva.example=sensor-classifier

- Targets: JVM, macOS (Native), JS, WASM (Browser + WasmWasi)
- Single codebase shared across all platforms via Kotlin Multiplatform
- ComputeGraphExecutor: Optimized engine with fusion passes and trace-to-DAG bridging.
- SDPA & Gather: High-performance Scaled Dot-Product Attention and indexing operations.
- TurboQuant: Runtime KV-cache compression (~8x at 4-bit) for long-context LLM inference. Presets: safe-lowbit, balanced, experimental-max. See TurboQuantUsage for the integration guide.
- Sequential: nn { input(); dense(); relu(); dense() }
- DAG / Graph: arbitrary wiring with dag { } for ResNet, YOLO-style architectures
- Layers: Dense, Conv1d/2d/3d, MaxPool, AvgPool, BatchNorm, Dropout, LeakyReLU, ELU
- KAN (Kolmogorov–Arnold Networks) layer (experimental)
- Autograd engine with reverse-mode gradients, SGD and Adam/AdamW optimizers
- Built-in loaders: MNIST, Fashion-MNIST, CIFAR-10
- URI-backed data sources: file://, https://, hf+https://, and hf://...
- Dataset operations: deterministic shuffle/split, stratified split, filter/map/transform views, batch flows, and epoch flows
- Raw dataset parsers: CSV, TSV, JSON arrays/objects, JSON Lines (.jsonl, .ndjson)
- Type-safe transform DSLs: image/tensor transforms plus suspendable raw data pipelines
- Formats: GGUF, ONNX, SafeTensors, JSON, Image (JPEG, PNG)
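The deterministic shuffle/split and stratified split operations listed above can be sketched generically with a seeded random generator. This is an illustration of the technique only, not SKaiNET's dataset API: deterministicSplit and stratifiedSplit are hypothetical helpers, and the seed and fraction values are arbitrary.

```kotlin
import kotlin.random.Random

// Generic sketch, not SKaiNET's dataset API: a seeded shuffle followed by a fixed-ratio split.

/** Shuffles with a seeded generator, then splits; the same seed always yields the same split. */
fun <T> deterministicSplit(items: List<T>, trainFraction: Double, seed: Int): Pair<List<T>, List<T>> {
    require(trainFraction in 0.0..1.0) { "trainFraction must be within [0, 1]" }
    val shuffled = items.shuffled(Random(seed))
    val trainSize = (items.size * trainFraction).toInt()
    return shuffled.take(trainSize) to shuffled.drop(trainSize)
}

/** Stratified variant: split each label group separately so class proportions are preserved. */
fun <T, L> stratifiedSplit(
    items: List<T>,
    trainFraction: Double,
    seed: Int,
    labelOf: (T) -> L
): Pair<List<T>, List<T>> {
    val parts = items.groupBy(labelOf).values.map { deterministicSplit(it, trainFraction, seed) }
    return parts.flatMap { it.first } to parts.flatMap { it.second }
}

fun main() {
    // Ten rows, five per label.
    val rows = (1..10).map { it to if (it % 2 == 0) "even" else "odd" }
    val (train, test) = stratifiedSplit(rows, 0.8, seed = 42) { it.second }
    println("train=${train.size}, test=${test.size}")
}
```

Splitting per label group keeps each class at the requested ratio, which a single global shuffle does not guarantee for small or imbalanced datasets.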
val raw = JvmDataSourceResolver().rawDataset {
    from("hf://datasets/org/repo@main/train.jsonl")
    format(DataFormat.JSON_LINES)
    cachePolicy(CachePolicy.Use)
}
val withoutLabel = dataPipeline<RawDataset>()
    .stage(
        dataTransformer(
            name = "drop-label",
            outputSchema = { schema -> DataSchema(schema.columns - "label") }
        ) { dataset ->
            val columns = dataset.schema.columns - "label"
            dataset.copy(
                schema = DataSchema(columns),
                rows = dataset.rows.map { row ->
                    RawDataRow(row.values.filterKeys { key -> key in columns })
                }
            )
        }
    )
    .execute(raw)

- Export trained models to standalone, optimized C99 with static memory allocation
- Ready-to-use Arduino library output
- Export supported static MLP graphs to Minerva project bundles for secure MCU inference
- Emits compiler NPZ input, libminerva weights, a fingerprinted manifest, host harness, firmware example, and host verification results
- Start with the Minerva getting started guide
- Lower Kotlin DSL to MLIR StableHLO dialect
- Optimization passes: constant folding, operation fusion, dead code elimination
- Valid IREE-compilable output with streaming API and public HloGenerator
- Use StableHLO when you want portable MLIR/IREE-compatible graphs for native, accelerator, or ecosystem compiler flows.
- Use Arduino / C99 export when you want standalone generated C with static memory allocation and no external secure runtime.
- Use Minerva export when you need a secure MCU project bundle that goes through libminerva packaging and host verification.
- GRU, the first recurrent layer. nn.Gru ([B,S,D] -> [B,S,H], PyTorch gate order) is composed from existing primitives and unrolled over the static sequence at trace time, so it runs eagerly, trains through the standard tape, and exports to StableHLO with no dedicated converter. Plus a gru(…) network-DSL builder. (PR #772, issue #217)
- upsample2d Bilinear + StableHLO export for both Nearest and Bilinear: everything lowers to fixed reshape/broadcast/dot_general (no custom_call), unblocking resize/FPN-style export. (PR #771)
- Autodiff correctness + coverage. Fixes a silent gradient-drop for elu/leakyRelu/permute (backward rules existed but were never wired into the trace dispatch), makes cos/sin/tril/gather/indexSelect/unfold/convTranspose1d differentiable, and adds a KSP-generated coverage guard so a differentiable op can no longer ship without a wired backward. (PR #774)
- Norms compile on stock IREE. layerNorm/rmsNorm/batchNorm now lower to real stablehlo.reduce instead of export-only custom_calls. (PR #769)
- Breaking: TensorOps.sin/cos/convTranspose1d are now abstract; backends implementing TensorOps directly must override them (both bundled backends already do).
- Streaming detokenization keeps word spaces (Tokenizer.decodeToken). Decoding generated tokens one at a time no longer runs words together ("the process" → "theprocess"). The new decodeToken(id) keeps each SentencePiece piece's leading space (llama.cpp token_to_piece semantics); decode(IntArray) still strips the single sequence-leading space as before.
- Graph-output pruning for export (ComputeGraph.prunedToOutputs). Trims a traced decoder's StableHLO/IREE export to just the designated outputs (e.g. the logits), eliminating the dozens of dangling per-layer tensors and dead op subgraphs a full trace otherwise emits as func returns, via the new OutputDesignatedGraph (compile-dag) + prunedToOutputs (compile-opt) running DeadCodeEliminationPass. (PR #760)
- SDPA causal mask now emits a large finite fill (-1e30) instead of -inf, matching buildSlidingCausalMask and avoiding a -inf splat in the exported IR (numerically equivalent after softmax). (AttentionOperationsConverter)
- ExecutionContext.isRecording. A default-false flag (overridden by the graph/tape context) so a module with an eager fast-path that bypasses ops.* (e.g. RoPE's raw-array INTERLEAVED rotation) can detect tracing and emit a graph-traceable ctx.ops.* path instead, exporting to StableHLO while keeping the eager fast path. Backward-compatible. (PR #757)
- Docs: Antora version-currency + broken-link fixes across all pages (PR #758).
- Dependency: ch.qos.logback:logback-classic → 1.5.35 (#756).
- GroupNorm compiles on stock IREE. The 0.32.0 GroupNorm converter emitted @reduce_mean/@reduce_variance custom_calls that iree-compile can't lower; it now emits real stablehlo.reduce (variance as E[x²] − E[x]², ddof=0), like sum/mean/variance. Verified end-to-end through the skainet-iree-conformance harness (iree-compile + iree-run-module + numpy validate → PASS, max_abs_err = 1.2e-7). (PR #754)
- 0.32.0: GroupNorm StableHLO converter (#752), in which groupNorm lowers to real stablehlo.* ops; plus a SKEEP proposals docs module (#750), a quantization-process explanation (#747), and dependency bumps.
- 0.31.2: RowDequantSource + ops.gather row-dequant. A packed/oversized embedding (a Q-quantised token_embd) stays packed and is looked up via ops.gather, dequantising only the touched rows. (PR #741)
- 0.31.0: ops.transpose lazily handles every packed matmul dtype (Q8_0/Q4_0 added, completing the Q4_K/Q5_K/Q6_K/Q5_0/Q5_1/Q8_0/Q4_0 set); json-schema-validator → 3.0.4. (PRs #736, #737, #733)
- 0.30.0: First-class Q5_K packed in-kernel dequant-matmul across the CPU backends (Q5_KBlockTensorData + Q5KMatmulKernel SPI: scalar / Panama Vector / native-C), hand-written ARM NEON kernels (fp32/q8_0/q4k/q5k, -march=armv8.2-a+fp16+dotprod), and Kotlin/Native consumption of the C kernels via cinterop (skainet-backend-native-cpu static archive + linuxX64/linuxArm64 KernelProvider). (PR #734)
- 0.29.1: sk.ainet.core:skainet-compile-minerva now publishes to Maven Central (packaging fix for the Minerva export module shipped in 0.29.0).
- 0.29.0: Minerva secure-MCU export module, an end-to-end pipeline that lowers a SKaiNET model through shared graph-export contracts → Minerva IR → an .npz compiler input → a libminerva-packaged secure MCU project bundle, with host-side runtime verification and fingerprinted manifest artifacts (runnable sample, examples, ONNX workflow, getting-started docs). Plus packed-quant matmul kernels with Kotlin/Native parity (Q5_0/Q5_1/Q4_K/Q6_K: commonMain scalar + SPI, packed-quant dispatch in DefaultCpuOpsBase, Panama Vector for Q5_1/Q5_0 and Q6_K via the KernelRegistry), and an auto-generated, CI-gated kernel × platform support matrix. (PRs #697–#726)
- 0.28.1: Kotlin DSL → StableHLO → IREE is green end-to-end for the whole conformance suite (7/7 models, 27/27 ops compile to a vmfb). inferDagOutputSpecs now infers correct output shapes for shape-changing ops, and reduce_window (pooling) emits IREE's generic region form. (PRs #674, #676)
- 0.28.0: Four StableHLO export bugs fixed (reshape #666, concatenate #667, constants/reductions #663, HloGenerator tracing #668) plus non-JVM image runtime support (#671). (PRs #664, #670, #671)
- 0.27.0: A full gemma3 network lowers to StableHLO and compiles to an IREE vmfb (zero op gaps, verified by GemmaTraceTest). New scaledDotProductAttention (with causal + explicit additive mask), permute, narrow, and multi-output split converters, plus boxing-free FloatArray weight externalization for .irpa baking. (PRs #661 et al.)
- 0.26.0: Q4_0 promoted to a first-class quantized format across the provider stack, tanh as a first-class activation primitive, and a CPU tensor convert op, plus test/build/CI hygiene. (PRs #648–#651, #631, #636)
- 0.25.0: BF16 and Q8_0 matmul kernels end-to-end across the provider stack, autograd completeness for pow/log and the conv/pool/upsample/split family, the hybrid adaptive dtype-constraint DSL, the @DarcValidated operator-doc flag, and the SentencePiece special-token splitter. (PRs #595, #605–#628)
- 0.23.0: Real-model GGUFs no longer OOM at network construction (lazy TensorDataFactory.placeholder(...)); Kotlin/Native can finally load GGUFs over 2 GiB via the new POSIX-pread-backed PosixPreadRandomAccessSource. (Issues #587, #589; PRs #588, #591)
- 0.22.2: sk.ainet:skainet-bom now resolves from Maven Central (earlier versions shipped at the wrong coordinates). (Issue #584)
- 0.22.1: StreamingShardedSafeTensorsReader.loadTensorStorageMapped for zero-copy reads of multi-shard tensors above the 2 GB JVM ByteArray limit. (PR #582)
- 0.22.0: Native (FFM) CPU kernel provider with 4–6× faster Q4_K matmul and 1.5–1.8× FP32 SGEMM vs Panama Vector; auto-selected via KernelRegistry.bestAvailable(). (PR #571)
See CHANGELOG.md for the full release history.
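The release note on the SDPA causal mask says a large finite fill (-1e30) is numerically equivalent to -inf after softmax. The sketch below checks that claim in plain Kotlin. It is not SKaiNET code: softmax and maskRow are hypothetical helpers written for this example, and it uses Double where the exported IR may use a different element type.

```kotlin
import kotlin.math.exp

/** Numerically stable softmax over one row of attention scores. */
fun softmax(scores: DoubleArray): DoubleArray {
    val max = scores.maxOrNull() ?: return DoubleArray(0)
    val exps = DoubleArray(scores.size) { exp(scores[it] - max) }
    val sum = exps.sum()
    return DoubleArray(scores.size) { exps[it] / sum }
}

/** Applies a causal mask to one query row: positions after queryIndex receive the fill value. */
fun maskRow(scores: DoubleArray, queryIndex: Int, fill: Double): DoubleArray =
    DoubleArray(scores.size) { if (it <= queryIndex) scores[it] else fill }

fun main() {
    val scores = doubleArrayOf(0.5, 1.5, -0.25, 2.0)
    // Query position 1 may attend to positions 0 and 1 only.
    val withInf = softmax(maskRow(scores, 1, Double.NEGATIVE_INFINITY))
    val withFinite = softmax(maskRow(scores, 1, -1e30))
    println(withInf.contentEquals(withFinite))
    println(withFinite.toList())
}
```

Both fills underflow to exactly zero inside exp, so the masked positions receive zero probability either way. The two differ only when every position in a row is masked: -inf then produces NaN, whereas a finite fill produces a uniform row. A causal mask never hits that case because each query can always attend to itself.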
- Q1 2026: Comprehensive documentation ✅
- Q2 2026: TurboQuant KV-cache compression ✅ (shipped in 0.18.0); Qwen/LLaMA tokenizers ✅ (shipped in 0.20.0)
- Q3 2026: Agentic AI enhancements ✅ (tool calling shipped in 0.13.0; ongoing)
- Q4 2026: Federated learning support for multi-device training
We love contributions! Whether it's a new operator, documentation, or a bug fix:
- Read our Contribution Guide.
- Check the Good First Issues.
- Open a discussion or issue on GitHub.
Browse the full codebase documentation on DeepWiki.
- Dhia Chemingui (@dhiaspaner) — Android KMP plugin migration (#385, #386)
MIT — see LICENCE.
