Feature/data loader api by michalharakal · Pull Request #785 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-29T20:53:40Z

No description provided.

Always-on accumulating profiler (quant-NEON / fp32-scalar / generic) on the DefaultCpuOps.matmul dispatch, read via KernelProfile.report(). The per-call clock read is negligible next to a matmul.

Used to localize native board decode cost: it showed that 100% of matmul time is spent in the quant-NEON path (fp32-scalar and generic are never hit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
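The accumulating-profiler idea in this commit message can be sketched roughly as follows. This is a minimal illustration in C, not the project's KernelProfile implementation: every name here (`matmul_path`, `profiled_dispatch`, `path_share`, `profile_report`, `demo_kernel`) is hypothetical, and the sketch assumes a POSIX monotonic clock and single-threaded use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical labels for the three buckets named in the commit message. */
typedef enum { PATH_QUANT_NEON, PATH_FP32_SCALAR, PATH_GENERIC, PATH_COUNT } matmul_path;

typedef struct {
    uint64_t calls; /* number of dispatches that took this path */
    uint64_t nanos; /* accumulated wall time spent in this path */
} path_stats;

static path_stats g_stats[PATH_COUNT];
static const char *const g_names[PATH_COUNT] = { "quant-NEON", "fp32-scalar", "generic" };

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef void (*kernel_fn)(void *ctx);

/* Wrap one kernel call: two clock reads and two additions per dispatch. */
static void profiled_dispatch(matmul_path path, kernel_fn fn, void *ctx) {
    assert((int)path >= 0 && (int)path < PATH_COUNT);
    uint64_t t0 = now_ns();
    fn(ctx);
    g_stats[path].nanos += now_ns() - t0;
    g_stats[path].calls += 1;
}

/* Fraction of all accumulated kernel time spent in one path (0 if nothing was timed). */
static double path_share(matmul_path path) {
    uint64_t total = 0;
    for (int i = 0; i < PATH_COUNT; ++i) total += g_stats[i].nanos;
    return total ? (double)g_stats[path].nanos / (double)total : 0.0;
}

static void profile_report(FILE *out) {
    for (int i = 0; i < PATH_COUNT; ++i)
        fprintf(out, "%-12s calls=%llu time=%.3f ms share=%.1f%%\n", g_names[i],
                (unsigned long long)g_stats[i].calls, (double)g_stats[i].nanos / 1e6,
                100.0 * path_share((matmul_path)i));
}

/* Stand-in workload so the sketch can be exercised; not a real matmul. */
static void demo_kernel(void *ctx) {
    volatile float acc = 0.0f;
    int n = *(int *)ctx;
    for (int i = 0; i < n; ++i) acc += (float)i;
}
```

Calling `profiled_dispatch(PATH_QUANT_NEON, demo_kernel, &n)` a few times and then `profile_report(stdout)` prints one line per path. A path that is never dispatched keeps zero calls and a zero share, which is the kind of evidence the commit message reports for fp32-scalar and generic. The global counters are not thread-safe in this sketch.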

…matmul on A55

Two changes to skainet_q4k_matmul, both validated against the Panama reference (aggregate RMS gate, AGG_REL_TOL=0.03) and on-board generation:

1. Loop order block-OUTER / output-row-INNER. The weight is packed block-major at byte offset (blockIdx*outputDim + o)*144, so for a fixed block consecutive `o` are exactly 144 bytes apart and weight bytes are now read strictly sequentially (prefetch/cache-line friendly). The previous o-outer order strided outputDim*144 bytes (~295 KB on the down-proj) per step, making every weight read a cold miss on the in-order A55 with small caches. out_base[o] accumulates across blocks (stays hot in cache); accumulation order is unchanged, so the result is numerically identical.

2. ggml-style Q8 activation quantization + integer vdotq_s32 dot path (asimddp). The input row is quantized once per 256-block and reused across all output rows, with a scalar integer fallback when dotprod is absent.

On the SL2619 (Cortex-A55, TinyLlama Q4_K_M, 8-tok decode), Q4_K matmul dropped 41730 ms -> 20133 ms (2.07x); end-to-end decode 0.123 -> 0.184 tok/s (1.50x, matmul being ~64% of decode). The loop reorder is the dominant lever: the Q8 dot alone showed no gain because the kernel was memory-stall-bound, not compute-bound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
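To make the loop-order change concrete, here is a minimal C sketch of the two traversal orders over a block-major weight buffer. It is not the skainet_q4k_matmul kernel: `toy_block_dot` is a deliberately simplified stand-in that reads only the first 128 bytes of each 144-byte block as unsigned 4-bit values and ignores the real Q4_K scales and minimums, and all function names are hypothetical. Only the offset formula (blockIdx*outputDim + o)*144 and the 256-element block size come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { QK = 256, BLOCK_BYTES = 144 }; /* elements per block, packed bytes per block */

/* Byte offset of the packed block for (block b, output row o) in the block-major layout. */
static size_t weight_offset(size_t b, size_t o, size_t output_dim) {
    return (b * output_dim + o) * BLOCK_BYTES;
}

/* Simplified stand-in for a Q4_K block dot product (assumption: no scales or mins). */
static float toy_block_dot(const uint8_t *blk, const float *x) {
    float acc = 0.0f;
    for (int i = 0; i < QK / 2; ++i) {
        acc += (float)(blk[i] & 0x0F) * x[2 * i];
        acc += (float)(blk[i] >> 4) * x[2 * i + 1];
    }
    return acc;
}

/* Previous order: output row outer. Each inner step jumps output_dim * 144 bytes. */
static void matmul_row_outer(const uint8_t *w, const float *x, float *out,
                             size_t n_blocks, size_t output_dim) {
    for (size_t o = 0; o < output_dim; ++o) {
        out[o] = 0.0f;
        for (size_t b = 0; b < n_blocks; ++b)
            out[o] += toy_block_dot(w + weight_offset(b, o, output_dim), x + b * QK);
    }
}

/* New order: block outer, output row inner. Weight reads advance 144 bytes per step. */
static void matmul_block_outer(const uint8_t *w, const float *x, float *out,
                               size_t n_blocks, size_t output_dim) {
    for (size_t o = 0; o < output_dim; ++o) out[o] = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        const float *xb = x + b * QK;                          /* reused for every row */
        const uint8_t *wb = w + weight_offset(b, 0, output_dim);
        for (size_t o = 0; o < output_dim; ++o)
            out[o] += toy_block_dot(wb + o * BLOCK_BYTES, xb); /* sequential weight bytes */
    }
}
```

In both functions each `out[o]` receives its block contributions in the same order (block 0, then 1, and so on), which is why the reorder does not change the result. As a consistency check on the quoted stride, and assuming an output dimension of 2048 (not stated in the commit message), `weight_offset(1, 0, 2048) - weight_offset(0, 0, 2048)` is 294912 bytes, which matches the ~295 KB figure, while the step between consecutive rows within one block is 144 bytes.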

michalharakal and others added 19 commits June 28, 2026 21:15

data: add dataset operation views

0366365

data: enrich data batch metadata

4d4cb01

data: add URI source contracts

a8642b9

data: materialize JVM source artifacts

18fdae7

data: route simple loaders through sources

27841eb

docs: explain data source URIs

75ac460

data: share source resolver core

130702f

data: stream source artifacts with kotlinx-io

40a1ab7

data: parameterize Hugging Face auth

35ad833

data: support indexed simple batches

de57f87

data: report unsupported loader targets

7074a65

data: add raw format parsers

34ff3fb

data: parse json raw datasets

a17266a

data: add data source dataset builder

5655c86

data: add suspend data pipeline DSL

b0eee98

docs: document data loader APIs

ce4640f

Merge origin/develop into feature/data-loader-api

1bc57a2

michalharakal merged commit d6bdc34 into develop Jun 29, 2026
9 checks passed

michalharakal deleted the feature/data-loader-api branch June 29, 2026 21:22

michalharakal mentioned this pull request Jun 29, 2026

Huggingface integration #317

Closed

michalharakal merged 19 commits into develop from feature/data-loader-api
