Feature/data loader api #785

Merged
michalharakal merged 19 commits into develop from feature/data-loader-api
Jun 29, 2026

@michalharakal

Contributor

No description provided.

michalharakal and others added 19 commits June 28, 2026 21:15
Always-on accumulating profiler (quant-NEON / fp32-scalar / generic) on the
DefaultCpuOps.matmul dispatch, read via KernelProfile.report(). Clock read per
call is negligible next to a matmul. Used to localize native board decode cost:
showed 100% of matmul time is the quant-NEON path (fp32-scalar/generic never hit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…matmul on A55

Two changes to skainet_q4k_matmul, both validated against the Panama
reference (aggregate RMS gate, AGG_REL_TOL=0.03) and on-board generation:

1. Loop order block-OUTER / output-row-INNER. The weight is packed
   block-major (blockIdx*outputDim + o)*144, so for a fixed block
   consecutive `o` are exactly 144 bytes apart — weight bytes are now read
   strictly sequentially (prefetch/cache-line friendly). The previous
   o-outer order strided outputDim*144 (~295 KB on the down-proj) per step,
   making every weight read a cold miss on the in-order A55 with small
   caches. out_base[o] accumulates across blocks (stays hot in cache);
   accumulation order is unchanged so the result is numerically identical.

2. ggml-style Q8 activation quantization + integer vdotq_s32 dot path
   (asimddp), input row quantized once per 256-block and reused across all
   output rows; scalar integer fallback when dotprod is absent.
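
Both changes can be illustrated with a small portable C sketch. It is not the `skainet_q4k_matmul` source. The helper names are hypothetical, the Q8 quantizer is a simplified symmetric variant (scale = max|x| / 127) rather than ggml's exact routine, and the integer dot product shows only the scalar fallback shape, not the `vdotq_s32` path. The value 2048 used for `outputDim` in the usage note is an assumption inferred from the ~295 KB stride quoted above.

```c
#include <stddef.h>
#include <stdint.h>

#define QK 256           /* activations per block ("256-block" above) */
#define BLOCK_BYTES 144  /* bytes per packed weight block, per the text */

/* Byte offset of the weight block for (blockIdx, o) in the block-major layout. */
static size_t weight_offset(size_t block_idx, size_t o, size_t output_dim) {
    return (block_idx * output_dim + o) * BLOCK_BYTES;
}

/* Distance in bytes between weight blocks visited by consecutive inner-loop
 * iterations: block-outer steps o, row-outer steps blockIdx. */
static size_t inner_step_bytes(size_t output_dim, int block_outer) {
    size_t base = weight_offset(0, 0, output_dim);
    return block_outer ? weight_offset(0, 1, output_dim) - base
                       : weight_offset(1, 0, output_dim) - base;
}

/* Simplified symmetric Q8 quantization of one 256-element block.
 * Writes q[i] = round(x[i] / d) clamped to [-127, 127]; returns the scale d. */
static float quantize_block_q8(const float *x, int8_t *q) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    float d = amax / 127.0f;
    float inv = d > 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < QK; i++) {
        float v = x[i] * inv;
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round half away from zero */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return d;
}

/* Scalar integer dot product with 32-bit accumulation (fallback shape). */
static int32_t dot_i8(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

With `outputDim` assumed to be 2048, `inner_step_bytes(2048, 1)` gives 144 bytes (sequential reads) while `inner_step_bytes(2048, 0)` gives 294912 bytes, matching the ~295 KB stride of the old order. In the block-outer loop, `quantize_block_q8` would run once per block and its `int8_t` output would be reused by the dot product for every output row.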

On the SL2619 (Cortex-A55, TinyLlama Q4_K_M, 8-tok decode), Q4_K matmul
dropped 41730 ms -> 20133 ms (2.07x); end-to-end decode 0.123 -> 0.184
tok/s (1.50x, matmul being ~64% of decode). The loop reorder is the
dominant lever — the Q8 dot alone showed no gain because the kernel was
memory-stall-bound, not compute-bound.
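
As a rough consistency check on these figures, Amdahl's law can be applied under two assumptions that are not stated in the commit: the ~64% matmul share refers to the pre-change decode time, and the non-matmul portion is unchanged.

```latex
\begin{align*}
S_{\text{matmul}} &= \frac{41730\ \text{ms}}{20133\ \text{ms}} \approx 2.07 \\
S_{\text{decode}} &\approx \frac{1}{(1 - 0.64) + \dfrac{0.64}{2.07}}
                   = \frac{1}{0.36 + 0.309} \approx 1.49 \\
\frac{0.184\ \text{tok/s}}{0.123\ \text{tok/s}} &\approx 1.50
\end{align*}
```

The predicted ~1.49x agrees with the measured ~1.50x, so the end-to-end gain is accounted for by the matmul speedup alone.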

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit d6bdc34 into develop Jun 29, 2026
9 checks passed
@michalharakal michalharakal deleted the feature/data-loader-api branch June 29, 2026 21:22

1 participant