What
A small additive API on the a2_fast path that lets the layer stack of a Channels=3 (nano / A2-Lite) model be split into contiguous ranges and run on separate instances, so the network can be pipelined across cores or threads. process() is left untouched.
Why
I've been running an A2-Lite capture in real time on an RP2350 (Arm Cortex-M33 dual core, single-precision FPU, no MVE). The a2_fast path is what makes it feasible in the first place — thanks for it.
On a single Cortex-M33 core at 300 MHz it lands around 8,400 cycles/sample, about 134% of the 6,250-cycle budget at 48 kHz, so it doesn't quite fit one core. Splitting the 23 layers across the two cores (front [0,K) on one core, back [K,23) + head on the other, pipelined at block granularity) brings it to ~4,533 cycles/sample: 1.85x over single-core a2_fast, ~73% CPU at 300 MHz, and real time without the unstable 400 MHz overclock.
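For anyone checking the percentages, here is a small sketch of the arithmetic. The helper names are made up for this note and are not part of the engine; the only inputs are the clock, the sample rate, and the two measured cycle counts quoted above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helpers for illustration only; not part of a2_fast.
constexpr double cycles_per_sample_budget(double clock_hz, double sample_rate_hz) {
    return clock_hz / sample_rate_hz;  // cycles available per audio sample
}

constexpr double cpu_load(double measured_cycles_per_sample, double budget) {
    return measured_cycles_per_sample / budget;  // 1.0 means exactly real time
}

// Prints the budget, both loads, and the speedup from the measured figures.
inline void print_budget_summary() {
    constexpr double budget = cycles_per_sample_budget(300e6, 48000.0);
    constexpr double single_core = 8400.0;  // measured, single-core a2_fast
    constexpr double dual_core = 4533.0;    // measured, two-core pipeline
    assert(budget > 0.0);
    std::printf("budget      : %.0f cycles/sample\n", budget);
    std::printf("single core : %.1f%% of budget\n", 100.0 * cpu_load(single_core, budget));
    std::printf("dual core   : %.1f%% of budget\n", 100.0 * cpu_load(dual_core, budget));
    std::printf("speedup     : %.2fx\n", single_core / dual_core);
}
```

Calling `print_budget_summary()` from a host-side `main` is enough to reproduce the figures; nothing here touches the device.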
To do that split cleanly I needed to run a contiguous layer range on each instance and pass the residual / head-accumulator / conditioning buffers between them, with head accumulation decoupled so each instance owns its own head ring.
Proposed API (additive; process() unchanged)
// Buffers are caller-owned. Sizes (floats): residual, head = 3*num_frames; cond, out = num_frames.
void* partition_create(const std::vector<float>& weights, double sampleRate, int maxBufferSize);
void partition_destroy(void* handle);
// Chain head: rechannel raw input into residual + cond.
void partition_input(void* handle, const float* in, int num_frames, float* residual_out, float* cond_out);
// Run layers [begin, end); accumulate head into head_io, advance residual_io.
void partition_layers(void* handle, int begin, int end, int num_frames, const float* cond, float* residual_io, float* head_io);
// Chain tail: run layers [begin, kNumLayers) then the head conv into out.
void partition_output(void* handle, int begin, int num_frames, const float* cond, const float* residual_in, const float* head_in, float* out);
// Single-threaded reference: split layers at `boundaries` across `segments`
// handles and run them in order. Bit-identical to one model's process().
void process_partitioned(const std::vector<void*>& segments, const std::vector<int>& boundaries, const float* in, float* out, int num_frames);
Correctness
The partitioned path is bit-exact against a single model's process(). I verify it host-side across several boundary sets ({12}, {14}, {8,16}, {6,12,18}) with max|err| = 0, and on-device via an output checksum matching the same engine built on the host. a2_fast is data-independent, so this holds for any A2-Lite weights.
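To show the shape of the host-side check, here is a minimal sketch built on a toy layer stack. Everything in it (`toy_layers`, `toy_process`, `max_abs_err`, the layer math) is a hypothetical stand-in, not the a2_fast implementation: the toy layers are stateless, whereas the real engine keeps per-layer state in each handle. It only illustrates the idea of running contiguous ranges in order over shared residual / head buffers and comparing against the unsplit run.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kToyLayers = 23;   // same layer count as quoted above
constexpr int kToyChannels = 3;  // Channels=3, as in A2-Lite

// Hypothetical toy layers [begin, end): accumulate into head_io, advance residual_io.
void toy_layers(int begin, int end, int num_frames, const float* cond,
                float* residual_io, float* head_io) {
    for (int layer = begin; layer < end; ++layer) {
        const float gain = 0.5f + 0.01f * static_cast<float>(layer);
        for (int i = 0; i < num_frames; ++i) {
            for (int c = 0; c < kToyChannels; ++c) {
                const int idx = i * kToyChannels + c;
                const float z = std::tanh(gain * residual_io[idx] + cond[i]);
                head_io[idx] += z;
                residual_io[idx] += 0.1f * z;
            }
        }
    }
}

// Runs the toy stack split at `boundaries` (empty means unsplit), one float out per frame.
std::vector<float> toy_process(const std::vector<float>& in,
                               const std::vector<int>& boundaries) {
    const int n = static_cast<int>(in.size());
    std::vector<float> residual(static_cast<std::size_t>(n) * kToyChannels);
    std::vector<float> head(residual.size(), 0.0f);
    const std::vector<float> cond(in);  // toy conditioning: the raw input
    for (int i = 0; i < n; ++i)
        for (int c = 0; c < kToyChannels; ++c)
            residual[i * kToyChannels + c] = in[i] * (1.0f + 0.25f * static_cast<float>(c));

    int begin = 0;
    for (int b : boundaries) {  // each range would be one handle / core in the real split
        toy_layers(begin, b, n, cond.data(), residual.data(), head.data());
        begin = b;
    }
    toy_layers(begin, kToyLayers, n, cond.data(), residual.data(), head.data());

    std::vector<float> out(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {  // toy "head conv": fixed-order channel sum
        const float* h = &head[static_cast<std::size_t>(i) * kToyChannels];
        out[i] = (h[0] + h[1]) + h[2];
    }
    return out;
}

// Largest absolute difference between two equally sized buffers.
float max_abs_err(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```

A check then compares `toy_process(in, {})` against `toy_process(in, {12})`, `{14}`, `{8,16}`, and `{6,12,18}` and expects `max_abs_err` to be exactly zero, since every split performs the same floating-point operations in the same order.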
Questions before a PR
- Is this something you'd want upstream, or better kept as a downstream extension?
- API shape: I used opaque void* handles to keep the header light, but a typed handle (A2FastModel<3>*) or a small class wrapping the segments might fit the codebase better.
- Header-only vs .cpp split, naming, and whether process_partitioned belongs in the engine or stays a test-only helper.
Happy to open a PR against whatever shape you prefer. Working firmware and the full benchmark write-up for context: https://github.com/oyama/pico-neural-amp-modeler-demo