Proposal: additive partition API on the a2_fast (nano/A2-Lite) path for multi-core inference #291

@oyama

What

A small additive API on the a2_fast path that lets the layer stack of a Channels=3 (nano / A2-Lite) model be split into contiguous ranges and run on separate instances, so the network can be pipelined across cores or threads. process() is left untouched.

Why

I've been running an A2-Lite capture in real time on an RP2350 (Arm Cortex-M33 dual core, single-precision FPU, no MVE). The a2_fast path is what makes it feasible in the first place — thanks for it.

On a single Cortex-M33 core at 300 MHz it lands around 8,400 cycles/sample, about 134% of the 48 kHz budget, so it doesn't quite fit one core. Splitting the 23 layers across the two cores (front [0,K) on one core, back [K,23) + head on the other, pipelined at block granularity) brings it to ~4,533 cycles/sample: 1.85x over single-core a2_fast, ~73% CPU at 300 MHz, real time without the unstable 400 MHz overclock.
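For readers checking the percentages, the budget arithmetic can be sketched as below. The helper names are hypothetical and exist only for this illustration; the inputs are the clock, sample rate, and cycle counts quoted above.

```cpp
#include <cassert>

// Hypothetical helpers, not part of a2_fast.
// Cycles available per audio sample at a given core clock and sample rate.
inline double cycles_per_sample_budget(double clock_hz, double sample_rate_hz) {
    assert(clock_hz > 0.0 && sample_rate_hz > 0.0);
    return clock_hz / sample_rate_hz;
}

// Fraction of the per-sample budget used by a measured cost (1.0 == 100%).
inline double cpu_load(double cycles_per_sample, double clock_hz, double sample_rate_hz) {
    return cycles_per_sample / cycles_per_sample_budget(clock_hz, sample_rate_hz);
}
```

At 300 MHz and 48 kHz the budget is 6,250 cycles/sample, so 8,400 cycles/sample is a load of about 1.34 and 4,533 cycles/sample is about 0.73.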

To do that split cleanly I needed to run a contiguous layer range on each instance and pass the residual / head-accumulator / conditioning buffers between them, with head accumulation decoupled so each instance owns its own head ring.

Proposed API (additive; process() unchanged)

// Buffers are caller-owned. Sizes (floats): residual, head = 3*num_frames; cond, out = num_frames.
void* partition_create(const std::vector<float>& weights, double sampleRate, int maxBufferSize);
void  partition_destroy(void* handle);

// Chain head: rechannel raw input into residual + cond.
void partition_input(void* handle, const float* in, int num_frames, float* residual_out, float* cond_out);

// Run layers [begin, end); accumulate head into head_io, advance residual_io.
void partition_layers(void* handle, int begin, int end, int num_frames, const float* cond, float* residual_io, float* head_io);

// Chain tail: run layers [begin, kNumLayers) then the head conv into out.
void partition_output(void* handle, int begin, int num_frames, const float* cond, const float* residual_in, const float* head_in, float* out);

// Single-threaded reference: split layers at `boundaries` across `segments`
// handles and run them in order. Bit-identical to one model's process().
void process_partitioned(const std::vector<void*>& segments, const std::vector<int>& boundaries, const float* in, float* out, int num_frames);
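To illustrate the range-splitting idea without depending on the real engine, here is a toy sketch. Everything in it is an assumption made for illustration: ToyInstance, toy_layers and toy_run are hypothetical names, the per-layer math is a placeholder rather than the a2_fast layer, and the buffers are single-channel instead of the 3*num_frames layout described above. It only shows that when each instance owns the state of the layers it runs and the residual/head buffers are handed on in order, a split run performs the same operations in the same order as an unsplit one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kToyLayers = 23;  // layer count quoted in the proposal

// Hypothetical toy instance: one float of history per layer.
// Each instance only ever touches the history of the layers it runs.
struct ToyInstance {
    std::array<float, kToyLayers> history{};
};

// Run layers [begin, end) in place on caller-owned buffers.
// Placeholder math, NOT the real a2_fast layer.
void toy_layers(ToyInstance& inst, int begin, int end, std::size_t n,
                const float* cond, float* residual_io, float* head_io) {
    for (int l = begin; l < end; ++l) {
        const float gain = 0.5f + 0.01f * static_cast<float>(l);
        float prev = inst.history[static_cast<std::size_t>(l)];
        for (std::size_t i = 0; i < n; ++i) {
            const float x = residual_io[i];
            const float y = gain * x + 0.25f * prev + 0.125f * cond[i];
            prev = x;                       // one-sample layer state
            head_io[i] += y;                // accumulate into the head buffer
            residual_io[i] = x + 0.5f * y;  // advance the residual
        }
        inst.history[static_cast<std::size_t>(l)] = prev;
    }
}

// Split the stack at `boundaries`, one instance per segment, run in order.
// The toy uses the raw input as both residual and conditioning.
std::vector<float> toy_run(std::vector<ToyInstance>& segments,
                           const std::vector<int>& boundaries,
                           const std::vector<float>& in) {
    assert(segments.size() == boundaries.size() + 1);
    const std::size_t n = in.size();
    std::vector<float> residual(in), head(n, 0.0f);
    int begin = 0;
    for (std::size_t s = 0; s < segments.size(); ++s) {
        const int end = (s < boundaries.size()) ? boundaries[s] : kToyLayers;
        toy_layers(segments[s], begin, end, n, in.data(), residual.data(), head.data());
        begin = end;
    }
    return head;
}
```

Calling toy_run with one instance and no boundaries gives the unsplit reference; calling it with two instances and boundaries {12} gives the split run. In a two-core setup the segments would run concurrently on consecutive blocks instead of back to back, which this single-threaded sketch does not attempt to show.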

Correctness

The partitioned path is bit-exact against a single model's process(). I verify it host-side across several boundary sets ({12}, {14}, {8,16}, {6,12,18}) with max|err| = 0, and on-device via an output checksum matching the same engine built on the host. a2_fast is data-independent, so this holds for any A2-Lite weights.
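A minimal sketch of the two kinds of check mentioned above, a maximum absolute error and an output checksum. The function names are hypothetical, and FNV-1a is an assumed stand-in: the checksum actually used on the device is not specified here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Largest absolute difference between two buffers of n floats.
float max_abs_err(const float* a, const float* b, std::size_t n) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// 32-bit FNV-1a over the raw bytes of a float buffer (assumed checksum choice).
// The result depends on byte order, so host and device must share endianness.
std::uint32_t fnv1a32(const float* data, std::size_t n) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (std::size_t i = 0; i < n * sizeof(float); ++i) {
        h ^= p[i];
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```

A byte-level checksum is slightly stricter than max|err| = 0: it also distinguishes +0.0f from -0.0f, which compare equal as floats.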

Questions before a PR

  • Is this something you'd want upstream, or better kept as a downstream extension?
  • API shape: I used opaque void* handles to keep the header light, but a typed handle (A2FastModel<3>*) or a small class wrapping the segments might fit the codebase better.
  • Header-only vs .cpp split, naming, and whether process_partitioned belongs in the engine or stays a test-only helper.

Happy to open a PR against whatever shape you prefer. Working firmware and the full benchmark write-up for context: https://github.com/oyama/pico-neural-amp-modeler-demo
