Proposal: additive partition API on the a2_fast (nano/A2-Lite) path for multi-core inference

## What

A small additive API on the `a2_fast` path that lets the layer stack of a Channels=3 (nano / A2-Lite) model be split into contiguous ranges and run on separate instances, so the network can be pipelined across cores or threads. `process()` is left untouched.

## Why

I've been running an A2-Lite capture in real time on an __RP2350__ (Arm Cortex-M33 dual core, single-precision FPU, no MVE). The `a2_fast` path is what makes it feasible in the first place — thanks for it.

On a single Cortex-M33 core at 300 MHz it lands around 8,400 cycles/sample, about 134% of the 48 kHz budget (6,250 cycles/sample), so it doesn't quite fit one core. Splitting the 23 layers across the two cores (front `[0,K)` on one core, back `[K,23)` + head on the other, pipelined at block granularity) brings it to ~4,533 cycles/sample: 1.85x over single-core `a2_fast`, ~73% CPU at 300 MHz, and real time without the unstable 400 MHz overclock.
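For anyone checking the percentages, the budget arithmetic is just clock rate divided by sample rate. A minimal sketch follows; the helper names are illustrative only and do not exist in the engine.

```cpp
#include <cmath>

// Illustrative helpers (hypothetical names, not part of a2_fast).

// Cycles available per audio sample at a given core clock and sample rate.
constexpr double cycles_per_sample_budget(double clock_hz, double sample_rate_hz) {
    return clock_hz / sample_rate_hz;
}

// Fraction of one core's per-sample budget used by a measured cost.
constexpr double cpu_load(double cycles_per_sample, double clock_hz, double sample_rate_hz) {
    return cycles_per_sample / cycles_per_sample_budget(clock_hz, sample_rate_hz);
}

// Speedup of one measured cost over another.
constexpr double speedup(double baseline_cycles, double improved_cycles) {
    return baseline_cycles / improved_cycles;
}
```

With the figures above, `cycles_per_sample_budget(300e6, 48000.0)` is 6,250, `cpu_load(8400.0, 300e6, 48000.0)` is about 1.34, `cpu_load(4533.0, 300e6, 48000.0)` is about 0.73, and `speedup(8400.0, 4533.0)` is about 1.85.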

To do that split cleanly I needed to run a contiguous layer range on each instance and pass the residual / head-accumulator / conditioning buffers between them, with head accumulation decoupled so each instance owns its own head ring.

## Proposed API (additive; `process()` unchanged)

```cpp
// Buffers are caller-owned. Sizes (floats): residual, head = 3*num_frames; cond, out = num_frames.
void* partition_create(const std::vector<float>& weights, double sampleRate, int maxBufferSize);
void  partition_destroy(void* handle);

// Chain head: rechannel raw input into residual + cond.
void partition_input(void* handle, const float* in, int num_frames, float* residual_out, float* cond_out);

// Run layers [begin, end); accumulate head into head_io, advance residual_io.
void partition_layers(void* handle, int begin, int end, int num_frames, const float* cond, float* residual_io, float* head_io);

// Chain tail: run layers [begin, kNumLayers) then the head conv into out.
void partition_output(void* handle, int begin, int num_frames, const float* cond, const float* residual_in, const float* head_in, float* out);

// Single-threaded reference: split layers at `boundaries` across `segments`
// handles and run them in order. Bit-identical to one model's process().
void process_partitioned(const std::vector<void*>& segments, const std::vector<int>& boundaries, const float* in, float* out, int num_frames);
```
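To make the `boundaries` contract of `process_partitioned` concrete, here is a rough sketch of how split points could map to half-open layer ranges, one per segment. `boundaries_to_ranges` is a hypothetical helper, not part of the proposed API, and the validation rules (strictly increasing, strictly inside the stack) are my assumption about what a sensible contract would be.

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the proposed API: turn split points into
// half-open [begin, end) layer ranges, one range per segment handle.
std::vector<std::pair<int, int>> boundaries_to_ranges(const std::vector<int>& boundaries,
                                                      int num_layers) {
    std::vector<std::pair<int, int>> ranges;
    int begin = 0;
    for (int b : boundaries) {
        // Assumed contract: each boundary lies strictly inside the stack and
        // strictly after the previous one, so no segment is empty.
        if (b <= begin || b >= num_layers) {
            throw std::invalid_argument(
                "boundaries must be strictly increasing and inside (0, num_layers)");
        }
        ranges.emplace_back(begin, b);
        begin = b;
    }
    // The last segment runs to the end of the stack (and would own the head conv).
    ranges.emplace_back(begin, num_layers);
    return ranges;
}
```

Under these assumptions, boundaries `{8, 16}` on the 23-layer stack give three segments covering `[0,8)`, `[8,16)` and `[16,23)`, so `segments.size()` would be `boundaries.size() + 1`.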

## Correctness

The partitioned path is bit-exact against a single model's `process()`. I verify it host-side across several boundary sets ({12}, {14}, {8,16}, {6,12,18}) with max|err| = 0, and on-device via an output checksum matching the same engine built on the host. `a2_fast` is data-independent, so this holds for any A2-Lite weights.
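The shape of the host-side check can be sketched with a toy stand-in. Everything below is hypothetical: `toy_layers` is a stateless placeholder for a layer range, not the real `a2_fast` math, and it only shows how a split run is compared against an unsplit run to obtain max|err|.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kToyLayers = 23;  // same layer count as in the text; the layer math is a toy

// Toy stand-in for running layers [begin, end): advances residual in place
// and accumulates into head. Not the real engine.
void toy_layers(int begin, int end, std::vector<float>& residual, std::vector<float>& head) {
    for (int l = begin; l < end; ++l) {
        const float w = 0.5f + 0.01f * static_cast<float>(l);
        for (std::size_t i = 0; i < residual.size(); ++i) {
            const float y = std::tanh(w * residual[i]);
            head[i] += y;
            residual[i] += 0.25f * y;
        }
    }
}

// Run the toy stack split at `boundaries` and return max|err| against one unsplit pass.
float toy_max_abs_err(const std::vector<int>& boundaries, const std::vector<float>& input) {
    // Reference: all layers in a single call.
    std::vector<float> ref_res = input, ref_head(input.size(), 0.0f);
    toy_layers(0, kToyLayers, ref_res, ref_head);

    // Partitioned: same layers, handed over at each boundary.
    std::vector<float> res = input, head(input.size(), 0.0f);
    int begin = 0;
    for (int b : boundaries) {
        toy_layers(begin, b, res, head);
        begin = b;
    }
    toy_layers(begin, kToyLayers, res, head);

    float max_err = 0.0f;
    for (std::size_t i = 0; i < input.size(); ++i) {
        max_err = std::max(max_err, std::fabs(res[i] - ref_res[i]));
        max_err = std::max(max_err, std::fabs(head[i] - ref_head[i]));
    }
    return max_err;
}
```

The real check does the same comparison through `process()` and `process_partitioned`, where each handle also carries its own layer state; the toy omits that state entirely.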

## Questions before a PR

- Is this something you'd want upstream, or better kept as a downstream extension?
- API shape: I used opaque `void*` handles to keep the header light, but a typed handle (`A2FastModel<3>*`) or a small class wrapping the segments might fit the codebase better.
- Header-only vs .cpp split, naming, and whether `process_partitioned` belongs in the engine or stays a test-only helper.

Happy to open a PR against whatever shape you prefer. Working firmware and the full benchmark write-up for context: https://github.com/oyama/pico-neural-amp-modeler-demo

