[Research] Streaming / incremental tape API for lowering large models
Component: skainet-lang tape · skainet-compile-hlo · Type: research / enhancement (NOT a correctness bug)
Problem
The current lowering pipeline is buffer-everything:
    DefaultGraphExecutionContext.tape(...).record { forward(...) }  // materialises the WHOLE tape
        .toComputeGraph(...)                                        // then builds the FULL graph in memory
Peak memory is therefore O(model size + activation graph) before StableHLO emission even starts —
the entire tape and the entire ComputeGraph co-exist in the heap. This is fine for small models
(Whisper-tiny.en records fine), but it does not scale to large forwards.
Motivation / evidence
Lowering a real multi-GB model pressures memory well before IREE is involved. Concretely, in the
skainet-iree-conformance harness a full TinyLlama-1.1B (22 layers, 1.1B params) forward buffers
the complete tape + graph; combined with weight materialisation this pushes peak heap past what a 1.1B
model should ever need. Even with an efficient weight path, the all-at-once tape/graph is an inherent
O(model) memory floor for lowering.
Research questions
- Emit-as-recorded: can nodes be streamed to the converter (or to disk / a chunked IR) as they are
recorded, instead of accumulating the full tape first?
- Chunk by subgraph / layer: for a repeated-block model (decoder layers), can the tape be lowered
block-by-block and stitched, so only one block's worth of nodes is live at once?
- Cross-chunk references: how to handle values that cross chunk boundaries (residual stream,
externalised weights, KV state) without re-materialising the whole graph.
- Interaction with output pruning (#760 prunedToOutputs): pruning today operates on the full
graph; a streaming design needs an equivalent that works on partial graphs.
- API shape: what does an incremental record/toComputeGraph look like, and can it stay
source-compatible with the current all-at-once API as the default?
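To make the emit-as-recorded and chunk-by-block questions concrete, here is a toy sketch. It is not SKaiNET code: Node, NodeSink, ChunkedEmitter and recordDecoder are hypothetical names invented for illustration, and the emitted text is a stand-in, not StableHLO. It only shows that if each block is lowered and dropped at its boundary, the number of live nodes is bounded by the block size, and that a value crossing the boundary (here the residual stream) can be carried as a bare id.

```kotlin
// Hypothetical toy types for illustration; none of these are SKaiNET APIs.
data class Node(val id: Int, val op: String, val inputs: List<Int>)

/** Receives nodes as they are recorded, instead of after the full tape exists. */
fun interface NodeSink {
    fun accept(node: Node)
}

/** Buffers one block of nodes, emits it at the boundary, then releases it. */
class ChunkedEmitter(private val out: Appendable) : NodeSink {
    private val live = ArrayList<Node>()
    var peakLive = 0
        private set
    var emitted = 0
        private set

    override fun accept(node: Node) {
        live.add(node)
        if (live.size > peakLive) peakLive = live.size
    }

    /** Block boundary: write out the buffered nodes and drop them. */
    fun endBlock() {
        for (n in live) {
            out.append("%${n.id} = ${n.op}(${n.inputs.joinToString { "%$it" }})\n")
        }
        emitted += live.size
        live.clear()
    }
}

/** Records a fake decoder: each layer is a chain of ops plus a residual add. */
fun recordDecoder(layers: Int, opsPerLayer: Int, sink: ChunkedEmitter) {
    var nextId = 0
    var residual = nextId++ // only this id survives across block boundaries
    sink.accept(Node(residual, "input", emptyList()))
    sink.endBlock()
    repeat(layers) {
        var h = residual
        repeat(opsPerLayer) {
            val id = nextId++
            sink.accept(Node(id, "op", listOf(h)))
            h = id
        }
        val sum = nextId++
        sink.accept(Node(sum, "add", listOf(residual, h)))
        residual = sum
        sink.endBlock()
    }
}

fun main() {
    val text = StringBuilder()
    val sink = ChunkedEmitter(text)
    recordDecoder(layers = 22, opsPerLayer = 10, sink = sink)
    println("emitted=${sink.emitted} peakLive=${sink.peakLive}")
}
```

In this toy, peakLive equals opsPerLayer + 1 regardless of the layer count, while the all-at-once equivalent would hold every node. The real problem is harder than the sketch suggests: the toy never needs to look back at an earlier node's shape or type, which a real converter presumably would, and that is exactly the cross-chunk question above.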
Scope
Research / spike — not a committed feature. Success criteria: a design note plus a prototype that
demonstrably reduces peak heap for a large (≥1B-param) forward versus the buffer-everything path, while
producing identical StableHLO.
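One possible way to measure the peak-heap criterion on the JVM, as a sketch only: peakHeapBytes is a hypothetical helper, not part of any existing harness, and the byte arrays in main stand in for tape/graph nodes. It uses the standard java.lang.management memory-pool beans. The result is the sum of per-pool peaks, which is an upper bound on true peak heap because pools can peak at different moments.

```kotlin
import java.lang.management.ManagementFactory
import java.lang.management.MemoryType

/**
 * Runs [block] and returns the sum of per-pool heap peaks recorded while it ran.
 * Hypothetical helper for illustration; an upper bound, not an exact peak.
 */
fun peakHeapBytes(block: () -> Unit): Long {
    val heapPools = ManagementFactory.getMemoryPoolMXBeans()
        .filter { it.type == MemoryType.HEAP }
    System.gc()                               // best effort: start from a quiet heap
    heapPools.forEach { it.resetPeakUsage() } // resets each pool's peak to current usage
    block()
    return heapPools.fold(0L) { acc, pool -> acc + (pool.peakUsage?.used ?: 0L) }
}

fun main() {
    val chunk = 1 shl 20 // 1 MiB stand-in for one block's worth of nodes
    val buffered = peakHeapBytes {
        val all = List(64) { ByteArray(chunk) } // everything live at once
        check(all.size == 64)
    }
    val streamed = peakHeapBytes {
        repeat(64) { check(ByteArray(chunk).size == chunk) } // one chunk live at a time
    }
    println("buffered peak = $buffered B, streamed peak = $streamed B")
}
```

Pool peaks count garbage that has not yet been collected, so on a large heap the two numbers can come out similar even when the live set differs. Running both paths under the same small -Xmx, or comparing the smallest -Xmx at which each path completes, is likely a more robust comparison.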
Not a correctness bug
The existing buffer-everything path is correct; this is purely about memory scalability of lowering.