[Research] Streaming / incremental tape API for lowering large models #740

@michalharakal

Component: skainet-lang tape · skainet-compile-hlo · Type: research / enhancement (NOT a correctness bug)

Problem

The current lowering pipeline is buffer-everything:

DefaultGraphExecutionContext.tape(...).record { forward(...) }   // materialises the WHOLE tape
    .toComputeGraph(...)                                          // then builds the FULL graph in memory

Peak memory is therefore O(model size + activation graph) before StableHLO emission even starts,
because the entire tape and the entire ComputeGraph co-exist in the heap. This is fine for small models
(Whisper-tiny.en records fine), but it does not scale to large forward passes.

Motivation / evidence

Lowering a real multi-GB model puts pressure on memory well before IREE is involved. Concretely, in the
skainet-iree-conformance harness a full TinyLlama-1.1B (22 layers, 1.1B params) forward buffers
the complete tape + graph; combined with weight materialisation this pushes peak heap past what a 1.1B
model should ever need. Even with an efficient weight path, the all-at-once tape/graph imposes an inherent
O(model) memory floor on lowering.

Research questions

  • Emit-as-recorded: can nodes be streamed to the converter (or to disk / a chunked IR) as they are
    recorded, instead of accumulating the full tape first?
  • Chunk by subgraph / layer: for a repeated-block model (decoder layers), can the tape be lowered
    block-by-block and stitched, so only one block's worth of nodes is live at once?
  • Cross-chunk references: how should values that cross chunk boundaries (residual stream,
    externalised weights, KV state) be handled without re-materialising the whole graph?
  • Interaction with output pruning (#760 prunedToOutputs): pruning today operates on the full
    graph; a streaming design needs an equivalent that works on partial graphs.
  • API shape: what does an incremental record/toComputeGraph look like, and can it stay
    source-compatible with the current all-at-once API as the default?
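
To make the first three questions concrete, here is a minimal sketch of what an emit-as-recorded sink with per-block chunking might look like. Everything in it is hypothetical and for illustration only: `TapeNode`, `TapeSink`, `BufferingSink`, `ChunkingSink` and `recordToyForward` are not existing skainet types, and the toy forward (a chain of ops per layer plus a residual add) only stands in for a real decoder block. The buffering sink mirrors today's all-at-once behaviour and could remain the default; the chunking sink keeps one block live and reports which inputs come from earlier chunks.

```kotlin
// Hypothetical types for illustration only; none of these exist in skainet.
data class TapeNode(val id: Int, val op: String, val inputs: List<Int>, val block: Int)

/** Receives nodes as they are recorded instead of after the whole forward finishes. */
fun interface TapeSink {
    fun emit(node: TapeNode)
}

/** Default behaviour: accumulate everything, like the current buffer-everything path. */
class BufferingSink : TapeSink {
    val nodes = mutableListOf<TapeNode>()
    override fun emit(node: TapeNode) { nodes += node }
}

/** Streaming behaviour: keep one block live and hand it off when the block index changes. */
class ChunkingSink(
    private val lowerChunk: (chunk: List<TapeNode>, externalInputs: Set<Int>) -> Unit
) : TapeSink {
    private val current = mutableListOf<TapeNode>()
    private var currentBlock = -1
    var peakLiveNodes = 0
        private set

    override fun emit(node: TapeNode) {
        if (node.block != currentBlock && current.isNotEmpty()) flush()
        currentBlock = node.block
        current += node
        peakLiveNodes = maxOf(peakLiveNodes, current.size)
    }

    /** Lowers the pending chunk; must also be called once after recording ends. */
    fun flush() {
        if (current.isEmpty()) return
        val defined = current.map { it.id }.toSet()
        // Values consumed here but defined in an earlier chunk (e.g. the residual stream).
        val external = current.flatMap { it.inputs }.filter { it !in defined }.toSet()
        lowerChunk(current.toList(), external)
        current.clear()
    }
}

/** Toy stand-in for a forward pass: per layer, a chain of ops plus a residual add. */
fun recordToyForward(sink: TapeSink, layers: Int, opsPerLayer: Int) {
    var nextId = 0
    var residual = nextId++
    sink.emit(TapeNode(residual, "input", emptyList(), block = 0))
    for (layer in 1..layers) {
        var prev = residual
        repeat(opsPerLayer) {
            val id = nextId++
            sink.emit(TapeNode(id, "op", listOf(prev), block = layer))
            prev = id
        }
        val out = nextId++
        sink.emit(TapeNode(out, "add", listOf(prev, residual), block = layer))
        residual = out
    }
}

fun main() {
    val buffered = BufferingSink()
    recordToyForward(buffered, layers = 4, opsPerLayer = 3)

    val boundaries = mutableListOf<Set<Int>>()
    val streaming = ChunkingSink { _, external -> boundaries.add(external) }
    recordToyForward(streaming, layers = 4, opsPerLayer = 3)
    streaming.flush()

    println("buffered live nodes: ${buffered.nodes.size}")
    println("streaming peak live nodes: ${streaming.peakLiveNodes}")
    println("cross-chunk inputs per chunk: $boundaries")
}
```

In this toy setup the buffering sink holds all 17 nodes while the chunking sink never holds more than 4, and each layer's only cross-chunk input is the previous residual value. A real design would additionally have to decide how block boundaries are detected and how weights and KV state are named across chunks, which this sketch does not address.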

Scope

Research / spike — not a committed feature. Success criteria: a design note plus a prototype that
demonstrably reduces peak heap for a large (≥1B-param) forward versus the buffer-everything path, while
producing identical StableHLO.
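
One possible way to check the peak-heap criterion on the JVM is sketched below, using the standard `java.lang.management` memory pool beans. The helper name `measurePeakHeapBytes` and the two workloads in `main` are assumptions made for illustration; the workloads are plain byte-array allocations standing in for the buffered and streaming lowering paths, not real skainet calls.

```kotlin
import java.lang.management.ManagementFactory
import java.lang.management.MemoryType

/**
 * Runs [block] and returns its result together with the summed peak usage (bytes)
 * of all heap memory pools observed while it ran. Summing per-pool peaks gives an
 * upper bound, since the pools may not peak at the same instant.
 */
fun <T> measurePeakHeapBytes(block: () -> T): Pair<T, Long> {
    val heapPools = ManagementFactory.getMemoryPoolMXBeans()
        .filter { it.type == MemoryType.HEAP }
    System.gc() // best-effort request, so the baseline is not dominated by old garbage
    heapPools.forEach { it.resetPeakUsage() }
    val result = block()
    val peak = heapPools.sumOf { it.peakUsage?.used ?: 0L }
    return result to peak
}

fun main() {
    // Stand-in for the buffer-everything path: all chunks are live at once.
    val (_, bufferedPeak) = measurePeakHeapBytes {
        val chunks = List(32) { ByteArray(1 shl 20) }
        chunks.sumOf { it.size }
    }
    // Stand-in for a streaming path: only one chunk is reachable at a time.
    val (_, streamingPeak) = measurePeakHeapBytes {
        (0 until 32).sumOf { ByteArray(1 shl 20).size }
    }
    println("buffered peak heap:  ${bufferedPeak / (1 shl 20)} MiB")
    println("streaming peak heap: ${streamingPeak / (1 shl 20)} MiB")
}
```

Pool peaks include garbage that has not been collected yet, so the streaming figure only drops reliably when the heap is under pressure. Running both paths with the same fixed `-Xmx` and checking whether each completes is likely a more robust comparison than the raw numbers alone.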

Not a correctness bug

The existing buffer-everything path is correct; this is purely about memory scalability of lowering.
