[Research] Streaming / incremental tape API for lowering large models

# [Research] Streaming / incremental tape API for lowering large models

**Component:** `skainet-lang` tape · `skainet-compile-hlo` · **Type:** research / enhancement (NOT a correctness bug)

## Problem

The current lowering pipeline is **buffer-everything**:

```
DefaultGraphExecutionContext.tape(...).record { forward(...) }   // materialises the WHOLE tape
    .toComputeGraph(...)                                          // then builds the FULL graph in memory
```

Peak memory is therefore O(model size + activation graph) *before* StableHLO emission even starts:
the entire tape and the entire `ComputeGraph` coexist on the heap. This works for small models
(Whisper-tiny.en records without trouble), but it does not scale to large forward passes.
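
To make the memory claim concrete, here is a toy model of the difference, assuming recording could be pointed at a sink. None of these types exist in SKaiNET: `Node`, `NodeSink`, `BufferingSink`, `StreamingSink` and `recordForward` are hypothetical names used only for illustration, and the emitted text is fake IR, not StableHLO.

```kotlin
// Toy model only: none of these types are part of skainet-lang.
data class Node(val id: Int, val op: String, val inputs: List<Int>)

/** Receives nodes in recording order. */
interface NodeSink {
    fun accept(node: Node)

    /** Largest number of nodes this sink has held in memory at once. */
    val peakLive: Int
}

/** Mirrors the buffer-everything path: every node is retained until the end. */
class BufferingSink : NodeSink {
    val tape = mutableListOf<Node>()
    override fun accept(node: Node) { tape += node }
    override val peakLive: Int get() = tape.size
}

/** Writes one line of fake IR per node immediately and retains nothing. */
class StreamingSink(private val out: Appendable) : NodeSink {
    private var live = 0
    override val peakLive: Int get() = live
    override fun accept(node: Node) {
        live = maxOf(live, 1) // the node being emitted is the only one held
        val args = node.inputs.joinToString { "%$it" }
        out.append("%${node.id} = ${node.op}($args)\n")
    }
}

/** Records an input node followed by `layers` blocks of three chained ops. */
fun recordForward(layers: Int, sink: NodeSink) {
    var prev = 0
    sink.accept(Node(prev, "input", emptyList()))
    repeat(layers) {
        for (op in listOf("matmul", "add", "relu")) {
            val id = prev + 1
            sink.accept(Node(id, op, listOf(prev)))
            prev = id
        }
    }
}
```

With 22 toy layers, `BufferingSink` ends up holding all 67 nodes, while `StreamingSink` never holds more than the node it is writing. A real tape carries shapes, attributes and weight references, so the gap in bytes would look different, but the growth pattern is the same.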

## Motivation / evidence

Lowering a real multi-GB model puts pressure on memory well before IREE is involved. Concretely, in the
`skainet-iree-conformance` harness, a full **TinyLlama-1.1B** forward pass (22 layers, 1.1B params) buffers
the complete tape and graph; combined with weight materialisation, this pushes peak heap past what a 1.1B
model should ever need. Even with an efficient weight path, the all-at-once tape/graph imposes an inherent
O(model) memory floor on lowering.

## Research questions

- **Emit-as-recorded:** can nodes be streamed to the converter (or to disk / a chunked IR) as they are
  recorded, instead of accumulating the full tape first?
- **Chunk by subgraph / layer:** for a repeated-block model (decoder layers), can the tape be lowered
  block-by-block and stitched, so only one block's worth of nodes is live at once?
- **Cross-chunk references:** how should values that cross chunk boundaries (residual stream,
  externalised weights, KV state) be handled without re-materialising the whole graph?
- **Interaction with output pruning** ([#760] `prunedToOutputs`): pruning today operates on the full
  graph; a streaming design needs an equivalent that works on partial graphs.
- **API shape:** what does an incremental `record`/`toComputeGraph` look like, and can it stay
  source-compatible with the current all-at-once API as the default?
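
One way to explore the chunk-by-layer and cross-chunk questions above is a toy in which each block declares the values that cross its boundary and is lowered on its own. This is a sketch under stated assumptions, not a proposed design: `Op`, `Chunk`, `lowerChunk`, `lowerStreaming` and `decoderChunks` are hypothetical names, the output is pseudo-IR rather than StableHLO, and only the residual stream is modelled as a live-in (externalised weights and KV state would presumably be further live-ins).

```kotlin
// Hypothetical types for illustration; not part of skainet-lang or skainet-compile-hlo.
data class Op(val result: String, val name: String, val operands: List<String>)

/** One recorded block plus the values that cross its boundary. */
data class Chunk(
    val name: String,
    val liveIn: List<String>,
    val ops: List<Op>,
    val liveOut: List<String>
)

/** Lowers a single chunk to a pseudo-IR function; only this chunk's ops are needed. */
fun lowerChunk(chunk: Chunk): String = buildString {
    append("func @${chunk.name}(${chunk.liveIn.joinToString()}) {\n")
    for (op in chunk.ops) {
        append("  ${op.result} = ${op.name}(${op.operands.joinToString()})\n")
    }
    append("  return ${chunk.liveOut.joinToString()}\n}\n")
}

/**
 * Pulls chunks one at a time, checks that every operand is either defined inside
 * the chunk or declared as a live-in, and writes the lowered text straight to `out`.
 * Returns the largest number of ops that were live in any single chunk.
 */
fun lowerStreaming(chunks: Sequence<Chunk>, out: Appendable): Int {
    var maxOpsLive = 0
    for (chunk in chunks) {
        val defined = chunk.liveIn.toMutableSet()
        for (op in chunk.ops) {
            require(op.operands.all { it in defined }) {
                "undeclared cross-chunk value in ${chunk.name}"
            }
            defined += op.result
        }
        require(chunk.liveOut.all { it in defined }) {
            "live-out not defined in ${chunk.name}"
        }
        maxOpsLive = maxOf(maxOpsLive, chunk.ops.size)
        out.append(lowerChunk(chunk))
    }
    return maxOpsLive
}

/** Generates decoder-like blocks lazily; the residual stream %h is the only live-in. */
fun decoderChunks(layers: Int): Sequence<Chunk> = sequence {
    for (i in 0 until layers) {
        yield(
            Chunk(
                name = "layer$i",
                liveIn = listOf("%h"),
                ops = listOf(
                    Op("%a", "attention", listOf("%h")),
                    Op("%r", "add", listOf("%h", "%a"))
                ),
                liveOut = listOf("%r")
            )
        )
    }
}
```

Calling `lowerStreaming(decoderChunks(22), StringBuilder())` emits 22 functions while holding at most one block's ops at a time. The toy leaves out the stitching step (a top-level function that threads each block's live-out into the next block's live-in) and any pruning, which are exactly the parts the questions above leave open.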

## Scope

Research / spike — **not** a committed feature. Success criteria: a design note plus a prototype that
demonstrably reduces peak heap for a large (≥1B-param) forward versus the buffer-everything path, while
producing identical StableHLO.
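
The two success criteria could be checked mechanically along these lines. This is a JVM-only sketch with hypothetical names (`LoweringRun`, `usedHeapBytes`, `measure`, `meetsCriteria`); it assumes the lowering under test can call a probe at points of interest, and heap sampling via `Runtime` is coarse because GC timing is not controlled.

```kotlin
// Hypothetical harness types; the real conformance harness may look different.
data class LoweringRun(val stableHlo: String, val peakHeapBytes: Long)

/** Currently used heap on the JVM; a coarse signal, not an exact peak. */
fun usedHeapBytes(): Long {
    val rt = Runtime.getRuntime()
    return rt.totalMemory() - rt.freeMemory()
}

/** Runs a lowering and samples used heap each time the lowering calls the probe. */
fun measure(lower: (probe: () -> Unit) -> String): LoweringRun {
    var peak = usedHeapBytes()
    val text = lower { peak = maxOf(peak, usedHeapBytes()) }
    return LoweringRun(text, peak)
}

/** Both criteria from this issue: identical output text and a strictly lower peak heap. */
fun meetsCriteria(baseline: LoweringRun, streaming: LoweringRun): Boolean =
    streaming.stableHlo == baseline.stableHlo &&
        streaming.peakHeapBytes < baseline.peakHeapBytes
```

For a trustworthy comparison, each path would likely need to run in a fresh JVM with a fixed heap limit, since the two runs otherwise share garbage and warm-up state.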

## Not a correctness bug

The existing buffer-everything path is correct; this is purely about memory scalability of lowering.

[#760]: https://github.com/SKaiNET-developers/SKaiNET/pull/760

[Research] Streaming / incremental tape API for lowering large models #740
