GGUF DEQUANTIZE_TO_FP32 over-allocates: 1.1B Q4_K_M needs >12 GB heap transiently (~4.4 GB legit)

## Summary

Loading a 1.1B-parameter Q4_K_M GGUF via `LlamaNetworkLoader.fromGguf(..., QuantPolicy.DEQUANTIZE_TO_FP32)` over-allocates badly: the legitimate resident cost is **~4.4 GB** (dense FP32), but the dequant path transiently needs **>12 GB heap** to get there. The model only loads cleanly with a **32 GB heap** — unreasonable for a 1.1B model.

This is a **transient-allocation / extra-copy problem in the dequant path**, not a correctness bug — the produced tensors are correct.

## Numbers

- Model: **TinyLlama-1.1B-Chat v1.0, Q4_K_M** (637 MB GGUF on disk).
- Parses correctly: **202 parameter tensors, ~1.1B params**, correct shapes (`token_embd [32000, 2048]`, GQA `k_proj [256, 2048]`).
- Legitimate dense FP32 footprint: **~4.4 GB** (1.1e9 params × 4 bytes).
- Observed peak heap to complete the load: **>12 GB transient** (the load fails with smaller heaps; 32 GB is needed to be safe), roughly **3× the dense floor**.
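The arithmetic behind these figures can be sketched as follows. This is only a back-of-envelope check: the function names are hypothetical, the parameter count is the approximate ~1.1B from above rather than an exact per-tensor sum, and 12 GB is the observed lower bound on peak heap, not a measured maximum.

```kotlin
// Back-of-envelope footprint arithmetic for the numbers in this issue.
// All names here are illustrative and do not exist in the code base.
val BYTES_PER_FP32 = 4L

// Dense FP32 footprint: one 4-byte float per parameter.
fun denseFp32Bytes(paramCount: Long): Long = paramCount * BYTES_PER_FP32

// Decimal gigabytes (1 GB = 1e9 bytes), matching the figures quoted above.
fun toGb(bytes: Long): Double = bytes / 1e9

// How far the observed peak overshoots the dense floor.
fun overshootRatio(peakBytes: Long, denseBytes: Long): Double =
    peakBytes.toDouble() / denseBytes.toDouble()
```

With `paramCount = 1_100_000_000L`, `denseFp32Bytes` gives 4.4 GB, and a 12 GB peak corresponds to an overshoot ratio of about 2.7. Since 12 GB is only a lower bound, the true ratio is at least that.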

## Likely cause

The `DEQUANTIZE_TO_FP32` path (`QuantPolicy.kt` → `skainet-io-gguf` dequant, `Quants.kt`) appears to materialize **boxed `Float` / intermediate copies** rather than unpacking each Q4_K/Q6_K block directly into a primitive `FloatArray`. With ~4.4 GB of final data, one extra full-size primitive copy alone would put the peak at ~8.8 GB, and any boxing overhead on top of that blows past 12 GB.
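To show why boxing is a plausible suspect, here is a rough per-element cost estimate. The object and reference sizes are assumptions (a typical 64-bit HotSpot JVM with compressed object pointers: about 16 bytes per `java.lang.Float` and 4 bytes per reference) and will differ on other JVMs or settings. The function names are hypothetical. In Kotlin, `Array<Float>` and `List<Float>` hold boxed values on the JVM, while `FloatArray` is a primitive `float[]`.

```kotlin
// Rough per-element heap cost of primitive vs. boxed float storage.
// Sizes below are ASSUMPTIONS for a typical 64-bit HotSpot JVM with
// compressed oops; they are not measured from the loader in this issue.
val PRIMITIVE_FLOAT_BYTES = 4L      // one slot in a FloatArray (float[])
val BOXED_FLOAT_OBJECT_BYTES = 16L  // assumed size of one java.lang.Float
val REFERENCE_BYTES = 4L            // assumed size of one compressed reference

// Heap needed for n floats held in a primitive FloatArray (header ignored).
fun primitiveArrayBytes(n: Long): Long = n * PRIMITIVE_FLOAT_BYTES

// Heap needed for n floats held as boxed values (e.g. List<Float>):
// one reference slot plus one Float object per element.
fun boxedCollectionBytes(n: Long): Long =
    n * (BOXED_FLOAT_OBJECT_BYTES + REFERENCE_BYTES)
```

Under these assumptions, 1.1e9 elements cost about 4.4 GB as a `FloatArray` but about 22 GB as a full-size boxed collection, so even a partially boxed intermediate would be enough to explain a peak above 12 GB. This is an estimate to guide profiling, not a confirmed diagnosis.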

## Why it matters

This blocks adding **TinyLlama-1.1B as a real-weights conformance/export reference** on the IREE conformance side. The export use case (trace → StableHLO with weights baked as constants) fundamentally needs the ~4.4 GB FP32 resident, so the lazy `RowDequantSource` / `ops.gather` row-dequant path (a24f21d0) does **not** help here — the issue is specifically the **transient overshoot above the 4.4 GB floor** during eager full materialization.

## Suggested fix

- Stream the Q4_K/Q6_K unpack **block-by-block straight into the destination `FloatArray`** (no boxed `Float`, no full-size intermediate copy).
- Target: peak heap ≈ dense FP32 footprint + a small per-block scratch (i.e. **~5–6 GB**, not >12 GB, for this model).
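A minimal sketch of the streaming approach, under loud assumptions: the block layout below is a simplified stand-in (a 4-byte little-endian float scale followed by 16 bytes of packed 4-bit values, 32 weights per block), **not** the real Q4_K or Q6_K layout, and `dequantizeInto` is a hypothetical name rather than an existing function in `Quants.kt`. The point is only the allocation pattern: each block is decoded directly into the caller's preallocated `FloatArray`, with no boxed `Float` and no full-size intermediate.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Simplified, hypothetical 4-bit block format (NOT the real Q4_K layout):
// [ float32 scale (little-endian) | 16 bytes, two 4-bit values per byte ]
val BLOCK_WEIGHTS = 32
val BLOCK_BYTES = 4 + BLOCK_WEIGHTS / 2

// Decodes blockCount blocks from src straight into dst, starting at dstOffset.
// The only allocation is a lightweight buffer view; no per-element objects.
fun dequantizeInto(src: ByteBuffer, blockCount: Int, dst: FloatArray, dstOffset: Int = 0) {
    require(dst.size - dstOffset >= blockCount * BLOCK_WEIGHTS) { "destination too small" }
    // duplicate() shares the bytes but keeps src's position untouched.
    val buf = src.duplicate().order(ByteOrder.LITTLE_ENDIAN)
    var out = dstOffset
    for (b in 0 until blockCount) {
        val scale = buf.getFloat()
        for (j in 0 until BLOCK_WEIGHTS / 2) {
            val packed = buf.get().toInt() and 0xFF
            // Low nibble -> first half of the block, high nibble -> second half.
            dst[out + j] = ((packed and 0x0F) - 8) * scale
            dst[out + j + BLOCK_WEIGHTS / 2] = ((packed ushr 4) - 8) * scale
        }
        out += BLOCK_WEIGHTS
    }
}
```

With this shape, the caller allocates one `FloatArray` per tensor at its final size and the transient overhead is a handful of local variables per block, which is what the ~5–6 GB target assumes. The real fix would need the actual Q4_K/Q6_K super-block decoding in place of the inner loop.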

## Notes / related

- Do **not** work around with a giant `-Xmx`; that masks the issue and is unusable in CI.
- Related but distinct: #740 (streaming/incremental tape for *lowering* large models) covers the next downstream step (trace/export memory at scale), not this load-time dequant.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
