Summary
Loading a 1.1B-parameter Q4_K_M GGUF via LlamaNetworkLoader.fromGguf(..., QuantPolicy.DEQUANTIZE_TO_FP32) over-allocates badly: the legitimate resident cost is ~4.4 GB (dense FP32), but the dequant path transiently needs >12 GB heap to get there. The model only loads cleanly with a 32 GB heap — unreasonable for a 1.1B model.
This is a transient-allocation / extra-copy problem in the dequant path, not a correctness bug — the produced tensors are correct.
Numbers
- Model: TinyLlama-1.1B-Chat v1.0, Q4_K_M (637 MB GGUF on disk).
- Parses correctly: 202 parameter tensors, ~1.1B params, correct shapes (token_embd [32000, 2048], GQA k_proj [256, 2048]).
- Legitimate dense FP32 footprint: ~4.4 GB (1.1e9 params × 4 bytes).
- Observed peak heap to complete the load: >12 GB transient (the load fails below that; 32 GB is needed to be safe), roughly 3× the dense floor.
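The footprint arithmetic above can be checked with a few lines of Kotlin. This is only a back-of-the-envelope sketch: the parameter count is the rounded ~1.1B figure, and 12 GB is the lower bound on the observed peak, so the ratio is a lower bound too.

```kotlin
// Rounded figures from the list above; not measured values.
val params = 1_100_000_000L               // ~1.1B parameters
val denseBytes = params * 4               // FP32 = 4 bytes per parameter
val denseGb = denseBytes / 1e9            // dense FP32 floor in GB
val observedPeakGb = 12.0                 // lower bound on the observed transient peak
val overshoot = observedPeakGb / denseGb  // lower bound on peak / floor
```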
Likely cause
The DEQUANTIZE_TO_FP32 path (QuantPolicy.kt → skainet-io-gguf dequant, Quants.kt) appears to materialize boxed Float / intermediate copies rather than unpacking each Q4_K/Q6_K block directly into a primitive FloatArray. With ~4.4 GB of final data, even one extra full-size intermediate (plus boxing overhead) blows past 12 GB.
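To make the suspected pattern concrete, here is a hypothetical sketch of the two code shapes. It is not the actual Quants.kt code (which is not reproduced in this issue); the function names and the trivial byte-times-scale "decode" are placeholders chosen for illustration.

```kotlin
// Hypothetical decoders; names are illustrative, not from skainet-io-gguf.

// Boxed shape: on the JVM a List<Float> holds references to boxed Float objects,
// and toFloatArray() then allocates a second, full-size primitive copy.
fun decodeBoxed(raw: ByteArray, scale: Float): FloatArray {
    val values: List<Float> = raw.map { it * scale }
    return values.toFloatArray()
}

// Primitive shape: a single caller-supplied destination array, filled in place.
fun decodeInto(raw: ByteArray, scale: Float, dst: FloatArray) {
    require(dst.size >= raw.size) { "destination too small" }
    for (i in raw.indices) dst[i] = raw[i] * scale
}
```

Both produce the same values; only the first keeps a boxed list and a primitive copy alive at the same time.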
Why it matters
This blocks adding TinyLlama-1.1B as a real-weights conformance/export reference on the IREE conformance side. The export use case (trace → StableHLO with weights baked as constants) fundamentally needs the ~4.4 GB FP32 resident, so the lazy RowDequantSource / ops.gather row-dequant path (a24f21d) does not help here — the issue is specifically the transient overshoot above the 4.4 GB floor during eager full materialization.
Suggested fix
- Stream the Q4_K/Q6_K unpack block-by-block straight into the destination FloatArray (no boxed Float, no full-size intermediate copy).
- Target: peak heap ≈ dense FP32 footprint + a small per-block scratch (i.e. ~5–6 GB, not >12 GB, for this model).
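A minimal sketch of the streaming shape, under loud assumptions: it uses a toy 4-bit block format (one little-endian Float32 scale followed by 16 bytes of packed nibbles), not the real Q4_K or Q6_K layouts, and the function name is hypothetical. The point is only the allocation pattern: the destination FloatArray is allocated once and each block is unpacked straight into it.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Toy block format (hypothetical, NOT the real Q4_K layout):
// 4-byte little-endian Float32 scale + 16 bytes holding 32 unsigned nibbles.
val BLOCK_VALUES = 32
val BLOCK_BYTES = 4 + BLOCK_VALUES / 2

/** Unpacks [blockCount] blocks from [src] directly into [dst], starting at [dstOffset]. */
fun dequantToyQ4Into(src: ByteBuffer, blockCount: Int, dst: FloatArray, dstOffset: Int = 0) {
    require(dst.size - dstOffset >= blockCount * BLOCK_VALUES) { "destination too small" }
    // duplicate() leaves the caller's position untouched; the byte order must be set again.
    val buf = src.duplicate().order(ByteOrder.LITTLE_ENDIAN)
    var out = dstOffset
    repeat(blockCount) {
        val scale = buf.getFloat()
        for (i in 0 until BLOCK_VALUES / 2) {
            val packed = buf.get().toInt() and 0xFF
            // Low nibble first, then high nibble; both are re-centred around zero.
            dst[out++] = scale * ((packed and 0x0F) - 8)
            dst[out++] = scale * ((packed ushr 4) - 8)
        }
    }
}
```

With this shape the only per-block scratch is a handful of locals, so peak heap stays close to the size of the destination array plus whatever backs the source buffer (for example a memory-mapped file, which lives outside the Java heap).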
Notes / related
- […] -Xmx; that masks the issue and is unusable in CI.
🤖 Generated with Claude Code