Interleave read/write/lookup tower specs to reduce proof size #1362
hero78119 wants to merge 18 commits into
This reverts commit 052788d.
Profiling update: full no-shard rerun after GPU memory-log gating
Benchmark shape: block
Full-run wall time
Log-gating impact
Conclusion: GPU memory-log gating has a measurable but small full-run improvement. The main
Proof-size comparison
Proof-size data is unchanged from the output-dir proof-size runs; the latest log-gating rerun did not use
Tower proof remains the proof-size reduction source. PCS opening stays roughly flat around 1.08-1.09 MB per shard.
Runtime interpretation
The current feature path still improves proof size substantially, but the full-run speed regression remains after removing GPU memory logs. The next optimization target should stay on tower proving work, especially generic tower sumcheck and internal tower build/liveness, not scheduler memory-log printing.
CI slowdown investigation: runs 27701042458 vs 26959441485
Compared the benchmark job logs for:
The run is not slower because of a different block, runner, shard count, or benchmark setting. Both runs prove the same block and verify 13 shard proofs.
High-level result
The workflow wall time increases by about 25%. The prover stage itself regresses from 65.4s to 105s, about 1.6x.
Stage breakdown from logs
Task sums overlap under concurrent proving, so they should not be read as wall time. They do identify the source of the regression: the extra wall time maps to tower proving, especially tower witness build and tower relation GPU proving. PCS commit is slightly faster on the feature branch, and batched main constraints are nearly unchanged.
Keccak memory comparison
The feature branch reduces Keccak estimated memory and proof size, but the new Keccak/interleaving tower path is slower in tower proving. The current bottleneck is not proof size, PCS commit, or main constraints; it is the tower path overhead introduced by the feature branch.
Conclusion
The slowdown is caused by
Problem
The old tower argument proves each read, write, and lookup expression as its own tower spec. That is simple, but it creates a large proof surface: every spec carries its own tower proof metadata, evaluation points, and transcript data. Chips with many lookup expressions, especially Keccak, pay this cost hundreds or thousands of times.
This PR packs same-kind tower records together before tower proving, so the prover/verifier see fewer, wider tower specs instead of many narrow specs.
Terminology
p1, p2, q1, q2; internal nodes combine them into the next logup layer.
Design Rationale
The protocol-level idea is to reduce the number of tower specs, not to change the meaning of the read/write/lookup argument.
Before interleaving, if a chip has N rows and K lookup expressions, it builds K independent logup towers:
After interleaving, the same values are packed into one logical tower. The extra operation bits select which lookup expression is being addressed:
For a toy case with N = 4 rows and K = 3 lookup expressions, the old layout has three towers of height log2(4) = 2. The interleaved layout has one tower over 4 rows * next_power_of_two(3) ops = 16 leaves, so its height is log2(16) = 4. The prover does some padding work, but the verifier and transcript only track one lookup tower spec instead of three.
For Keccak-like chips this matters more: many lookup expressions are compressed into one lookup tower. That is the main source of the proof-size reduction.
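The toy case above can be checked with a small sketch. All function names here are hypothetical and not taken from the PR. The index layout (operation bits in the low bits, i.e. row * ops + op) and the padding value are assumptions made only for illustration; the PR text does not state which bits carry the operation index.

```rust
/// Height of a binary tower over `leaves` entries (must be a power of two).
fn tower_height(leaves: usize) -> u32 {
    assert!(leaves.is_power_of_two());
    leaves.trailing_zeros()
}

/// Old layout: one tower per lookup expression.
/// Returns (number of tower specs, height of each tower).
fn separate_layout(n_rows: usize, k_exprs: usize) -> (usize, u32) {
    (k_exprs, tower_height(n_rows))
}

/// Interleaved layout: a single tower over n_rows * next_power_of_two(k_exprs) leaves.
fn interleaved_layout(n_rows: usize, k_exprs: usize) -> (usize, u32) {
    let ops = k_exprs.next_power_of_two();
    (1, tower_height(n_rows * ops))
}

/// Pack K columns of N values into one padded leaf vector.
/// ASSUMPTION: the operation index sits in the low bits (row * ops + op),
/// and unused operation slots are filled with `pad`.
fn interleave(columns: &[Vec<u64>], pad: u64) -> Vec<u64> {
    let n_rows = columns.first().map_or(0, |c| c.len());
    let ops = columns.len().next_power_of_two();
    let mut leaf = vec![pad; n_rows * ops];
    for (op, col) in columns.iter().enumerate() {
        assert_eq!(col.len(), n_rows, "all columns must have the same row count");
        for (row, &v) in col.iter().enumerate() {
            leaf[row * ops + op] = v;
        }
    }
    leaf
}
```

With N = 4 and K = 3, separate_layout gives three towers of height 2 and interleaved_layout gives one tower of height 4. A quarter of the 16 leaves are padding, which is the extra prover work mentioned above.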
Witness Build
Witness build now groups tower-facing MLEs by kind:
On GPU, the build path keeps the interleaved leaf virtual where possible. It builds the first internal layer directly from the virtual leaf, then hands dense tower layers to the tower prover. This avoids keeping thousands of separate tower specs alive and avoids materializing the largest padded leaf layer.
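As a rough CPU-side illustration of keeping the leaf virtual, the sketch below computes the first internal layer of a product tower by reading leaf values on demand instead of allocating the padded leaf vector. It is not the PR's GPU code. The names, the toy modulus, the row * ops + op layout, the adjacent-pair fold, and the padding value 1 are all assumptions.

```rust
/// Toy prime modulus (2^31 - 1), used only so products fit in u64.
const P: u64 = 2_147_483_647;

/// Read one entry of the interleaved leaf without materializing it.
/// ASSUMPTION: index = row * ops + op; padded operation slots hold 1,
/// the neutral element for a product tower.
fn virtual_leaf(columns: &[Vec<u64>], ops: usize, idx: usize) -> u64 {
    let (row, op) = (idx / ops, idx % ops);
    if op < columns.len() {
        columns[op][row] % P
    } else {
        1
    }
}

/// Build the first internal product layer directly from the virtual leaf.
/// ASSUMPTION: each parent multiplies two adjacent leaves; the row count
/// is a power of two and every column has the same length.
fn first_internal_layer(columns: &[Vec<u64>]) -> Vec<u64> {
    let n_rows = columns.first().map_or(0, |c| c.len());
    let ops = columns.len().next_power_of_two();
    let leaves = n_rows * ops;
    (0..leaves / 2)
        .map(|i| {
            let left = virtual_leaf(columns, ops, 2 * i);
            let right = virtual_leaf(columns, ops, 2 * i + 1);
            left * right % P
        })
        .collect()
}
```

Only the half-size layer is allocated, so the largest padded layer never exists in memory; later layers can be folded from this dense vector as usual.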
Proving
Tower proving runs sumcheck over the resulting product/logup tower specs. The verifier still checks the same product/logup relations; the difference is that the spec index is now encoded as operation bits in the MLE domain.
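The logup relation that each layer satisfies is ordinary fraction addition: two children (p1, q1) and (p2, q2) combine into (p1*q2 + p2*q1, q1*q2). The sketch below folds a layer to its root over a toy modulus. The type and function names are hypothetical, the adjacent-pair fold order is an assumption, and no sumcheck is performed here; it only shows the relation the sumcheck is checking.

```rust
/// Toy prime modulus (2^31 - 1), used only so products fit in u64.
const P: u64 = 2_147_483_647;

/// One logup node: numerator p and denominator q, representing p / q.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Frac {
    p: u64,
    q: u64,
}

/// Combine two children into their parent:
/// p1/q1 + p2/q2 = (p1*q2 + p2*q1) / (q1*q2), all mod P.
fn combine(a: Frac, b: Frac) -> Frac {
    let (ap, aq, bp, bq) = (a.p % P, a.q % P, b.p % P, b.q % P);
    Frac {
        p: (ap * bq % P + bp * aq % P) % P,
        q: aq * bq % P,
    }
}

/// Fold a power-of-two layer down to the root by combining adjacent pairs.
fn logup_root(mut layer: Vec<Frac>) -> Frac {
    assert!(layer.len().is_power_of_two());
    while layer.len() > 1 {
        layer = layer.chunks(2).map(|c| combine(c[0], c[1])).collect();
    }
    layer[0]
}
```

A padded slot can hold 0/1, which leaves the sum unchanged. For example, folding 1/2, 1/3, 1/6, 0/1 yields the unreduced root 36/36, i.e. a sum of 1.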
This reduces proof size because the proof contains fewer independent tower specs, fewer per-spec evaluation point lists, and fewer transcript commitments for those specs. The tradeoff is that some GPU work moves into wider interleaved towers, so aggregate tower-proving profiler spans can increase even when wall-clock app_prove improves through overlap and smaller proof surface.
Change Highlights
Benchmark / Performance Impact
CI runs compare block 23817600 with GPU proving and proof output enabled:
Published summaries:
Wall Clock
create_proof times
Prover Stage Breakdown
These rows are profiler aggregate spans. With concurrent GPU proving, aggregate subspan time can exceed wall-clock app_prove because work from different shards overlaps.
Proof Size
output/app_proof.bitcode
The proof-size improvement is the primary win: the same block proof drops by 40.80 MiB. This matches the design goal of reducing the number of tower specs and their per-spec proof metadata.
Testing
Validated through the benchmark CI runs above. Both runs completed successfully and verified the generated shard proofs.
Risks and Rollout
Follow-ups
Copilot Reviewer Directive (keep this section)
When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.