Skip to content

Interleave read/write/lookup tower specs to reduce proof size#1362

Open
hero78119 wants to merge 18 commits into
feat/keccak_improvementfrom
feat/opt
Open

Interleave read/write/lookup tower specs to reduce proof size#1362
hero78119 wants to merge 18 commits into
feat/keccak_improvementfrom
feat/opt

Conversation

@hero78119

@hero78119 hero78119 commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Problem

The old tower argument proves each read, write, and lookup expression as its own tower spec. That is simple, but it creates a large proof surface: every spec carries its own tower proof metadata, evaluation points, and transcript data. Chips with many lookup expressions, especially Keccak, pay this cost hundreds or thousands of times.

This PR packs same-kind tower records together before tower proving, so the prover/verifier see fewer, wider tower specs instead of many narrow specs.

Terminology

  • Tower: the binary reduction tree used by the read/write/lookup argument. The leaf layer contains record MLEs; each internal layer folds adjacent values with the product or logup relation until one root remains.
  • Product tower: the tower for read/write multiplicative checks. Each leaf row has two MLEs and internal nodes multiply the pair.
  • Logup tower: the tower for lookup numerator/denominator checks. Each leaf row has p1, p2, q1, q2; internal nodes combine them into the next logup layer.
  • Spec: one independent tower proof unit. Before this PR, each read/write/lookup expression usually became one spec.
  • Interleaving: packing multiple specs into one wider leaf domain by adding operation bits after the row bits.
  • Virtual leaf: a GPU representation that describes the interleaved leaf without materializing the full padded leaf layer.
  • Tail default: rows outside the real occupied domain are known constants, so the prover can avoid storing them explicitly.

Design Rationale

The protocol-level idea is to reduce the number of tower specs, not to change the meaning of the read/write/lookup argument.

Before interleaving, if a chip has N rows and K lookup expressions, it builds K independent logup towers:

lk0 rows: [a0, a1, ..., aN]
lk1 rows: [b0, b1, ..., bN]
lk2 rows: [c0, c1, ..., cN]

proof specs: lk0 tower, lk1 tower, lk2 tower

After interleaving, the same values are packed into one logical tower. The extra operation bits select which lookup expression is being addressed:

row 0: [a0, b0, c0, padding]
row 1: [a1, b1, c1, padding]
row 2: [a2, b2, c2, padding]
...

proof specs: one interleaved lookup tower

For a toy case with N = 4 rows and K = 3 lookup expressions, the old layout has three towers of height log2(4) = 2. The interleaved layout has one tower over 4 rows * next_power_of_two(3) ops = 16 leaves, so its height is log2(16) = 4. The prover does some padding work, but the verifier and transcript only track one lookup tower spec instead of three.

For Keccak-like chips this matters more: many lookup expressions are compressed into one lookup tower. That is the main source of the proof-size reduction.

Witness Build

Witness build now groups tower-facing MLEs by kind:

  • read records become interleaved product tower inputs;
  • write records become interleaved product tower inputs;
  • lookup records become interleaved logup tower inputs.

On GPU, the build path keeps the interleaved leaf virtual where possible. It builds the first internal layer directly from the virtual leaf, then hands dense tower layers to the tower prover. This avoids keeping thousands of separate tower specs alive and avoids materializing the largest padded leaf layer.

Proving

Tower proving runs sumcheck over the resulting product/logup tower specs. The verifier still checks the same product/logup relations; the difference is that the spec index is now encoded as operation bits in the MLE domain.

This reduces proof size because the proof contains fewer independent tower specs, fewer per-spec evaluation point lists, and fewer transcript commitments for those specs. The tradeoff is that some GPU work moves into wider interleaved towers, so aggregate tower-proving profiler spans can increase even when wall-clock app_prove improves through overlap and smaller proof surface.

Change Highlights

  • Add GPU tower interleaving for read/write/lookup tower records.
  • Build tower-facing witnesses separately from unrelated witness outputs to reduce live VRAM pressure during tower proving.
  • Add GPU memory estimation for the interleaved tower build/prove stages.
  • Update prover/verifier handling for interleaved tower proof metadata and evaluation points.
  • Keep the verifier semantics tied to the original read/write/lookup argument; interleaving changes representation, not the checked relation.

Benchmark / Performance Impact

CI runs compare block 23817600 with GPU proving and proof output enabled:

Published summaries:

Wall Clock

Metric Before After Improvement
E2E total 83.300s 72.400s 1.15x faster, -13.1%
app_prove 68.500s 56.100s 1.22x faster, -18.1%
Sum of shard create_proof times 65.100s 52.620s 1.24x faster, -19.2%
emulator 10.200s 10.100s 1.01x faster, -1.0%

Prover Stage Breakdown

These rows are profiler aggregate spans. With concurrent GPU proving, aggregate subspan time can exceed wall-clock app_prove because work from different shards overlaps.

Stage Before After Ratio / note
commit_traces 17.590s 17.263s 1.02x faster
build_tower_witness_gpu 1.428s 18.765s 13.14x more aggregate GPU work; interleaving moves real work into tower witness build
prove_tower_relation_gpu 22.565s 160.990s 7.13x more aggregate GPU work; wider interleaved towers replace many narrow towers
prove_batched_main_constraints 7.643s 7.364s 1.04x faster
pcs_opening 9.953s 7.996s 1.24x faster

Proof Size

Output Before After Improvement
output/app_proof.bitcode 65.55 MiB 23.75 MiB 2.76x smaller, -63.8%

The proof-size improvement is the primary win: the same block proof drops by 40.80 MiB. This matches the design goal of reducing the number of tower specs and their per-spec proof metadata.

Testing

Validated through the benchmark CI runs above. Both runs completed successfully and verified the generated shard proofs.

Risks and Rollout

  • The main risk is representation mismatch between prover and verifier metadata for interleaved tower specs. The verifier must interpret operation bits consistently with the prover.
  • GPU aggregate tower time is higher in the current implementation. This PR trades some prover-side tower work for much smaller proof size and better wall-clock app proving on the measured block.
  • Rollback is straightforward: route tower witness build/proving back to the non-interleaved per-spec tower path.

Follow-ups

  • Continue optimizing interleaved tower GPU kernels, especially build/prove aggregate time.
  • Add more targeted benchmarks for chips with very high lookup counts, such as Keccak, to track the cost of wider towers separately from full-block overlap.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

@hero78119 hero78119 marked this pull request as draft June 17, 2026 12:27
@hero78119

hero78119 commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Profiling update: full no-shard rerun after GPU memory-log gating

Benchmark shape: block 23587691, no --shard-id, CONCURRENT=1, WITGEN=0, MEM_TRACK=0, cache=1, h23, max-cell 6. The latest rerun uses the current feature branch with only the GPU memory-log gating hunks from 74a73354 ported.

Full-run wall time

Case Result app.prove app_prove.inner shard 0 create_proof_of_shard shard 1 create_proof_of_shard total shard create proof app.verify GPU memory logs
Baseline pass 17.4s 16.5s 7.60s 6.14s 13.74s 6.30s 906 device / 4 baseline
Feature before log-gating pass 18.3s 17.5s 8.09s 6.53s 14.62s 6.32s 918 device / 4 baseline
Feature after log-gating pass 18.1s 17.2s 7.97s 6.42s 14.39s 6.21s 0 device / 0 baseline

Log-gating impact

Metric Before After Delta
app.prove 18.3s 18.1s -0.2s
app_prove.inner 17.5s 17.2s -0.3s
total shard create proof 14.62s 14.39s -0.23s
shard 0 create proof 8.09s 7.97s -0.12s
shard 1 create proof 6.53s 6.42s -0.11s
app.verify 6.32s 6.21s -0.11s
GPU memory log lines 922 0 -922

Conclusion: GPU memory-log gating has a measurable but small full-run improvement. The main create_proof regression is not explained by memory-log printing/querying; after log gating, feature remains slower than baseline by +0.65s total shard create proof and +0.7s app.prove.

Proof-size comparison

Proof-size data is unchanged from the output-dir proof-size runs; the latest log-gating rerun did not use --output-dir.

Metric Baseline Feature Delta
total bitcode bytes 9,010,490 3,678,613 -5,331,877 (-59.17%)
total MiB 8.59 3.51 -5.08 MiB
shard 0 bitcode bytes 5,753,614 1,977,959 -3,775,655 (-65.62%)
shard 1 bitcode bytes 3,256,747 1,700,431 -1,556,316 (-47.79%)
verified chip groups 61 / 52 62 / 53 +1 / +1

Tower proof remains the proof-size reduction source. PCS opening stays roughly flat around 1.08-1.09 MB per shard.

Runtime interpretation

The current feature path still improves proof size substantially, but the full-run speed regression remains after removing GPU memory logs. The next optimization target should stay on tower proving work, especially generic tower sumcheck and internal tower build/liveness, not scheduler memory-log printing.

@hero78119

Copy link
Copy Markdown
Collaborator Author

CI slowdown investigation: runs 27701042458 vs 26959441485

Compared the benchmark job logs for:

Run Ceno ref Block Runner Settings
Baseline 26959441485 master 23817600 scroll-SEA2-10 CONCURRENT=1, WITGEN=0, CACHE=1, max cell 1207959552
Feature 27701042458 feat/opt 23817600 scroll-SEA2-10 CONCURRENT=1, WITGEN=0, CACHE=1, max cell 1207959552

The run is not slower because of a different block, runner, shard count, or benchmark setting. Both runs prove the same block and verify 13 shard proofs.

High-level result

Metric Baseline Feature Delta
Benchmark job wall time 12m36s 15m46s +3m10s
app_prove.inner 65.4s 105s +39.6s
Sum create_proof_of_shard 62.01s 100.97s +38.96s
Proof size 56.84 MiB 23.75 MiB -58.2%

The workflow wall time increases by about 25%. The prover stage itself regresses from 65.4s to 105s, about 1.6x.

Stage breakdown from logs

Stage aggregate Baseline Feature Delta
commit_traces sum 17.59s 16.22s -1.37s
prove_batched_main_constraints sum 7.93s 8.13s +0.20s
prove_tower_relation task sum 136.69s 274.01s +137.32s
build_tower_witness_gpu task sum 18.07s 90.31s +72.24s
prove_tower_relation_gpu task sum 115.63s 183.24s +67.62s

Task sums overlap under concurrent proving, so they should not be read as wall time. They do identify the source of the regression: the extra wall time maps to tower proving, especially tower witness build and tower relation GPU proving. PCS commit is slightly faster on the feature branch, and batched main constraints are nearly unchanged.

Keccak memory comparison

Keccak path Max estimated total
Baseline Ecall_Keccak 16112 MB
Feature KeccakCore 13950 MB
Feature KeccakEcall 95 MB

The feature branch reduces Keccak estimated memory and proof size, but the new Keccak/interleaving tower path is slower in tower proving. The current bottleneck is not proof size, PCS commit, or main constraints; it is the tower path overhead introduced by the feature branch.

Conclusion

The slowdown is caused by feat/opt tower proving overhead, not by CI variance or input-size mismatch. The next optimization target should be build_tower_witness_gpu and prove_tower_relation_gpu in the interleaved Keccak tower path.

@hero78119 hero78119 marked this pull request as ready for review June 19, 2026 06:56
@hero78119 hero78119 changed the title interleaving multi-specs read/write/lk tower proving into single read/write/lk Interleave read/write/lookup tower specs to reduce proof size Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant