Interleave read/write/lookup tower specs to reduce proof size by hero78119 · Pull Request #1362 · scroll-tech/ceno

hero78119 · 2026-06-17T12:27:10Z

Problem

The old tower argument proves each read, write, and lookup expression as its own tower spec. That is simple, but it creates a large proof surface: every spec carries its own tower proof metadata, evaluation points, and transcript data. Chips with many lookup expressions, especially Keccak, pay this cost hundreds or thousands of times.

This PR packs same-kind tower records together before tower proving, so the prover/verifier see fewer, wider tower specs instead of many narrow specs.

Terminology

Tower: the binary reduction tree used by the read/write/lookup argument. The leaf layer contains record MLEs; each internal layer folds adjacent values with the product or logup relation until one root remains.
Product tower: the tower for read/write multiplicative checks. Each leaf row has two MLEs and internal nodes multiply the pair.
Logup tower: the tower for lookup numerator/denominator checks. Each leaf row has p1, p2, q1, q2; internal nodes combine them into the next logup layer.
Spec: one independent tower proof unit. Before this PR, each read/write/lookup expression usually became one spec.
Interleaving: packing multiple specs into one wider leaf domain by adding operation bits after the row bits.
Virtual leaf: a GPU representation that describes the interleaved leaf without materializing the full padded leaf layer.
Tail default: rows outside the real occupied domain are known constants, so the prover can avoid storing them explicitly.

Design Rationale

The protocol-level idea is to reduce the number of tower specs, not to change the meaning of the read/write/lookup argument.

Before interleaving, if a chip has N rows and K lookup expressions, it builds K independent logup towers:

lk0 rows: [a0, a1, ..., aN]
lk1 rows: [b0, b1, ..., bN]
lk2 rows: [c0, c1, ..., cN]

proof specs: lk0 tower, lk1 tower, lk2 tower

After interleaving, the same values are packed into one logical tower. The extra operation bits select which lookup expression is being addressed:

row 0: [a0, b0, c0, padding]
row 1: [a1, b1, c1, padding]
row 2: [a2, b2, c2, padding]
...

proof specs: one interleaved lookup tower

For a toy case with N = 4 rows and K = 3 lookup expressions, the old layout has three towers of height log2(4) = 2. The interleaved layout has one tower over 4 rows * next_power_of_two(3) ops = 16 leaves, so its height is log2(16) = 4. The prover does some padding work, but the verifier and transcript only track one lookup tower spec instead of three.

For Keccak-like chips this matters more: many lookup expressions are compressed into one lookup tower. That is the main source of the proof-size reduction.

Witness Build

Witness build now groups tower-facing MLEs by kind:

read records become interleaved product tower inputs;
write records become interleaved product tower inputs;
lookup records become interleaved logup tower inputs.

On GPU, the build path keeps the interleaved leaf virtual where possible. It builds the first internal layer directly from the virtual leaf, then hands dense tower layers to the tower prover. This avoids keeping thousands of separate tower specs alive and avoids materializing the largest padded leaf layer.

Proving

Tower proving runs sumcheck over the resulting product/logup tower specs. The verifier still checks the same product/logup relations; the difference is that the spec index is now encoded as operation bits in the MLE domain.

This reduces proof size because the proof contains fewer independent tower specs, fewer per-spec evaluation point lists, and fewer transcript commitments for those specs. The tradeoff is that some GPU work moves into wider interleaved towers, so aggregate tower-proving profiler spans can increase even when wall-clock app_prove improves through overlap and smaller proof surface.

Change Highlights

Add GPU tower interleaving for read/write/lookup tower records.
Build tower-facing witnesses separately from unrelated witness outputs to reduce live VRAM pressure during tower proving.
Add GPU memory estimation for the interleaved tower build/prove stages.
Update prover/verifier handling for interleaved tower proof metadata and evaluation points.
Keep the verifier semantics tied to the original read/write/lookup argument; interleaving changes representation, not the checked relation.

Benchmark / Performance Impact

CI runs compare block 23817600 with GPU proving and proof output enabled:

Published summaries:

Wall Clock

Metric	Before	After	Improvement
E2E total	83.300s	72.400s	1.15x faster, -13.1%
app_prove	68.500s	56.100s	1.22x faster, -18.1%
Sum of shard `create_proof` times	65.100s	52.620s	1.24x faster, -19.2%
emulator	10.200s	10.100s	1.01x faster, -1.0%

Prover Stage Breakdown

These rows are profiler aggregate spans. With concurrent GPU proving, aggregate subspan time can exceed wall-clock app_prove because work from different shards overlaps.

Stage	Before	After	Ratio / note
commit_traces	17.590s	17.263s	1.02x faster
build_tower_witness_gpu	1.428s	18.765s	13.14x more aggregate GPU work; interleaving moves real work into tower witness build
prove_tower_relation_gpu	22.565s	160.990s	7.13x more aggregate GPU work; wider interleaved towers replace many narrow towers
prove_batched_main_constraints	7.643s	7.364s	1.04x faster
pcs_opening	9.953s	7.996s	1.24x faster

Proof Size

Output	Before	After	Improvement
`output/app_proof.bitcode`	65.55 MiB	23.75 MiB	2.76x smaller, -63.8%

The proof-size improvement is the primary win: the same block proof drops by 40.80 MiB. This matches the design goal of reducing the number of tower specs and their per-spec proof metadata.

Testing

Validated through the benchmark CI runs above. Both runs completed successfully and verified the generated shard proofs.

Risks and Rollout

The main risk is representation mismatch between prover and verifier metadata for interleaved tower specs. The verifier must interpret operation bits consistently with the prover.
GPU aggregate tower time is higher in the current implementation. This PR trades some prover-side tower work for much smaller proof size and better wall-clock app proving on the measured block.
Rollback is straightforward: route tower witness build/proving back to the non-interleaved per-spec tower path.

Follow-ups

Continue optimizing interleaved tower GPU kernels, especially build/prove aggregate time.
Add more targeted benchmarks for chips with very high lookup counts, such as Keccak, to track the cost of wider towers separately from full-block overlap.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

This reverts commit 052788d.

hero78119 · 2026-06-17T13:36:40Z

Profiling update: full no-shard rerun after GPU memory-log gating

Benchmark shape: block 23587691, no --shard-id, CONCURRENT=1, WITGEN=0, MEM_TRACK=0, cache=1, h23, max-cell 6. The latest rerun uses the current feature branch with only the GPU memory-log gating hunks from 74a73354 ported.

Full-run wall time

Case	Result	`app.prove`	`app_prove.inner`	shard 0 `create_proof_of_shard`	shard 1 `create_proof_of_shard`	total shard create proof	`app.verify`	GPU memory logs
Baseline	pass	17.4s	16.5s	7.60s	6.14s	13.74s	6.30s	906 device / 4 baseline
Feature before log-gating	pass	18.3s	17.5s	8.09s	6.53s	14.62s	6.32s	918 device / 4 baseline
Feature after log-gating	pass	18.1s	17.2s	7.97s	6.42s	14.39s	6.21s	0 device / 0 baseline

Log-gating impact

Metric	Before	After	Delta
`app.prove`	18.3s	18.1s	-0.2s
`app_prove.inner`	17.5s	17.2s	-0.3s
total shard create proof	14.62s	14.39s	-0.23s
shard 0 create proof	8.09s	7.97s	-0.12s
shard 1 create proof	6.53s	6.42s	-0.11s
`app.verify`	6.32s	6.21s	-0.11s
GPU memory log lines	922	0	-922

Conclusion: GPU memory-log gating has a measurable but small full-run improvement. The main create_proof regression is not explained by memory-log printing/querying; after log gating, feature remains slower than baseline by +0.65s total shard create proof and +0.7s app.prove.

Proof-size comparison

Proof-size data is unchanged from the output-dir proof-size runs; the latest log-gating rerun did not use --output-dir.

Metric	Baseline	Feature	Delta
total bitcode bytes	9,010,490	3,678,613	-5,331,877 (-59.17%)
total MiB	8.59	3.51	-5.08 MiB
shard 0 bitcode bytes	5,753,614	1,977,959	-3,775,655 (-65.62%)
shard 1 bitcode bytes	3,256,747	1,700,431	-1,556,316 (-47.79%)
verified chip groups	61 / 52	62 / 53	+1 / +1

Tower proof remains the proof-size reduction source. PCS opening stays roughly flat around 1.08-1.09 MB per shard.

Runtime interpretation

The current feature path still improves proof size substantially, but the full-run speed regression remains after removing GPU memory logs. The next optimization target should stay on tower proving work, especially generic tower sumcheck and internal tower build/liveness, not scheduler memory-log printing.

hero78119 · 2026-06-18T01:58:48Z

CI slowdown investigation: runs 27701042458 vs 26959441485

Compared the benchmark job logs for:

Run	Ceno ref	Block	Runner	Settings
Baseline `26959441485`	`master`	`23817600`	`scroll-SEA2-10`	`CONCURRENT=1`, `WITGEN=0`, `CACHE=1`, max cell `1207959552`
Feature `27701042458`	`feat/opt`	`23817600`	`scroll-SEA2-10`	`CONCURRENT=1`, `WITGEN=0`, `CACHE=1`, max cell `1207959552`

The run is not slower because of a different block, runner, shard count, or benchmark setting. Both runs prove the same block and verify 13 shard proofs.

High-level result

Metric	Baseline	Feature	Delta
Benchmark job wall time	12m36s	15m46s	+3m10s
`app_prove.inner`	65.4s	105s	+39.6s
Sum `create_proof_of_shard`	62.01s	100.97s	+38.96s
Proof size	56.84 MiB	23.75 MiB	-58.2%

The workflow wall time increases by about 25%. The prover stage itself regresses from 65.4s to 105s, about 1.6x.

Stage breakdown from logs

Stage aggregate	Baseline	Feature	Delta
`commit_traces` sum	17.59s	16.22s	-1.37s
`prove_batched_main_constraints` sum	7.93s	8.13s	+0.20s
`prove_tower_relation` task sum	136.69s	274.01s	+137.32s
`build_tower_witness_gpu` task sum	18.07s	90.31s	+72.24s
`prove_tower_relation_gpu` task sum	115.63s	183.24s	+67.62s

Task sums overlap under concurrent proving, so they should not be read as wall time. They do identify the source of the regression: the extra wall time maps to tower proving, especially tower witness build and tower relation GPU proving. PCS commit is slightly faster on the feature branch, and batched main constraints are nearly unchanged.

Keccak memory comparison

Keccak path	Max estimated total
Baseline `Ecall_Keccak`	16112 MB
Feature `KeccakCore`	13950 MB
Feature `KeccakEcall`	95 MB

The feature branch reduces Keccak estimated memory and proof size, but the new Keccak/interleaving tower path is slower in tower proving. The current bottleneck is not proof size, PCS commit, or main constraints; it is the tower path overhead introduced by the feature branch.

Conclusion

The slowdown is caused by feat/opt tower proving overhead, not by CI variance or input-size mismatch. The next optimization target should be build_tower_witness_gpu and prove_tower_relation_gpu in the interleaved Keccak tower path.

hero78119 added 9 commits June 16, 2026 15:35

interleaving of tower proving

508be03

profile gpu tower memory

4893c70

estimate tower eq per round

a2a374f

Respect actual CUDA memory in chip scheduler

052788d

Revert "Respect actual CUDA memory in chip scheduler"

d26a561

This reverts commit 052788d.

Account tower input backing in GPU estimate

2400488

Account split heavy tower liveness

fd0613c

Merge branch 'feat/keccak_improvement' into feat/opt

4dcb4e5

fix gpu tower memory estimate

70bd4f2

hero78119 marked this pull request as draft June 17, 2026 12:27

gated log and update dep

537f1b8

hero78119 added 6 commits June 18, 2026 16:36

Cache verifier VK digest

76d2d34

Add shard proof profiling spans

e2fdec8

Add transcript metadata profiling spans

cea87fd

Cache prover VK digest

47412bc

misc: cleanup

9750ada

Use jagged assist gkr-backend branch

5b21730

hero78119 marked this pull request as ready for review June 19, 2026 06:56

hero78119 changed the title ~~interleaving multi-specs read/write/lk tower proving into single read/write/lk~~ Interleave read/write/lookup tower specs to reduce proof size Jun 19, 2026

hero78119 added 2 commits June 19, 2026 16:31

fix cpu prover and fmt

79ddd70

masked recursion test due to tower interleaving

1e2ff5b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interleave read/write/lookup tower specs to reduce proof size#1362

Interleave read/write/lookup tower specs to reduce proof size#1362
hero78119 wants to merge 18 commits into
feat/keccak_improvementfrom
feat/opt

hero78119 commented Jun 17, 2026 •

edited

Loading

Uh oh!

hero78119 commented Jun 17, 2026 •

edited

Loading

Uh oh!

hero78119 commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hero78119 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Terminology

Design Rationale

Witness Build

Proving

Change Highlights

Benchmark / Performance Impact

Wall Clock

Prover Stage Breakdown

Proof Size

Testing

Risks and Rollout

Follow-ups

Copilot Reviewer Directive (keep this section)

Uh oh!

hero78119 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Profiling update: full no-shard rerun after GPU memory-log gating

Full-run wall time

Log-gating impact

Proof-size comparison

Runtime interpretation

Uh oh!

hero78119 commented Jun 18, 2026

CI slowdown investigation: runs 27701042458 vs 26959441485

High-level result

Stage breakdown from logs

Keccak memory comparison

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hero78119 commented Jun 17, 2026 •

edited

Loading

hero78119 commented Jun 17, 2026 •

edited

Loading