
Streaming pipeline chunk-emitting #562

Merged
aram356 merged 74 commits into main from specs/streaming-response-optimization on Apr 24, 2026

Conversation


aram356 commented on Mar 25, 2026

Summary

This PR lands the full streaming-response optimization for the publisher proxy, reducing peak WASM memory from ~4× response size to constant and improving TTFB/TTLB on every processable origin response. It replaces the original "buffer everything, rewrite, send" path with a chunked pipeline that streams through the content rewriters and writes directly to the client.

Delivered across four phases (plus follow-up review cycles):

Phase 1 — Chunk-emitting pipeline (streaming_processor.rs)

  • Unified StreamingPipeline<R: Read, W: Write> with a single process_chunks loop that handles uncompressed, gzip↔gzip, gzip→none, deflate↔deflate, deflate→none, brotli↔brotli, and brotli→none, including explicit per-codec finalization semantics (finish() for flate2, flush() + into_inner() for brotli).
  • HtmlRewriterAdapter wraps lol_html::HtmlRewriter as a StreamProcessor that drains output after every chunk via a shared Rc<RefCell<Vec<u8>>> sink, so lol_html buffering no longer dictates allocation size.
  • Compression imports hoisted to module scope to satisfy the repo's no-local-imports convention.
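
As a rough sketch of the chunk-emitting shape (std-only; the `StreamProcessor` trait and the drain-after-each-chunk pattern mirror the description above, but the inner lol_html rewriter is replaced with a pass-through mock, so names and signatures here are illustrative assumptions, not the repo's code):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Illustrative trait in the spirit of the PR's StreamProcessor: each
// chunk may emit output immediately; `is_last` signals finalization.
trait StreamProcessor {
    fn process(&mut self, chunk: &[u8], is_last: bool) -> Vec<u8>;
}

// The adapter pattern described above: the wrapped rewriter writes into
// a shared Rc<RefCell<Vec<u8>>> sink, and the adapter drains that sink
// after every chunk so the rewriter's internal buffering never dictates
// allocation size.
struct SinkDrainingAdapter {
    sink: Rc<RefCell<Vec<u8>>>,
}

impl SinkDrainingAdapter {
    fn new() -> Self {
        Self { sink: Rc::new(RefCell::new(Vec::new())) }
    }
}

impl StreamProcessor for SinkDrainingAdapter {
    fn process(&mut self, chunk: &[u8], _is_last: bool) -> Vec<u8> {
        // Stand-in for lol_html::HtmlRewriter::write(chunk); the real
        // rewriter pushes rewritten bytes into the shared sink.
        self.sink.borrow_mut().extend_from_slice(chunk);
        // Drain whatever this chunk produced.
        std::mem::take(&mut *self.sink.borrow_mut())
    }
}

fn main() {
    let mut adapter = SinkDrainingAdapter::new();
    let mut out = Vec::new();
    for chunk in [b"<p>hel".as_slice(), b"lo</p>".as_slice()] {
        out.extend(adapter.process(chunk, false));
    }
    out.extend(adapter.process(&[], true));
    assert_eq!(out, b"<p>hello</p>");
}
```

The point of draining after every `process` call, rather than once at the end, is that peak memory tracks the chunk size instead of the response size.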

Phase 2 — Stream to client via StreamingBody (publisher.rs, adapter main.rs)

  • handle_publisher_request returns a PublisherResponse sum type (Stream / PassThrough / Buffered) instead of a single Response, so the adapter picks the right Fastly dispatch (stream_to_client() / send_to_client() / return-and-send).
  • The Fastly adapter moved from #[fastly::main] to an undecorated main() with Request::from_client(), since #[fastly::main] auto-sends the returned response and is incompatible with streaming.
  • process_response_streaming became generic over W: Write, so the same function writes into a Vec<u8> on the buffered route and a StreamingBody on the streaming route. Content-Length is removed on the streaming arm (chunked transfer).
  • restrict_accept_encoding (quality-value-aware) strips unsupported encodings from Accept-Encoding before proxying, so the origin never responds with a codec the pipeline cannot decode.
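
A minimal sketch of the quality-value handling (the function name and supported set come from the PR text; the parsing details are an assumption, not the repo's code):

```rust
// Codecs the streaming pipeline can decode.
const SUPPORTED: &[&str] = &["identity", "gzip", "deflate", "br"];

// Keep only supported codecs, and drop entries the client disabled
// with a zero quality value (q=0 means "not acceptable", RFC 9110).
fn restrict_accept_encoding(header: &str) -> String {
    header
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .filter(|entry| {
            let mut parts = entry.split(';');
            let codec = parts.next().unwrap_or("").trim();
            let q_zero = parts.any(|p| {
                p.trim()
                    .strip_prefix("q=")
                    .and_then(|v| v.parse::<f32>().ok())
                    .map(|q| q == 0.0)
                    .unwrap_or(false)
            });
            SUPPORTED.contains(&codec) && !q_zero
        })
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    // gzip;q=0 is disabled by the client, zstd is unsupported by us.
    assert_eq!(restrict_accept_encoding("br, gzip;q=0, zstd"), "br");
    // Nothing decodable left: origin must send identity.
    assert_eq!(restrict_accept_encoding("zstd"), "");
}
```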

Phase 3 — Fragment-safe script rewriters (google_tag_manager.rs, nextjs/script_rewriter.rs, nextjs/rsc_placeholders.rs)

  • With streaming input, lol_html now fragments <script> text nodes at chunk boundaries. NextJsNextDataRewriter and GoogleTagManagerIntegration accumulate intermediate fragments into a Mutex<String> buffer, suppress them via RemoveNode, and emit the full rewritten text on is_last_in_text_node. The Mutex is uncontended (the pipeline is single-threaded; lol_html::HtmlRewriter is !Send) and is documented with the invariant that the buffer is empty at the start of every text node.
  • NextJsRscPlaceholderRewriter deliberately does not accumulate — the accompanying NextJsHtmlPostProcessor re-parses the final document, so fragmented RSC scripts are safely deferred. Accumulating here would risk corrupting non-RSC scripts that happen to be fragmented.
  • GTM uses a script selector that matches every inline script. To prevent GTM's final Replace(accumulated) from overwriting an earlier NextJsNextDataRewriter::Replace on the same element (lol_html runs overlapping text! handlers sequentially; last text.replace wins), GTM only engages its accumulation path when the running text contains a GTM/GA marker or ends with a proper prefix of one of length ≥ 6 bytes ("google" — the common prefix of googletagmanager.com and google-analytics.com). Shorter tails (g, go, goo, goog, googl, plus common English/minified tokens like img, slug, config, thing) no longer claim ownership, so overlapping rewriters are preserved.
  • HtmlWithPostProcessing bifurcates: when no post-processors are registered, chunks pass through immediately (the streaming win); when post-processors are registered, output accumulates in an internal buffer until is_last, then post-processors run on the full document. Uses IntegrationRegistry::has_html_post_processors() (new, avoids cloning the Arc vec).
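
The GTM prefix gate can be illustrated with a small std-only reconstruction (the marker list and the 6-byte minimum come from the description above; the function name and exact shape are assumed):

```rust
const MARKERS: &[&str] = &["googletagmanager.com", "google-analytics.com"];
// "google" is the common prefix of both markers; anything shorter is
// ambiguous with ordinary tokens (img, slug, config, thing, ...).
const MIN_PREFIX_LEN: usize = 6;

// Engage accumulation only when the running text contains a full marker
// or ends with a proper prefix of one that is at least 6 bytes long;
// otherwise return false so overlapping rewriters keep ownership.
fn might_contain_gtm_marker(text: &str) -> bool {
    if MARKERS.iter().any(|m| text.contains(m)) {
        return true;
    }
    // A chunk boundary may have split a marker mid-token: accept only an
    // unambiguous trailing prefix.
    MARKERS
        .iter()
        .any(|m| (MIN_PREFIX_LEN..m.len()).any(|len| text.ends_with(&m[..len])))
}

fn main() {
    assert!(might_contain_gtm_marker("src=\"https://googletagmanager.com/gtm.js\""));
    // Fragment cut mid-marker: 6-byte tail "google" is a plausible continuation.
    assert!(might_contain_gtm_marker("src=\"https://google"));
    // Short "g" tail in ordinary text must not claim the script.
    assert!(!might_contain_gtm_marker("var slug = \"thing"));
}
```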

Phase 4 — Binary pass-through via send_to_client() (publisher.rs)

  • Non-processable 2xx responses (images, fonts, video) return PublisherResponse::PassThrough { response, body }. The adapter reattaches the body and calls send_to_client(), preserving Content-Length and avoiding chunked-encoding overhead. Fastly streams the body from its internal buffer without copying into WASM memory.
  • 204 No Content and 205 Reset Content are excluded at the top of the classifier (RFC 9110 §15.3.5 / §15.3.6 prohibit a message body) so no downstream arm can emit a body for a status that must not have one.
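
The take-body/reattach contract the adapter relies on can be mocked without the SDK (the `Response`/`Body` types below are std-only stand-ins for the Fastly types, not the SDK itself; only the invariant — bytes and Content-Length survive untouched — comes from the PR):

```rust
use std::io::Write;

struct Body(Vec<u8>);

struct Response {
    content_length: Option<usize>,
    body: Option<Body>,
}

impl Response {
    // Adapter detaches the body to decide routing...
    fn take_body(&mut self) -> Body {
        self.body.take().expect("body already taken")
    }
    // ...then reattaches it unmodified on the pass-through arm.
    fn set_body(&mut self, body: Body) {
        self.content_length = Some(body.0.len());
        self.body = Some(body);
    }
    // Stand-in for send_to_client(): serialize into any Write sink.
    fn send<W: Write>(self, sink: &mut W) {
        if let Some(body) = self.body {
            sink.write_all(&body.0).unwrap();
        }
    }
}

fn main() {
    let png = vec![0x89, b'P', b'N', b'G'];
    let mut resp = Response {
        content_length: Some(png.len()),
        body: Some(Body(png.clone())),
    };
    let body = resp.take_body();
    resp.set_body(body);
    let cl = resp.content_length;
    let mut wire = Vec::new();
    resp.send(&mut wire);
    // Byte-for-byte identity, Content-Length preserved.
    assert_eq!(wire, png);
    assert_eq!(cl, Some(4));
}
```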

Routing classifier (classify_response_route)

  • Pure function driving the PublisherResponse selection from (status, content_type, content_encoding, request_host, has_post_processors). Four outcomes:
    • PassThrough — 2xx non-processable.
    • Stream — processable content (2xx and non-2xx — branded 404/500 HTML and error JSON still get origin URL rewriting, matching pre-streaming behavior) with supported encoding and either non-HTML or no HTML post-processors.
    • BufferedProcessed — HTML with post-processors registered (full document needed for injection).
    • BufferedUnmodified — 204/205, empty request host with processable content (misconfig warning), or unsupported Content-Encoding.
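
The four outcomes above can be sketched as a pure function (variant names match the PR; the predicate inputs are flattened to booleans here, so the exact signature is an assumption):

```rust
#[derive(Debug, PartialEq)]
enum ResponseRoute {
    PassThrough,
    Stream,
    BufferedProcessed,
    BufferedUnmodified,
}

fn classify_response_route(
    status: u16,
    is_processable: bool,       // rewritable content type (HTML/CSS/JS/JSON)
    is_html: bool,
    encoding_supported: bool,   // ""/identity/gzip/deflate/br
    request_host_empty: bool,
    has_html_post_processors: bool,
) -> ResponseRoute {
    // 204/205 must not carry a body (RFC 9110 §15.3.5 / §15.3.6):
    // exclude them before any arm that could emit one.
    if status == 204 || status == 205 {
        return ResponseRoute::BufferedUnmodified;
    }
    let is_success = (200..300).contains(&status);
    if !is_processable {
        // 2xx binaries stream via pass-through; non-2xx stay buffered.
        return if is_success {
            ResponseRoute::PassThrough
        } else {
            ResponseRoute::BufferedUnmodified
        };
    }
    // Misconfigured proxy host or an undecodable codec: return untouched.
    if request_host_empty || !encoding_supported {
        return ResponseRoute::BufferedUnmodified;
    }
    if is_html && has_html_post_processors {
        return ResponseRoute::BufferedProcessed; // full document needed
    }
    ResponseRoute::Stream // includes branded non-2xx HTML/JSON
}

fn main() {
    assert_eq!(classify_response_route(200, false, false, true, false, false), ResponseRoute::PassThrough);
    assert_eq!(classify_response_route(404, true, true, true, false, false), ResponseRoute::Stream);
    assert_eq!(classify_response_route(200, true, true, true, false, true), ResponseRoute::BufferedProcessed);
    assert_eq!(classify_response_route(204, true, true, true, false, true), ResponseRoute::BufferedUnmodified);
}
```

Because the function is pure, the gate tests can assert directly on its return value instead of restating its boolean expressions — the review point that motivated the extraction.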

Unsupported encodings and accept-encoding restriction

  • is_supported_content_encoding (known-list gate on ""/identity/gzip/deflate/br) routes unsupported codecs (e.g. zstd) to BufferedUnmodified rather than feeding encoded bytes to the identity decoder.
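
A sketch of the known-list gate (the name and the supported set come from the PR; the trim/lowercase normalization is an assumption):

```rust
// Anything outside this set (e.g. zstd) must route to the unmodified
// fallback rather than being fed to the identity decoder as text.
fn is_supported_content_encoding(encoding: &str) -> bool {
    matches!(
        encoding.trim().to_ascii_lowercase().as_str(),
        "" | "identity" | "gzip" | "deflate" | "br"
    )
}

fn main() {
    assert!(is_supported_content_encoding("gzip"));
    assert!(is_supported_content_encoding("")); // absent header
    assert!(!is_supported_content_encoding("zstd"));
}
```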

Files touched

 crates/trusted-server-adapter-fastly/src/main.rs   |   85 +-
 crates/trusted-server-core/src/html_processor.rs   |  198 +++-
 .../src/integrations/google_tag_manager.rs         |  503 ++++++++-
 .../src/integrations/nextjs/mod.rs                 |   51 +
 .../src/integrations/nextjs/rsc_placeholders.rs    |   11 +-
 .../src/integrations/nextjs/script_rewriter.rs     |  154 ++-
 .../src/integrations/registry.rs                   |    9 +
 crates/trusted-server-core/src/publisher.rs        |  824 +++++++++++---
 .../trusted-server-core/src/streaming_processor.rs |  867 +++++++++------
 docs/superpowers/plans/2026-03-25-streaming-response.md
 docs/superpowers/specs/2026-03-25-streaming-response-design.md
 (+ CI pin cleanup from #656 reflected via main merge)

Test plan

Shipped in this PR (all green under cargo test --workspace, cargo clippy --workspace --all-targets --all-features -- -D warnings, cargo fmt --all -- --check):

  • Pipeline codecs: uncompressed, gzip↔gzip, gzip→none, deflate↔deflate, brotli↔brotli round-trip tests; every compressed-output test decompresses and asserts bytes, proving encoder finalization.
  • Chunk-boundary fragmentation invariant: lol_html_fragments_text_across_chunk_boundaries verifies lol_html splits text at chunk boundaries (the assumption all fragment-safe rewriters rely on).
  • HtmlRewriterAdapter: streams per chunk, handles large input, tolerates reset() as a no-op, and emits output before is_last.
  • Classifier: 14 gate tests covering every (status, content-type, encoding, host, post-processors) combination, including 204/205 exclusion for processable content, non-2xx processable streaming, and unsupported-encoding fallback.
  • Fragment-safe rewriters: unit tests for NextJsNextDataRewriter accumulation, no-URL passthrough, and GTM's prefix gate (might_contain_gtm_prefix_detects_full_match_and_boundary_prefix, might_contain_gtm_prefix_rejects_short_ambiguous_tails). End-to-end tests with chunk_size=32 (small_chunk_next_data_rewrite_survives_fragmentation, small_chunk_gtm_rewrite_survives_fragmentation) force fragmentation through the full HTML pipeline.
  • Overlap regressions: fragmented_next_data_survives_with_gtm_enabled and fragmented_next_data_with_trailing_g_survives_gtm (this latter exercises the short-g-tail overlap case from review P1) confirm NextJs rewrites are not clobbered by GTM on the same fragmented element.
  • Pass-through: classifier coverage for every 2xx non-processable path plus pass_through_preserves_body_and_content_length proving byte-for-byte identity.
  • stream_publisher_body end-to-end: gzip CSS round-trip with origin rewriting.
  • Accept-Encoding restriction: absent header, unsupported-only (zstd), identity-only, quality-value handling (br, gzip;q=0).
  • EC header + consent handling on the streaming path: headers set before body consumption; revocation targets cookie EC ID not header EC ID.

Coverage gaps from the initial audit now closed (commit ebbfae7):

  • small_chunk_rsc_push_survives_fragmentation_via_post_processor_fallback — fragmented __next_f.push through the post-processor fallback (also the path suspected in the autoblog parse error).
  • stream_publisher_body_handles_empty_body — empty origin body on the streaming route.
  • stream_publisher_body_surfaces_mid_stream_decode_error — mid-stream decode error must return Err so the adapter can drop the StreamingBody instead of emitting malformed output.
  • publisher_response_pass_through_reattach_preserves_bytes — walks the adapter's take-body/reattach pattern explicitly.
  • buffered_processed_sets_content_length_from_processed_body — asserts the classifier → pipeline composition for HTML + post-processor.
  • document_state_placeholders_substitute_through_accumulating_path — the IntegrationDocumentState entry written during streaming is read and consumed by the post-processor.

Test totals: 848 workspace tests, all green. Clippy + fmt clean.

Rollout

aram356 added 3 commits March 25, 2026 08:20
Move spec-like documents from root and docs/internal into a structured
layout under docs/superpowers with timestamped filenames:

- specs/: active design documents (SSC PRD, technical spec, EdgeZero
  migration, auction orchestration flow, production readiness report)
- archive/: completed or historical specs (optimization, sequence,
  publisher IDs audit)
Design spec for streaming HTTP responses through the publisher proxy
when Next.js is disabled. Covers two implementation steps:

1. Make the streaming pipeline chunk-emitting (HtmlRewriterAdapter,
   gzip, encoder finalization)
2. Wire up Fastly StreamingBody via stream_to_client() with entry
   point migration from #[fastly::main] to undecorated main()

Includes streaming gate logic, error handling, rollback strategy,
and testing plan. Verified against Fastly SDK 0.11.12 API.
aram356 added 3 commits March 25, 2026 10:57
Add before/after measurement protocol using Chrome DevTools MCP tools:
network timing capture, Lighthouse audits, performance traces, memory
snapshots, and automated body hash comparison for correctness.
Base automatically changed from docs/organize-specs-and-archive to main March 26, 2026 15:57
aram356 added 17 commits March 26, 2026 08:58
…ation test

- Add debug-level logging to process_chunks showing total bytes
  read and written per pipeline invocation
- Add brotli-to-brotli round-trip test to cover the into_inner()
  finalization path
- Add test proving HtmlWithPostProcessing accumulates output when
  post-processors are registered while streaming path passes through
- Group std imports together (cell, io, rc) before external crates
- Document supported compression combinations on PipelineConfig
- Remove dead-weight byte counters from process_chunks hot loop
- Fix stale comment referencing removed process_through_compression
- Fix brotli finalization: use drop(encoder) instead of into_inner()
  to make the intent clear (CompressorWriter writes trailer on drop)
- Document reset() as no-op on HtmlRewriterAdapter (single-use)
- Add brotli round-trip test covering into_inner finalization path
- Add gzip HTML rewriter pipeline test (compressed round-trip with
  lol_html, not just StreamingReplacer)
- Add HtmlWithPostProcessing accumulation vs streaming behavior test
- Add Eq derive to Compression enum (all unit variants, trivially correct)
- Brotli finalization: use into_inner() instead of drop() to skip
  redundant flush and make finalization explicit
- Document process_chunks flush semantics: callers must still call
  encoder-specific finalization after this method returns
- Warn when HtmlRewriterAdapter receives data after finalization
  (rewriter already consumed, data would be silently lost)
- Make HtmlWithPostProcessing::reset() a true no-op matching its doc
  (clearing auxiliary state without resetting rewriter is inconsistent)
- Document extra copying overhead on post-processor path vs streaming
- Assert output content in reset-then-finalize test (was discarded)
- Relax per-chunk emission test to not depend on lol_html internal
  buffering behavior — assert concatenated correctness + at least one
  intermediate chunk emitted
aram356 added 13 commits April 8, 2026 18:57
Non-processable 2xx responses (images, fonts, video) now stream
directly to the client via PublisherResponse::PassThrough instead
of buffering the entire body in memory. Content-Length is preserved
since the body is unmodified.
Tests verify non-processable 2xx responses return PassThrough,
non-processable errors stay Buffered, and processable content
goes through Stream (not PassThrough).
Adds pass_through_preserves_body_and_content_length test that
verifies io::copy produces identical output and Content-Length
is preserved. Updates handle_publisher_request doc to describe
all three response variants.
- Exclude 204 No Content from PassThrough (must not have body)
- Remove Content-Length before streaming (stream_to_client uses
  chunked encoding, keeping both violates HTTP spec)
- Add tests for 204 exclusion and empty-host interaction
- Update doc comment and byte-level test to reflect CL removal
PassThrough reattaches the unmodified body and uses send_to_client()
instead of stream_to_client() + io::copy. This preserves
Content-Length (avoids chunked encoding overhead for images/fonts)
and lets Fastly stream from its internal buffer without WASM memory
buffering.
- Fix PassThrough doc comment operation order (set_body before finalize)
- Update function doc to describe actual PassThrough flow (set_body +
  send_to_client, not io::copy)
- Remove dead _enters_early_return variable, replace with comment
When the origin returns a processable 2xx response with an encoding the
pipeline cannot decompress (e.g. `zstd` from a misbehaving origin), the
buffered fallback previously still routed the body through
process_response_streaming. `Compression::from_content_encoding` maps
unknown values to `None`, so the rewriter would treat the compressed
bytes as identity-encoded text and emit garbled output.

Bypass the rewrite pipeline entirely in that case and return the origin
response untouched. Adds a test asserting byte-for-byte pass-through
and updates the is_supported_content_encoding doc to reflect the new
behavior.

Addresses PR #585 review feedback from @prk-Jr.
…eaming-pipeline-phase2

# Conflicts:
#	crates/trusted-server-core/src/publisher.rs
- Clarify PassThrough variant doc: finalize_response() and send_to_client()
  are applied at the outer dispatch level, not in this arm
- Hoist status outside the early-return block and reuse is_success to
  eliminate the duplicate get_status() call
- Exclude 205 Reset Content alongside 204 No Content per RFC 9110 §15.3.6;
  add pass_through_gate_excludes_205_reset_content test
- Log binary pass-through before returning to aid production tracing
aram356 changed the title from "Phase 1: Spec + Make streaming pipeline chunk-emitting" to "Streaming pipeline chunk-emitting" on Apr 17, 2026
@ChristianPavilonis left a comment


Second pass review. Both prior inline comments are addressed — the has_html_post_processors() optimization is present in registry.rs and the streaming plan doc was updated.

New findings:

  1. Doc comment copy-paste in HtmlRewriterAdapter::reset() (commented inline). Minor, but renders incorrect in rustdoc.

  2. Gate tests validate boolean expressions, not function behavior (commented inline). The streaming_gate_* and pass_through_gate_* tests reproduce the conditional logic from handle_publisher_request as local variables and assert on those variables. They pass even if the gate in the actual function is changed. The unsupported_encoding_response_is_returned_unmodified test (which uses a real Response) is the right model — the gate tests should follow that pattern using handle_publisher_request returning a PublisherResponse variant.

  3. Missing WARN log when processable content is skipped due to empty request host (commented inline). The combined condition !should_process || request_host.is_empty() || !is_success logs at DEBUG for all three cases. An empty request_host with processable content is a misconfiguration that silently returns unrewritten HTML/JS to the client. Worth a separate warn branch.

No blocking issues. The core streaming architecture — chunk-emitting HtmlRewriterAdapter, unified process_chunks loop, PublisherResponse enum, and the pass-through/stream/buffered routing in main.rs — is sound. The Mutex accumulation buffers in GTM and NextJS rewriters are correctly drained on the final fragment path. The brotli encoder finalization via flush() inside process_chunks followed by into_inner() is correct. Content-Length handling is consistent across all three response paths.


@ChristianPavilonis left a comment


Findings from second pass are non-blocking — approving.

aram356 removed the request for review from jevansnyc on April 20, 2026 16:00

@prk-Jr left a comment


👍 Looks good

aram356 added 5 commits April 22, 2026 12:06
- Extract gate decision into pure classify_response_route so the
  formula lives in one place; handle_publisher_request dispatches on
  the returned ResponseRoute variant. Gate tests now exercise this
  function directly instead of re-stating its boolean expressions.
- Emit log::warn! when processable content returns unmodified because
  request_host is empty (misconfigured proxy Host header) so it is
  visible at production log levels.
- Fix copy-paste in HtmlRewriterAdapter::reset doc comment.
- Route non-2xx processable responses through the streaming pipeline so
  branded 404/500 HTML and error JSON still get origin URL rewriting,
  matching pre-streaming behavior on main.
- Exclude 204 No Content and 205 Reset Content at the top of
  classify_response_route so no downstream arm can emit a body for a
  status that prohibits one (RFC 9110 §15.3.5 / §15.3.6), regardless of
  Content-Type or post-processor registration.
- Stop GoogleTagManagerIntegration from claiming scripts it cannot
  plausibly rewrite. The rewriter now returns Keep on every fragment
  unless the running text contains a GTM/GA domain or ends with a proper
  prefix of one. Fixes a regression where GTM's Replace(unchanged_full)
  clobbered NextJsNextDataRewriter's Replace on fragmented __NEXT_DATA__
  text nodes, because lol_html runs overlapping text handlers in order
  and last-replace-wins.
- Hoist flate2/brotli imports in StreamingPipeline::process to module
  scope to satisfy the repo's no-local-imports convention.
- Add regression tests: non-2xx Stream/BufferedProcessed routing, 204/205
  exclusion for processable content, NextJs+GTM fragmented __NEXT_DATA__
  round-trip, and boundary-prefix detection for GTM markers.
…ness

P1: Tighten GoogleTagManagerIntegration's boundary-safe prefix gate.
Previously any non-empty prefix of googletagmanager.com /
google-analytics.com was treated as a plausible continuation, so
fragments whose tails matched as little as "g" (common in English and
minified identifiers like "img", "slug", "config", "thing") engaged
GTM's RemoveNode + Replace path. On fragmented scripts that turned out
not to be GTM, GTM would then re-emit Replace(unchanged) on the last
fragment and clobber replacements from overlapping script rewriters
(e.g., NextJsNextDataRewriter on script#__NEXT_DATA__), because
lol_html runs overlapping text! handlers in order and the last
text.replace wins.

Require a minimum trailing prefix length of 6 bytes ("google" — the
common prefix of both markers). Below that, tails are ambiguous with
ordinary tokens and must not engage accumulation. Add a regression
test (fragmented_next_data_with_trailing_g_survives_gtm) that drives
fragmented __NEXT_DATA__ with an intermediate fragment ending in
a "g"-tail word and verifies NextJs's URL rewrite survives. Add a
dedicated unit test covering short-tail rejection and tighten the
existing full-match / boundary-prefix coverage.

P3: Refresh publisher docs to match the streaming behavior.
The module-level comment claimed process_response_streaming accepts
and returns fastly::Body, but it now writes into a generic Write and
returns (). The PublisherResponse docs and handle_publisher_request
summary also still described Stream as 2xx-only, while the current
classifier intentionally routes processable non-2xx responses through
Stream / BufferedProcessed too.
Adds six regression tests covering previously thin or uncovered paths
introduced by the streaming optimization:

1. Fragmented RSC push script through the post-processor fallback
   (nextjs/mod.rs::small_chunk_rsc_push_survives_fragmentation_via_post_processor_fallback).
   Drives a large self.__next_f.push([...]) script through the full
   StreamingPipeline with chunk_size = 128 so NextJsRscPlaceholderRewriter
   returns Keep on every fragment and post_process_rsc_html_in_place_with_limit
   has to re-parse the accumulated HTML. Asserts all origin URLs are
   rewritten, no placeholder leaks, and the push-call envelope stays
   structurally intact (this is the path I suspect in the autoblog parse
   error; now we have a local harness to iterate on it).

2. Empty origin body on the streaming route
   (publisher.rs::stream_publisher_body_handles_empty_body). Exercises
   the Ok(0)-on-first-read branch of process_chunks plus the processor's
   is_last=true, chunk=[] terminal call.

3. Mid-stream decode error surfacing
   (publisher.rs::stream_publisher_body_surfaces_mid_stream_decode_error).
   Claims gzip encoding and feeds non-gzip bytes; asserts the error
   propagates out of stream_publisher_body so the adapter can drop the
   StreamingBody rather than emit malformed output.

4. PassThrough adapter contract
   (publisher.rs::publisher_response_pass_through_reattach_preserves_bytes).
   Walks the adapter's take-body/reattach pattern explicitly and asserts
   byte-for-byte identity plus Content-Length preservation, complementing
   the classifier-only test.

5. BufferedProcessed dispatch contract
   (publisher.rs::buffered_processed_sets_content_length_from_processed_body).
   Confirms nextjs integration registers an HTML post-processor, that the
   classifier routes HTML to BufferedProcessed when one is present, and
   that feeding HTML through the same pipeline yields a rewritten body.

6. Document-state survival through the accumulating path
   (publisher.rs::document_state_placeholders_substitute_through_accumulating_path).
   Verifies that NextJsRscPlaceholderRewriter's IntegrationDocumentState
   entry written during streaming is read by NextJsHtmlPostProcessor
   after accumulation, so placeholders never leak to output and the
   origin URL in the RSC payload is rewritten.

All tests pass (848 total, +6 from this commit). Clippy and fmt clean.
aram356 merged commit 5ff3e67 into main on Apr 24, 2026
13 checks passed
aram356 deleted the specs/streaming-response-optimization branch on April 24, 2026 at 20:32


Development

Successfully merging this pull request may close these issues:

  • Create spec and plan for streaming response for non-Next.js publisher proxy
  • Implement streaming response optimization for non-Next.js publisher proxy
