Fix loadHtml failing on HTML with JSON-escape-required characters #204

Merged: just-be-dev merged 5 commits into main from load-html-fix on May 25, 2026

@just-be-dev just-be-dev commented May 25, 2026

Summary

Fixes #203 and refactors the stdin JSON parser to eliminate the bug class entirely.

The bug: loadHtml silently failed whenever the HTML payload contained characters that require JSON escaping (" in attribute values, newlines, backslashes, control chars, etc.). Root cause was in process_input's streaming JSON reconstruction — string values and field names were re-emitted wrapped in raw " without re-escaping, producing invalid JSON that serde_json::from_str then rejected. Only loadHtml was affected because the initial html is passed via CLI argv and bypasses this code path.
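The failure mode described above can be illustrated with a small, dependency-free sketch. Both `naive_wrap` and `escaped_wrap` are hypothetical names, not code from this repository; the hand-rolled escaper only stands in for the role that `serde_json::to_string()` plays in the actual fix, and it covers just enough of the JSON string rules to show the difference.

```rust
/// Naive reconstruction: wrap the raw value in quotes with no escaping.
/// This mirrors the bug class described above, not the original code.
fn naive_wrap(value: &str) -> String {
    format!("\"{}\"", value)
}

/// Hand-rolled JSON string escaper, for illustration only.
/// In the real fix this role is played by serde_json::to_string().
fn escaped_wrap(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for c in value.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters are written as \u00XX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    // An HTML fragment with quoted attributes and a trailing newline.
    let html = "<a href=\"/x\">hi</a>\n";
    println!("naive:   {}", naive_wrap(html));
    println!("escaped: {}", escaped_wrap(html));
}
```

In the naive output the first inner `"` ends the string literal early and the raw newline is not allowed inside a JSON string, so a strict parser rejects the reconstructed document. For payloads without special characters the two functions agree, which is consistent with the bug going unnoticed on simple HTML.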

The fix and follow-on refactor:

  1. 9b38cb0 — Minimal fix: use serde_json::to_string() to properly escape field names and string values during reconstruction. Adds 7 tests covering quotes/newlines, backslashes, control chars, unicode, braces in strings, ~64 KB payloads, and special chars in loadUrl headers.

  2. a336acc — Replace the actson-based stream-parse-then-reconstruct-then-reparse approach with serde_json::Deserializer::from_reader(...).into_iter::<Request>(). The actson streaming property wasn't actually being used (each Request is consumed as a complete unit anyway), and the reconstruction step was the root cause of #203 ("The loadHtml method does not work."). Drops the actson dependency. ~80 lines deleted.

  3. e6e029e — Add 6 tests for behaviors specific to the new implementation: whitespace-separated values, stream termination on malformed JSON, stream termination on schema mismatch, empty input, byte-by-byte chunked reads (verifies real streaming), and graceful shutdown when the receiver is dropped.

Test plan

  • cargo test --lib — all 18 tests pass (5 pre-existing + 13 new)
  • cargo build — binary builds cleanly with actson removed
  • Original repro test (test_process_input_load_html_with_quotes_and_newlines) fails on main, passes on this branch
  • Manual sanity check: run examples/load-html.ts with a real-world HTML payload containing quoted attributes

Behavior changes worth noting

  • On malformed JSON or schema mismatch, the stream now terminates rather than logging and continuing. serde_json::StreamDeserializer cannot resync after a parse error, and silently continuing past garbage would risk desync rather than recovery. Behavior is covered by test_process_input_malformed_json_terminates_stream and test_process_input_wrong_schema_terminates_stream.
  • Worker thread no longer panics if the receiver is dropped (previously sender.send(...).unwrap()).
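The second behavior change can be illustrated with the standard library alone. The sketch below is an assumption about the general pattern, not the PR's code: `forward` is a hypothetical function that checks the result of `send` instead of calling `unwrap()`, so a dropped receiver ends the loop quietly.

```rust
use std::sync::mpsc;
use std::thread;

/// Forward items to `sender`, returning how many were delivered.
/// Stops quietly if the receiver has been dropped instead of panicking.
fn forward(items: Vec<String>, sender: mpsc::Sender<String>) -> usize {
    let mut delivered = 0;
    for item in items {
        // send() only fails when the receiving end is gone.
        if sender.send(item).is_err() {
            break;
        }
        delivered += 1;
    }
    delivered
}

fn main() {
    let (tx, rx) = mpsc::channel();
    drop(rx); // simulate the receiver going away
    let worker = thread::spawn(move || forward(vec!["a".into(), "b".into()], tx));
    // join() yields Ok here because the worker returned without panicking.
    println!("{:?}", worker.join());
}
```

With `sender.send(item).unwrap()` in place of the `is_err()` check, the same scenario would panic inside the worker thread and `join()` would return an `Err`.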

🤖 Generated with Claude Code

just-be-dev and others added 3 commits May 25, 2026 19:32
…sue #203)

The streaming JSON parser in process_input reconstructs JSON by re-emitting
string and field name values wrapped in quotes, but without re-escaping
special characters. This produces invalid JSON when the input contains
quotes, backslashes, control characters, or other characters that require
JSON escaping, causing loadHtml payloads to silently fail deserialization.

Fixed by using serde_json::to_string() to produce properly escaped JSON
string literals for both FieldName and ValueString events.

Also added 7 test cases covering:
- Quotes and newlines (reproduces original issue #203)
- Backslashes in JS regex and Windows paths
- Control characters (tabs, CR, form-feed)
- Unicode (multi-byte UTF-8, emoji, DEL)
- Braces and brackets inside string content
- Long payloads (~64 KB with escape-requiring chars)
- Special chars in header map keys and values

All tests round-trip payloads through process_input and verify byte-for-byte
fidelity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actson-based approach stream-parsed JSON into events and then
reconstructed each top-level value as a String to hand back to
serde_json::from_str. That reconstruction was the source of issue #203
(naive string wrapping without escaping) and adds complexity without
buying anything for this use case — each Request is consumed as a
complete unit, so the per-value streaming actson enables is unused.

serde_json::Deserializer::from_reader(...).into_iter::<Request>()
provides the same effective behavior (streaming reads from stdin,
one value at a time) in ~10 lines with no reconstruction step.

Drops the actson dependency entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers behaviors of the serde_json-based stream reader that the previous
suite didn't exercise:

- Whitespace separation between top-level values
- Malformed JSON terminates the stream (no resync)
- Schema mismatch terminates the stream
- Empty input exits cleanly
- Chunked reader (one byte per read) — verifies actual streaming
- Receiver dropped does not panic the worker thread

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
just-be-dev and others added 2 commits May 25, 2026 19:35
Skipping 0.3.1 since that tag already exists from a prior release where
Cargo.toml was not updated alongside the tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both clients pick up binary 0.3.2 to inherit the loadHtml fix (#203).
CHANGELOG entry restructured to cover all three artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@just-be-dev just-be-dev merged commit b1b17ce into main May 25, 2026
4 checks passed
@just-be-dev just-be-dev deleted the load-html-fix branch May 25, 2026 23:41

Development

Successfully merging this pull request may close these issues.

The loadHtml method does not work.