Fix loadHtml failing on HTML with JSON-escape-required characters#204
Merged
Conversation
…sue #203) The streaming JSON parser in process_input reconstructs JSON by re-emitting string and field name values wrapped in quotes, but without re-escaping special characters. This produces invalid JSON when the input contains quotes, backslashes, control characters, or other characters that require JSON escaping, causing loadHtml payloads to silently fail deserialization. Fixed by using serde_json::to_string() to produce properly escaped JSON string literals for both FieldName and ValueString events. Also added 6 comprehensive test cases covering: - Quotes and newlines (reproduces original issue #203) - Backslashes in JS regex and Windows paths - Control characters (tabs, CR, form-feed) - Unicode (multi-byte UTF-8, emoji, DEL) - Braces and brackets inside string content - Long payloads (~64 KB with escape-requiring chars) - Special chars in header map keys and values All tests round-trip payloads through process_input and verify byte-for-byte fidelity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actson-based approach stream-parsed JSON into events and then reconstructed each top-level value as a String to hand back to serde_json::from_str. That reconstruction was the source of issue #203 (naive string wrapping without escaping) and adds complexity without buying anything for this use case — each Request is consumed as a complete unit, so the per-value streaming actson enables is unused. serde_json::Deserializer::from_reader(...).into_iter::<Request>() provides the same effective behavior (streaming reads from stdin, one value at a time) in ~10 lines with no reconstruction step. Drops the actson dependency entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers behaviors of the serde_json-based stream reader that the previous suite didn't exercise: - Whitespace separation between top-level values - Malformed JSON terminates the stream (no resync) - Schema mismatch terminates the stream - Empty input exits cleanly - Chunked reader (one byte per read) — verifies actual streaming - Receiver dropped does not panic the worker thread Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e6e029e to
49d7b0d
Compare
Skipping 0.3.1 since that tag already exists from a prior release where Cargo.toml was not updated alongside the tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both clients pick up binary 0.3.2 to inherit the loadHtml fix (#203). CHANGELOG entry restructured to cover all three artifacts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #203 and refactors the stdin JSON parser to eliminate the bug class entirely.
The bug:
loadHtmlsilently failed whenever the HTML payload contained characters that require JSON escaping ("in attribute values, newlines, backslashes, control chars, etc.). Root cause was inprocess_input's streaming JSON reconstruction — string values and field names were re-emitted wrapped in raw"without re-escaping, producing invalid JSON thatserde_json::from_strthen rejected. OnlyloadHtmlwas affected because the initialhtmlis passed via CLI argv and bypasses this code path.The fix and follow-on refactor:
9b38cb0— Minimal fix: useserde_json::to_string()to properly escape field names and string values during reconstruction. Adds 7 tests covering quotes/newlines, backslashes, control chars, unicode, braces in strings, ~64 KB payloads, and special chars inloadUrlheaders.a336acc— Replace the actson-based stream-parse-then-reconstruct-then-reparse approach withserde_json::Deserializer::from_reader(...).into_iter::<Request>(). The actson streaming property wasn't actually being used (eachRequestis consumed as a complete unit anyway), and the reconstruction step was the root source of The loadHtml method does not work. #203. Drops theactsondependency. ~80 lines deleted.e6e029e— Add 6 tests for behaviors specific to the new implementation: whitespace-separated values, stream termination on malformed JSON, stream termination on schema mismatch, empty input, byte-by-byte chunked reads (verifies real streaming), and graceful shutdown when the receiver is dropped.Test plan
cargo test --lib— all 18 tests pass (5 pre-existing + 13 new)cargo build— binary builds cleanly with actson removedtest_process_input_load_html_with_quotes_and_newlines) fails onmain, passes on this branchexamples/load-html.tswith a real-world HTML payload containing quoted attributesBehavior changes worth noting
serde_json::StreamDeserializercannot resync after a parse error, and silently continuing past garbage would risk desync rather than recovery. Behavior is covered bytest_process_input_malformed_json_terminates_streamandtest_process_input_wrong_schema_terminates_stream.sender.send(...).unwrap()).🤖 Generated with Claude Code