Extract data from Google search result carousel into JSON format #386
Open
KiChjang wants to merge 7 commits into
Author
|
Added an additional |
This PR implements a full HTML parser for the Van Gogh code challenge and hardens it through multiple iterations focused on brittleness reduction and regression safety. Resolution of the hidden thumbnail image src lives in its own dedicated class, DeferredImageExtractor.

Data structure assumptions
This parser is intentionally structure-based for the HTML part and behaviour-based for the deferred image source part (i.e. it matches variables to the actual function arguments in _setImagesSrc), but it still relies on a few stable assumptions about the Google search results HTML page and its inline scripts:

Carousel entry assumptions (DOM parser)
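- Each entry is an <a> node.
- Its href contains "/search?" and "q=".
- It has an <img> descendant.
- Text is taken from leaf nodes, i.e. nodes whose children are all text (node.children.all?(&:text?)).
- […] extensions[0] in the output JSON.
- "/search?..." links are prefixed with https://www.google.com.
- […] (&) are unescaped.

Image source assumptions (DOM + script parser)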
- If <img data-src> is present, it is the canonical image URL.
- If data-src is absent, the image is resolved by <img id> via script mappings.
- […] img[id] nodes are considered valid targets.

Deferred script assumptions (JS-like parser)
- […] "data:image" and "_setImagesSrc", the latter being the function that sets the src attribute of the carousel thumbnail.
- Mappings are resolved per _setImagesSrc(...) call (not per script block).
- Arguments are resolved from literals ('data:image...') or from variable assignments (var x = 'data:image...').
- […] _setImagesSrc, and no code comments containing an open brace { exist between the opening brace of the enclosing block and the call to _setImagesSrc.
- \xNN decoding in data URIs (e.g. \x3d -> =).

Iterations
Initial working parser (class-based)
Used known Google CSS classes for name/date/image. Functional, but brittle.
Moved to structure-based extraction
Replaced class dependence with semantic structure matching (<a> + <img> + leaf text nodes), making DOM matching more resilient to class churn.
Handled escaped script payloads correctly
Found mismatch in base64 output due to JS escaping (\x3d instead of =). Added unescape_js_string to decode escaped bytes.
Reduced brittle regex dependence on variable names/order
Reworked script extraction away from strict var s=...; var ii=[...] assumptions.
Bug found during review: one URI accidentally applied to all IDs in a script
Corrected by mapping per _setImagesSrc(...) call, resolving IDs and URI from literals/assignments for that call.
Added regression fixture to lock this behavior in
Introduced a fixture with multiple mappings in one script and varying argument order, so this class of bug is now detectable by the tests.
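
The per-call mapping and the \xNN decoding described in the iterations above can be sketched in Ruby as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the regexes, the resolve and image_mappings helpers, the sample script, and the "nearest preceding assignment" lookup are all hypothetical, and unescape_js_string here is only a guess at what a helper of that name might do. The sketch also ignores the enclosing-block scoping introduced in a later iteration.

```ruby
# Illustrative sketch only: names and regexes are assumptions, not the PR's code.
STRING = /'(?:[^'\\]|\\.)*'/          # single-quoted JS string literal
ARRAY  = /\[[^\]]*\]/                 # flat array literal such as ['a','b']
IDENT  = /[A-Za-z_]\w*/               # plain identifier
ARG    = /#{STRING}|#{ARRAY}|#{IDENT}/
CALL   = /_setImagesSrc\(\s*(#{ARG})\s*,\s*(#{ARG})\s*\)/
ASSIGN = /\b(?:var|let|const)\s+(#{IDENT})\s*=\s*(#{STRING}|#{ARRAY})/

# Decode \xNN escapes (ASCII range assumed), e.g. \x3d becomes "=".
def unescape_js_string(str)
  str.gsub(/\\x(\h\h)/) { Regexp.last_match(1).hex.chr }
end

# Literals pass through unchanged; identifiers resolve to the nearest
# assignment that appears before the call.
def resolve(arg, preceding_source)
  return arg if arg.start_with?("'", "[")
  found = preceding_source.scan(ASSIGN).reverse.find { |name, _| name == arg }
  found && found[1]
end

# Build an id => data URI hash, one _setImagesSrc call at a time, so a URI
# from one call is never applied to the IDs of another call.
def image_mappings(script)
  mappings = {}
  script.scan(CALL) do |first, second|
    before = script[0...Regexp.last_match.begin(0)]
    values = [first, second].map { |arg| resolve(arg, before) }.compact
    uri = values.find { |v| v.start_with?("'data:image") } # argument order is not assumed
    ids = values.find { |v| v.start_with?("[") }
    next unless uri && ids
    ids.scan(/'([^']*)'/).flatten.each do |id|
      mappings[id] = unescape_js_string(uri[1..-2]) # strip the surrounding quotes
    end
  end
  mappings
end

# Hypothetical script with two calls, reused variable names, and swapped arguments.
script = <<~'JS'
  (function(){var s='data:image/png;base64,AAA\x3d';var ii=['img1','img2'];_setImagesSrc(ii,s);})();
  (function(){let s='data:image/png;base64,BBB';const ii=['img3'];_setImagesSrc(s,ii);})();
JS
mappings = image_mappings(script)
```

Because each call is resolved against the source that precedes it, the second call picks up its own s and ii rather than the values from the first block, which is the class of bug the regression fixture is meant to catch.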
Expanded regex patterns to include variable assignment keywords
Added let and const to the keywords matched when scanning for data URIs and image IDs, so the scan tolerates upstream changes to the assignment keyword.
Tightened the data URI and image ID scan to the enclosing block
Optimized the scan by localizing it to the enclosing block of _setImagesSrc, and prevented identifier shadowing.
Extracted deferred image resolution logic into its own class
Improved maintainability and testability by separating the logic-heavy deferred image resolution into DeferredImageExtractor, complete with tests and fixtures.
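
The idea of localizing the scan to the enclosing block of _setImagesSrc can be sketched as a backward brace walk. This is a hedged illustration rather than what DeferredImageExtractor actually does: the enclosing_block_prefix name and the sample script are hypothetical, and the sketch leans on the assumption listed earlier that no stray braces (for example inside comments) sit between the opening brace of the block and the call.

```ruby
# Illustrative sketch only: the method name is an assumption, not the PR's code.
# Returns the source between the opening brace of the block that encloses
# call_index and the call itself. Braces inside strings or comments are not
# handled, in line with the stated assumptions.
def enclosing_block_prefix(script, call_index)
  depth = 0
  (call_index - 1).downto(0) do |i|
    case script[i]
    when "}"
      depth += 1 # walking backwards, a closing brace opens a nested block
    when "{"
      return script[(i + 1)...call_index] if depth.zero?
      depth -= 1 # this brace closes a nested block we already skipped into
    end
  end
  script[0...call_index] # no enclosing block: fall back to everything before the call
end

# Hypothetical script with two sibling blocks.
script = "(function(){var s='A';_setImagesSrc(s,['i1']);})();" \
         "(function(){var t='B';_setImagesSrc(t,['i2']);})();"
first_scope  = enclosing_block_prefix(script, script.index("_setImagesSrc"))
second_scope = enclosing_block_prefix(script, script.rindex("_setImagesSrc"))
```

Scanning only the returned slice for assignments keeps a variable declared in a sibling block out of the lookup for the current call.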