Extract data from Google search result carousel into JSON format #386

Open
KiChjang wants to merge 7 commits into serpapi:master from KiChjang:master

KiChjang commented Jun 5, 2026

This PR implements a full HTML parser for the Van Gogh code challenge and hardens it through multiple iterations focused on brittleness reduction and regression safety. Resolution of the hidden thumbnail image src lives in its own dedicated class, DeferredImageExtractor.

Data structure assumptions

This parser is intentionally structure-based for the HTML part and behaviour-based for the deferred image source part (i.e. it matches variables to the actual function arguments of _setImagesSrc), but it still relies on a few stable assumptions about the Google search results HTML page and its inline scripts:

Carousel entry assumptions (DOM parser)

  • Entry container is an anchor: each extractable item is represented by an <a> node.
  • Search-link signature: candidate anchors must have an href containing both "/search?" and "q=".
  • Thumbnail presence: candidate anchors must contain an <img> descendant.
  • Label extraction model:
    • visible labels are leaf elements under the anchor (node.children.all?(&:text?))
    • first non-empty label is name
    • second non-empty label, if present, is extensions[0] in the output JSON
  • Output link normalization:
    • relative "/search?..." links are prefixed with https://www.google.com
    • HTML entities in links (e.g. &amp;) are unescaped
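The link and label rules above can be sketched in plain Ruby. This is a minimal illustration rather than the PR's actual code: the helper names (`search_link?`, `normalize_link`, `build_entry`) and the `"image"` output key are assumptions, and the labels are passed in as already-extracted strings instead of being read from DOM nodes.

```ruby
require "cgi"

GOOGLE_ORIGIN = "https://www.google.com" # prefix named in the list above

# True when an href carries the search-link signature ("/search?" and "q=").
def search_link?(href)
  !href.nil? && href.include?("/search?") && href.include?("q=")
end

# Unescape HTML entities first, then make relative links absolute.
def normalize_link(href)
  link = CGI.unescapeHTML(href)
  link.start_with?("/") ? "#{GOOGLE_ORIGIN}#{link}" : link
end

# Hypothetical assembly step: the first non-empty label becomes the name,
# the second (if any) becomes extensions[0].
def build_entry(labels, href, image)
  name, extension = labels.map(&:strip).reject(&:empty?)
  entry = { "name" => name, "link" => normalize_link(href), "image" => image }
  entry["extensions"] = [extension] if extension
  entry
end
```

For example, `build_entry(["The Starry Night", "1889"], "/search?a=1&amp;q=The+Starry+Night", nil)` would yield an entry whose link starts with `https://www.google.com/search?` and whose `extensions` is `["1889"]`.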

Image source assumptions (DOM + script parser)

  • Primary image source: if <img data-src> is present, it is the canonical image URL.
  • Deferred image fallback: if data-src is absent, the image is resolved through script mappings keyed by <img id>.
  • Known ID universe: only IDs present in img[id] nodes are considered valid targets.
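The precedence described above could look roughly like the following. It is only a sketch under stated assumptions: `img_attrs` is a hypothetical attribute hash for one `<img>` node, `deferred` is a hypothetical `{ image_id => data_uri }` hash produced by the script parser, and neither helper name comes from the PR.

```ruby
# Prefer data-src; otherwise fall back to the deferred mapping keyed by id.
def resolve_image(img_attrs, deferred)
  data_src = img_attrs["data-src"]
  return data_src if data_src && !data_src.empty?

  id = img_attrs["id"]
  id ? deferred[id] : nil
end

# Keep only mappings whose ID actually appears on an <img id> node.
def restrict_to_known_ids(deferred, known_ids)
  deferred.select { |id, _uri| known_ids.include?(id) }
end
```

Filtering to known IDs before lookup means an ID that only occurs in a script, and never on an `<img>` node, cannot leak into the output.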

Deferred script assumptions (JS-like parser)

  • Relevant scripts contain both "data:image" and "_setImagesSrc", the latter being the function to set the src attribute of the carousel thumbnail.
  • Mapping happens per _setImagesSrc(...) call (not per script block).
  • Data URI discovery:
    • either direct literal argument ('data:image...')
    • or identifier assigned from a data URI literal (var x = 'data:image...')
  • ID discovery:
    • either array literal argument containing quoted known image IDs
    • or identifier assigned from an array literal containing quoted known image IDs
  • Both identifiers (for the data URI and the image IDs), if they exist, are assigned in the enclosing block that contains the call to _setImagesSrc, and no code comment containing an open brace { appears between the opening brace of the enclosing block and the call to _setImagesSrc.
  • Argument order is not fixed:
    • parser scans all args and resolves URI/IDs by content/type, not position.
  • Escaped JS bytes are expected:
    • supports \xNN decoding in data URIs (e.g. \x3d -> =).
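To make the per-call resolution concrete, here is a rough regex-based sketch of the rules above. It is not the code in this PR: the constant and method names other than `unescape_js_string` are hypothetical, the body of `unescape_js_string` is a guess at its behaviour, and the scope is simplified to "all text before the call" rather than the enclosing block.

```ruby
CALL_RE  = /_setImagesSrc\(([^)]*)\)/
TOKEN_RE = /'[^']*'|"[^"]*"|\[[^\]]*\]|[A-Za-z_$][\w$]*/
VALUE_RE = /'[^']*'|"[^"]*"|\[[^\]]*\]/

# Decode \xNN escapes (e.g. \x3d -> =); assumes ASCII-range escapes only.
def unescape_js_string(str)
  str.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
end

# Literals pass through; identifiers become the literal last assigned in scope.
def resolve_token(token, scope)
  return token if token.match?(/\A['"\[]/)

  assign = /\b(?:var|let|const)\s+#{Regexp.escape(token)}\s*=\s*(#{VALUE_RE.source})/
  found = scope.scan(assign).last
  found && found.first
end

# Returns { image_id => data_uri }, resolved separately for every call.
def deferred_images(script, known_ids)
  mapping = {}
  script.scan(CALL_RE) do
    call = Regexp.last_match
    scope = call.pre_match # simplification: text before the call
    values = call[1].scan(TOKEN_RE).map { |t| resolve_token(t, scope) }.compact
    # Classify by content, not position: a data URI string vs. an array of IDs.
    uri = values.find { |v| v.match?(/\A['"]data:image/) }
    ids = values.select { |v| v.start_with?("[") }
                .flat_map { |v| v.scan(/['"]([^'"]+)['"]/).flatten }
                .select { |id| known_ids.include?(id) }
    next if uri.nil? || ids.empty?

    decoded = unescape_js_string(uri[1..-2]) # strip the surrounding quotes
    ids.each { |id| mapping[id] = decoded }
  end
  mapping
end
```

Because each call resolves its own URI and IDs, two calls in one script block cannot share a URI by accident, and swapping the argument order does not change the result.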

Iterations

  1. Initial working parser (class-based)
    Used known Google CSS classes for name/date/image. Functional, but brittle.

  2. Moved to structure-based extraction
    Replaced class dependence with semantic structure matching (<a> + <img> + leaf text nodes), making DOM matching more resilient to class churn.

  3. Handled escaped script payloads correctly
    Found mismatch in base64 output due to JS escaping (\x3d instead of =).
    Added unescape_js_string to decode escaped bytes.

  4. Reduced brittle regex dependence on variable names/order
    Reworked script extraction away from strict var s=...; var ii=[...] assumptions.

  5. Bug found during review: one URI accidentally applied to all IDs in a script
    Corrected by mapping per _setImagesSrc(...) call, resolving IDs and URI from literals/assignments for that call.

  6. Added regression fixture to lock this behavior in
    Introduced fixture with multiple mappings in one script and varying argument order so this class of bug is now test-detectable.

  7. Expanded regex patterns to include variable assignment keywords
    Added let and const alongside var when scanning for data URI and image ID assignments, so a sudden upstream change of declaration keyword does not break the scan.

  8. Tightened the data URI and image ID scan to the enclosing block
    Optimized the scan by localizing it to the enclosing block of _setImagesSrc, which also prevents a same-named identifier in another block from being picked up by mistake.

  9. Extracted deferred image resolution logic into its own class
    Improved maintainability and testability by separating the logic-heavy deferred image resolution into DeferredImageExtractor, complete with tests and fixtures.
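As an illustration of iteration 8, the enclosing-block scan can be approximated by walking backwards from the call and counting braces. This is a sketch, not the PR's implementation: `enclosing_block_prefix` is a hypothetical name, and it treats every `{` and `}` as structural, which is why a comment containing an open brace before the call would mislead it.

```ruby
# Return the text between the nearest unclosed "{" before call_index and the call.
def enclosing_block_prefix(source, call_index)
  depth = 0
  (call_index - 1).downto(0) do |i|
    case source[i]
    when "}" then depth += 1 # a nested block that closed before the call
    when "{"
      return source[(i + 1)...call_index] if depth.zero?

      depth -= 1 # matching opener of a nested, already closed block
    end
  end
  source[0...call_index] # no enclosing brace: fall back to the whole prefix
end
```

With two functions that each declare `var s`, the prefix returned for the second call contains only the second declaration, so the first function's `s` cannot be resolved by mistake.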

KiChjang commented Jun 5, 2026

Added a StructuralMismatchException class. It is raised whenever the parsed HTML violates any of the data structure assumptions detailed in the PR description, so monitoring tools can watch for this specific exception and complain loudly when it occurs.
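A minimal sketch of how such an exception might be raised and observed is shown below. Only the class name comes from this PR. The `StandardError` superclass, the `validate_entry!` helper, and the specific checks are assumptions made for illustration.

```ruby
# Raised when parsed HTML violates one of the documented assumptions.
# Inheriting from StandardError is an assumption, not confirmed by the PR.
class StructuralMismatchException < StandardError; end

# Hypothetical check: every carousel entry needs a name and an image source.
def validate_entry!(entry)
  raise StructuralMismatchException, "entry has no name" if entry["name"].to_s.empty?
  raise StructuralMismatchException, "no image source for #{entry['name']}" if entry["image"].nil?

  entry
end

begin
  validate_entry!({ "name" => "Irises", "image" => nil })
rescue StructuralMismatchException => e
  warn "structural mismatch: #{e.message}" # a monitor could alert on this
end
```

Rescuing the dedicated class rather than a generic error lets alerting distinguish an upstream markup change from an ordinary bug.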
