Extract data from Google search result carousel into JSON format by KiChjang · Pull Request #386 · serpapi/code-challenge

KiChjang · 2026-06-05T07:57:52Z

This PR implements a full HTML parser for the Van Gogh code challenge and hardens it over multiple iterations focused on reducing brittleness and guarding against regressions. Resolution of the hidden thumbnail image src lives in its own dedicated class, DeferredImageExtractor.

Data structure assumptions

This parser is intentionally structure-based for the HTML part and behaviour-based for the deferred image source part (i.e. it matches variables to the actual function arguments of _setImagesSrc). It still relies on a few stable assumptions about the Google search results HTML page and its inline scripts:

Carousel entry assumptions (DOM parser)

Entry container is an anchor: each extractable item is represented by an <a> node.
Search-link signature: candidate anchors must have an href containing both "/search?" and "q=".
Thumbnail presence: candidate anchors must contain an <img> descendant.
Label extraction model:
- visible labels are leaf elements under the anchor (node.children.all?(&:text?))
- first non-empty label is name
- second non-empty label, if present, is extensions[0] in the output JSON
Output link normalization:
- relative "/search?..." links are prefixed with https://www.google.com
- HTML entities in links (e.g. &amp;) are unescaped
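The entry rules above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the helper names (candidate_href?, normalize_link, entry_fields) are hypothetical, and labels are passed in as plain strings rather than read from DOM nodes.

```ruby
require "cgi"

GOOGLE_ORIGIN = "https://www.google.com"

# Search-link signature: the href must contain both "/search?" and "q=".
def candidate_href?(href)
  !href.nil? && href.include?("/search?") && href.include?("q=")
end

# Unescape HTML entities, then prefix relative search links with the origin.
def normalize_link(href)
  link = CGI.unescapeHTML(href)
  link.start_with?("/search?") ? "#{GOOGLE_ORIGIN}#{link}" : link
end

# First non-empty label becomes name; the second, if present, extensions[0].
def entry_fields(labels)
  texts = labels.map(&:strip).reject(&:empty?)
  fields = { "name" => texts[0] }
  fields["extensions"] = [texts[1]] if texts[1]
  fields
end
```

An entry with only one non-empty label simply has no "extensions" key in this sketch.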

Image source assumptions (DOM + script parser)

Primary image source: if <img data-src> is present, it is the canonical image URL.
Deferred image fallback: if data-src is absent, image is resolved by <img id> via script mappings.
Known ID universe: only IDs present in img[id] nodes are considered valid targets.
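The resolution order described above could look roughly like the following. It is a sketch under stated assumptions: resolve_image_source is a hypothetical name, the img element is modelled as a plain Hash of its attributes, and deferred_sources stands in for the ID-to-URI map produced by the script parser.

```ruby
# img_attrs:        attributes of one <img>, e.g. { "id" => "img_1" }
# deferred_sources: { image_id => data_uri } built from the inline scripts
def resolve_image_source(img_attrs, deferred_sources)
  data_src = img_attrs["data-src"]
  # data-src, when present, is treated as the canonical image URL.
  return data_src unless data_src.nil? || data_src.empty?

  # Otherwise fall back to the script mapping, keyed by the img id.
  id = img_attrs["id"]
  id.nil? ? nil : deferred_sources[id]
end
```

Returning nil when neither source is available leaves the decision (skip the image, or flag the entry) to the caller.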

Deferred script assumptions (JS-like parser)

Relevant scripts contain both "data:image" and "_setImagesSrc", the latter being the function to set the src attribute of the carousel thumbnail.
Mapping happens per _setImagesSrc(...) call (not per script block).
Data URI discovery:
- either direct literal argument ('data:image...')
- or identifier assigned from a data URI literal (var x = 'data:image...')
ID discovery:
- either array literal argument containing quoted known image IDs
- or identifier assigned from an array literal containing quoted known image IDs
Both identifiers (for the data URI and for the image IDs), if they exist, are declared in the enclosing block that contains the call to _setImagesSrc, and no code comment containing an open brace { appears between the opening brace of that enclosing block and the call to _setImagesSrc.
Argument order is not fixed:
- parser scans all args and resolves URI/IDs by content/type, not position.
Escaped JS bytes are expected:
- supports \xNN decoding in data URIs (e.g. \x3d -> =).
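A simplified sketch of per-call resolution, assuming regex-level scanning is acceptable. It is not the PR's DeferredImageExtractor: all names here are hypothetical, identifiers are resolved against the nearest preceding var/let/const assignment rather than a true enclosing block, multi-variable declarations are not handled, and \xNN decoding is left out.

```ruby
CALL_RE = /_setImagesSrc\(([^)]*)\)/       # assumes no ")" inside the arguments
URI_LIT = /['"](data:image[^'"]*)['"]/     # quoted data URI literal
ARR_LIT = /\[([^\]]*)\]/                   # array literal body

# Nearest assignment `var|let|const name = <value>` before the call, if any.
def last_assigned(source, name, value_re)
  decl = /\b(?:var|let|const)\s+#{Regexp.escape(name)}\s*=\s*#{value_re}/
  source.scan(decl).flatten.last
end

# Returns { image_id => data_uri }, resolved separately for every call.
def deferred_sources(script, known_ids)
  mapping = {}
  script.scan(CALL_RE) do
    match = Regexp.last_match
    args = match[1]
    preceding = match.pre_match

    # Literal arguments are recognised by content, not by position.
    uri = args[URI_LIT, 1]
    ids_src = args[ARR_LIT, 1]

    # Whatever remains after stripping literals is treated as identifiers.
    rest = args.gsub(URI_LIT, "").gsub(/\[[^\]]*\]/, "")
    rest.scan(/[A-Za-z_$][\w$]*/).each do |name|
      uri ||= last_assigned(preceding, name, URI_LIT)
      ids_src ||= last_assigned(preceding, name, ARR_LIT)
    end
    next if uri.nil? || ids_src.nil?

    # Only IDs that exist as img[id] in the DOM are valid targets.
    ids = ids_src.scan(/['"]([^'"]+)['"]/).flatten & known_ids
    ids.each { |id| mapping[id] = uri }
  end
  mapping
end
```

Because each call is resolved on its own, two calls in one script with swapped argument order map to their own URIs instead of sharing one.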

Iterations

Initial working parser (class-based)
- Used known Google CSS classes for name/date/image. Functional, but brittle.
Moved to structure-based extraction
- Replaced class dependence with semantic structure matching (<a> + <img> + leaf text nodes), making DOM matching more resilient to class churn.
Handled escaped script payloads correctly
- Found a mismatch in the base64 output caused by JS escaping (\x3d instead of =).
- Added unescape_js_string to decode escaped bytes.
Reduced brittle regex dependence on variable names/order
- Reworked script extraction away from strict var s=...; var ii=[...] assumptions.
Bug found during review: one URI accidentally applied to all IDs in a script
- Corrected by mapping per _setImagesSrc(...) call, resolving IDs and URI from literals/assignments for that call.
Added regression fixture to lock this behavior in
- Introduced a fixture with multiple mappings in one script and varying argument order, so this class of bug is now caught by the tests.
Expanded regex patterns to include variable assignment keywords
- Added let and const as keywords to look for when scanning for data URIs and image IDs, so the scan tolerates upstream changes to the assignment keyword.
Tightened the data URI and image ID scan to the enclosing block
- Localized the scan to the block enclosing the _setImagesSrc call, which narrows the search and prevents identifier shadowing from resolving the wrong variable.
Extracted deferred image resolution logic into its own class
- Improved maintainability and testability by separating the logic-heavy deferred image resolution into DeferredImageExtractor, complete with tests and fixtures.
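The escaped-payload fix in the iterations above amounts to decoding \xNN sequences before the data URI is emitted. One possible shape for such a helper is sketched below; decode_hex_escapes is a hypothetical name, not the PR's unescape_js_string, and other JS escapes (\uNNNN, \n and so on) are deliberately ignored.

```ruby
# Replace every \xNN escape with the character U+00NN, matching how a
# JavaScript string literal interprets it. Other escapes are left untouched.
def decode_hex_escapes(js_string)
  js_string.gsub(/\\x([0-9A-Fa-f]{2})/) do
    [Regexp.last_match(1).hex].pack("U")
  end
end
```

For example, a payload ending in \x3d\x3d decodes to one ending in ==, which restores valid base64 padding.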

KiChjang · 2026-06-05T08:32:11Z

Added a StructuralMismatchException class. It is raised whenever the parsed HTML violates any of the data structure assumptions detailed in the PR description. Monitoring tools can then watch for this specific exception and alert loudly when it is raised.
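A dedicated exception of this kind might be defined and raised as follows. This is only an illustration: the superclass, the message, and the extract_name check are assumptions, and which violations actually raise in the PR is not shown here.

```ruby
# Assumed definition: a plain StandardError subclass so that callers and
# monitoring can rescue or filter on it by name.
class StructuralMismatchException < StandardError; end

# Hypothetical check: an entry without any non-empty label breaks the
# "first non-empty label is name" assumption.
def extract_name(labels)
  name = labels.map(&:strip).reject(&:empty?).first
  if name.nil?
    raise StructuralMismatchException, "carousel entry has no non-empty label"
  end
  name
end
```

A caller can rescue StructuralMismatchException specifically and let unrelated errors propagate unchanged.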

KiChjang added 2 commits June 5, 2026 15:46

Initial code challenge commit

bc3e211

Add StructuralMismatchException class and tests for it

08e7ef8

KiChjang added 5 commits June 5, 2026 16:43

Skip on false-positive instead of raising exception

da30674

Capture more variable assignment keywords

68450c8

Add more code comments

6b268a9

Parse only from enclosing block for _setImageSrc arguments

69005b6

Extract deferred image parsing logic into its own class

6baabe6

KiChjang wants to merge 7 commits into serpapi:master from KiChjang:master