
Sebastian Jimenez Solution #387

Open
sebasjimenez10 wants to merge 1 commit into serpapi:master from sebasjimenez10:sj/artwork-search-scrapping


@sebasjimenez10

Extract a Google paintings carousel → JSON

Parses the saved Google knowledge-graph paintings carousel into { "artworks": [...] } (name, link, image, optional extensions/date), directly from the local HTML. No extra HTTP requests.
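The output shape described above could be sketched roughly as follows. This is an illustration only: `ArtworkSketch` and `build_result` are hypothetical names, the sample values are made up, and it assumes `extensions` is an array that carries the date when one is present and is omitted otherwise. The PR's actual `Artwork` and `SearchResult` classes may differ.

```ruby
require "json"

# Hypothetical value object; not the PR's real Artwork class.
ArtworkSketch = Struct.new(:name, :link, :image, :extensions, keyword_init: true) do
  # Serialize to a hash, leaving out the optional key when it is nil or empty.
  def to_h
    hash = {"name" => name, "link" => link, "image" => image}
    hash["extensions"] = extensions unless extensions.nil? || extensions.empty?
    hash
  end
end

# Wrap the items in the top-level { "artworks": [...] } envelope.
def build_result(artworks)
  {"artworks" => artworks.map(&:to_h)}
end

# Placeholder data, not taken from any real page.
with_date = ArtworkSketch.new(
  name: "Example A", link: "https://example.test/a", image: nil, extensions: ["1889"]
)
without_date = ArtworkSketch.new(
  name: "Example B", link: "https://example.test/b", image: nil
)

puts JSON.pretty_generate(build_result([with_date, without_date]))
```

Dropping the key entirely, rather than emitting an empty array, is one way to keep items without a date minimal. Whether the PR does this is an assumption.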

Approach

  • Structural parsing, not CSS classes. Items are found as stick= search links wrapping an <img>, since Google rotates its obfuscated class names (a class-based parser returns 0 items on a newer page).
  • Images resolved offline from inline _setImagesSrc base64 (matched by img id) or the data-src thumbnail URL.
  • Small, focused objects: Parser, ImageResolver, Artwork, SearchResult.
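The offline image lookup in the second bullet might look something like the sketch below. It uses only the Ruby standard library and assumes the inline scripts have the shape `var s='data:image/...';var ii=['img_id'];_setImagesSrc(ii,s);`, with characters such as `=` escaped as `\xNN`. Both points are assumptions about the saved pages, and `ImageResolverSketch` is a hypothetical name rather than the PR's `ImageResolver`.

```ruby
# Hypothetical resolver; the PR's ImageResolver may be structured differently.
class ImageResolverSketch
  # Assumed script shape:
  #   var s='data:image/...';var ii=['img_id'];_setImagesSrc(ii,s);
  SCRIPT_PATTERN =
    /var s\s*=\s*'(data:image[^']+)';\s*var ii\s*=\s*\[([^\]]*)\];\s*_setImagesSrc\(ii,\s*s\)/

  def initialize(html)
    @inline = {}
    html.scan(SCRIPT_PATTERN) do |data_uri, ids|
      # Assumption: the JS string escapes some characters as \xNN (e.g. \x3d for "=").
      decoded = data_uri.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
      # One payload may be assigned to several img ids.
      ids.scan(/'([^']+)'/).flatten.each { |id| @inline[id] = decoded }
    end
  end

  # Prefer the inline base64 payload for this img id, else the data-src URL.
  def resolve(img_id, data_src = nil)
    @inline[img_id] || data_src
  end
end

# Made-up sample; a single-quoted heredoc keeps the \x3d escape literal.
SAMPLE_HTML = <<~'HTML'
  <script>(function(){var s='data:image/jpeg;base64,QUI\x3d';var ii=['dimg_1'];_setImagesSrc(ii,s);})();</script>
  <img id="dimg_1"><img id="dimg_2" data-src="https://example.test/thumb.jpg">
HTML

resolver = ImageResolverSketch.new(SAMPLE_HTML)
puts resolver.resolve("dimg_1")
puts resolver.resolve("dimg_2", "https://example.test/thumb.jpg")
```

Item discovery is left out here. With Nokogiri it might be an XPath along the lines of `//a[contains(@href, "stick=")][.//img]`, though the PR's actual selector is not shown in this description.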

Validation

Tested against 139 artworks across 3 real pages, with 100% line/branch coverage. The pages are van-gogh (exact match to the SerpApi-provided example), plus picasso and leonardo (generated, spot-verified snapshots and independent counts).

Usage

  • bin/extract files/van-gogh-paintings.html prints the JSON.
  • bundle exec rake runs the specs and the linter.
  • bin/console opens an IRB session with the classes loaded.

See INSTRUCTIONS.md.

Tooling

RSpec (unit + integration), Nokogiri, SimpleCov (100% line/branch), standardrb, Rake. See DESIGN.md for the full approach.

