6 changes: 6 additions & 0 deletions .gitignore
@@ -49,3 +49,9 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store

# SimpleCov files
coverage/

# Claude files
.claude/
2 changes: 2 additions & 0 deletions .rspec
@@ -0,0 +1,2 @@
--require spec_helper
--format documentation
1 change: 1 addition & 0 deletions .ruby-version
@@ -0,0 +1 @@
4.0.0
156 changes: 156 additions & 0 deletions DESIGN.md
@@ -0,0 +1,156 @@
# Solution Design

Extract a Google knowledge-graph paintings carousel from a saved results page into a JSON array. The HTML file is parsed directly — **no additional HTTP requests**.

## Approach summary

The carousel is parsed **structurally, not by CSS class**, because Google rotates its obfuscated class names: none of the van-gogh page's carousel classes appear in the newer Picasso page, so a class-based parser returns **0** items there. Three small objects do the work:

- **`Parser`** — finds each item as a `stick=` search link that wraps an `<img>`, then reads `name` (the img `alt`), `extensions` (caption rows after the title), and `link` (the absolutized `href`).
- **`ImageResolver`** — supplies the image without any network call: inline base64 (injected by `_setImagesSrc` scripts, keyed by img `id`) or the lazy `data-src` thumbnail URL.
- **`Artwork` / `SearchResult`** — render the `{ "artworks": [...] }` output.

Validated against three real pages: van-gogh (47 items, exact match to the SerpApi-provided example), Picasso (45 items), and Leonardo da Vinci (47 items) — the latter two via generated, spot-verified snapshots plus independent counts.

## 1. Output schema

```jsonc
{
"artworks": [
{
"name": "The Starry Night",
"link": "https://www.google.com/search?...",
"image": "data:image/jpeg;base64,...", // or an https URL, or null
"extensions": ["1889"] // optional — omitted when there is no date
}
]
}
```

- `name` — String. The painting title.
- `link` — String. Absolute Google search URL.
- `image` — String or null. Inline base64 data URI, a `gstatic` thumbnail URL, or `null` if neither is present in the file.
- `extensions` — Array of String. Models SerpApi's field; on this page it holds at most the date (e.g. `"1889"`). **The key is omitted entirely when no date is present** (4 of 47 items), matching `van-gogh-expected-array.json`.
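
The omission rule for `extensions` can be sketched as follows. This is a minimal illustration, assuming a `Struct`-based value object; the PR's actual `Artwork` class is not shown in this diff, so the class body, the key order, and the sample links are assumptions.

```ruby
require "json"

# Hypothetical value object; field names follow the schema above.
Artwork = Struct.new(:name, :extensions, :link, :image, keyword_init: true) do
  # Render to the output hash, adding "extensions" only when it is non-empty.
  def to_h
    hash = {"name" => name, "link" => link, "image" => image}
    hash["extensions"] = extensions unless extensions.empty?
    hash
  end
end

# Illustrative sample data, not taken from the fixtures.
dated = Artwork.new(name: "The Starry Night", extensions: ["1889"],
  link: "https://www.google.com/search?q=The+Starry+Night", image: nil)
undated = Artwork.new(name: "Untitled", extensions: [],
  link: "https://www.google.com/search?q=Untitled", image: nil)

puts JSON.pretty_generate("artworks" => [dated, undated].map(&:to_h))
```

Keeping `extensions` as an array on the object and dropping the key only while rendering means callers never have to nil-check it.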

## 2. How an item is recognized

Each painting is one carousel cell — a single `<a>` anchor, annotated below. The parser matches on its **structure**, never on class names, which change over time (Section 4):

```html
<a href="/search?...&q=The+Starry+Night&stick=H4sI..."> <!-- (1) a "stick=" search link... -->
<img alt="The Starry Night" <!-- (2) ...wrapping an <img>; alt = name -->
id="_L_FkZ...63" <!-- id -> inline base64 (first 8) -->
data-src="https://encrypted-tbn0.gstatic..." <!-- data-src -> thumbnail URL (the rest) -->
src="data:image/gif;base64,...1x1 placeholder..."/>
<div> <!-- (3) caption = stack of leaf <div> rows -->
<div>The Starry Night</div> <!-- row 1 = title -> name -->
<div>1889</div> <!-- row 2 = metadata -> extensions -->
</div>
</a>
```

These four signals — a `stick=` link wrapping an `<img>`, the `alt`, the caption leaf rows, and the img `id`/`data-src` — appear on every capture. Only the class names differ between pages, which is exactly why we don't match on them:

| Structural role | van-gogh class | picasso class | leonardo class |
| --------------- | -------------- | ------------- | -------------- |
| item wrapper | `iELo6` | `TILZre` | `TILZre` |
| thumbnail img | `taFZJe` | `pHjwVc` | `pHjwVc` |
| caption box | `KHK6lb` | `Y5eSNd` | `Y5eSNd` |
| title row | `pgNMRc` | `yfEcJe` | `yfEcJe` |
| date row | `cxzHyb` | `DWyOHb` | `DWyOHb` |

Picasso and Leonardo share one capture generation (identical classes); van-gogh is an older one. The structural parser handles all three unchanged.

### Field sources

| Field | Source | Handling |
| ------------ | ----------------------------------------------- | ------------------------------------------------------------------------------------ |
| `name` | `img@alt` | Collapse internal whitespace/newlines. |
| `extensions` | caption leaf divs after the title (e.g. `1937`) | Array of the remaining rows; `[]` when absent or blank (key omitted at render time). |
| `link` | `a@href` | Relative `/search?...` → prepend `https://www.google.com`; decode HTML entities. |
| `image` | see Section 3 (ImageResolver) | The img's own `src` is a throwaway 1×1 gif — always ignored. |
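
The handling column above could look roughly like this in plain Ruby. It is a sketch under stated assumptions: the helper names are hypothetical, the input is already-extracted strings rather than Nokogiri nodes, and the `&amp;` replacement is a simplified stand-in for full HTML entity decoding.

```ruby
GOOGLE_HOST = "https://www.google.com"

# Collapse runs of whitespace (including newlines) into single spaces.
def normalize_text(text)
  text.gsub(/\s+/, " ").strip
end

# Turn a relative "/search?..." href into an absolute URL.
# Only "&amp;" is decoded here; real entity decoding is broader.
def absolutize_link(href)
  GOOGLE_HOST + href.gsub("&amp;", "&")
end

# Rows after the title become extensions; blank rows are dropped.
def extensions_from(caption_rows)
  caption_rows.drop(1).map { |row| normalize_text(row) }.reject(&:empty?)
end

p normalize_text("The Starry\n   Night ")
p absolutize_link("/search?q=Guernica&amp;stick=H4sI")
p extensions_from(["The Starry Night", "1889"])
```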

### Image delivery: two mechanisms

1. **Inline base64.** The `<img>` has an `id` and no `data-src`. The real JPEG is injected by a script block elsewhere in the document:

```js
(function () {
var s = "data:image/jpeg;base64,/9j/4AAQ...";
var ii = ["_L_FkZ4qlAtyDwbkP49Pj0QU_63"];
var r = "";
_setImagesSrc(ii, s, r);
})();
```

Each `_setImagesSrc` block pairs one base64 string (`s`) with a single image id (the sole `ii` entry); the resolver maps that id to the base64. The base64 lives inside a JS string literal where the `=` padding is hex-escaped as `\x3d` (the only escape seen across captures); it must be unescaped to match the expected output.

2. **Lazy-loaded URL.** The `<img>` carries `data-src="https://encrypted-tbn{0..3}.gstatic.com/images?q=tbn:..."`. We record the URL string as-is (no fetch).

Counts confirm the model against `files/van-gogh-expected-array.json`: 47 carousel items = 8 base64 + 39 `gstatic` URLs = 47 artworks.
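
To make the inline-base64 mechanism concrete, here is a minimal sketch of scanning script text for `_setImagesSrc` pairs and unescaping `\x3d`. The regular expression, the helper names, and the shortened sample payload are assumptions based on the single block quoted above, not the PR's actual code.

```ruby
# Matches one `var s = "..."; var ii = ["..."]` pair inside a script body.
# Assumed shape: one data URI followed by a single-entry id array.
SET_IMAGES_PATTERN = /var\s+s\s*=\s*"([^"]*)";\s*var\s+ii\s*=\s*\["([^"]+)"\]/

# Convert JavaScript \xNN escapes (e.g. \x3d for "=") to literal characters.
def unescape_js_hex(text)
  text.gsub(/\\x(\h{2})/) { $1.hex.chr }
end

# Build an id => data-URI map from concatenated script text.
def base64_map(script_text)
  script_text.scan(SET_IMAGES_PATTERN).to_h do |data_uri, id|
    [id, unescape_js_hex(data_uri)]
  end
end

# Illustrative script body; the base64 payload is a made-up short sample.
script = <<~'JS'
  (function () {
  var s = "data:image/jpeg;base64,/9j/4AAQ\x3d\x3d";
  var ii = ["_L_FkZ4qlAtyDwbkP49Pj0QU_63"];
  var r = "";
  _setImagesSrc(ii, s, r);
  })();
JS

p base64_map(script)
```

Unescaping only `\xNN` mirrors the observation that `\x3d` is the only escape seen across captures; a page using other escape forms would need a broader decoder.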

## 3. Entities

1. **`ImageResolver`** — Pre-scans the document's `<script>` text for `_setImagesSrc(...)` blocks and builds an `id → base64` map. Resolves an `<img>` to its image: base64 from the map (by `id`), else `data-src`, else `nil`.
2. **`Parser`** — Owns the Nokogiri document. Locates carousel items and, for each, extracts `name`, `extensions`, `link`, delegating the image to `ImageResolver`. Returns a collection of `Artwork`.
3. **`Artwork`** — Plain value object (`name`, `extensions`, `link`, `image`) that knows how to render itself to the output hash / JSON.
4. **`SearchResult`** — Thin facade and entry point: `SearchResult.from_file(path)` (or `.new(html)`) exposes `artworks` and renders `to_h` / `to_json` as `{ "artworks" => [...] }`.

> The classes are top-level for simplicity. In a larger codebase they'd be wrapped in a module (e.g. `ArtworkCarousel::Parser`) to avoid polluting the global namespace; it felt like overkill for four small files here.
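
The resolution order described for `ImageResolver` (base64 by `id`, else `data-src`, else `nil`) might be sketched like this. The class name `ImageResolverSketch` is hypothetical, and a plain Hash stands in for the `<img>` element: the method only needs `[]`, which both a Hash and a Nokogiri element provide.

```ruby
# Hypothetical resolver. `base64_by_id` is the id => data-URI map built
# from the `_setImagesSrc` scripts.
class ImageResolverSketch
  def initialize(base64_by_id)
    @base64_by_id = base64_by_id
  end

  # Inline base64 by id first, then the lazy data-src URL, else nil.
  def resolve(img)
    @base64_by_id[img["id"]] || img["data-src"]
  end
end

# Illustrative ids and URL, not taken from the fixtures.
resolver = ImageResolverSketch.new("img_1" => "data:image/jpeg;base64,AAAA")
inline = resolver.resolve({"id" => "img_1"})
lazy = resolver.resolve({"id" => "img_2", "data-src" => "https://encrypted-tbn0.gstatic.com/images?q=tbn:EXAMPLE"})
missing = resolver.resolve({})

p inline, lazy, missing
```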

## 4. Parsing flow

1. Read the HTML file; build a Nokogiri document.
2. `ImageResolver` builds the `id → base64` map from `_setImagesSrc` scripts.
3. Select carousel items structurally: `a[href*='stick=']` anchors that wrap an `<img>`. On all three fixtures this isolates the carousel exactly (47 / 45 / 47) — the other `stick=` links are knowledge-graph/related-search links that don't wrap an image.
4. Build an `Artwork` per anchor (fields per Section 2), resolving its image via `ImageResolver`.
5. Serialize to `{ "artworks" => [...] }`.

### Robustness for other layouts

The four structural signals are stable across captures; the class names are not (Section 2). That is why the parser matches on structure and handles all three pages unchanged (47/47, 45/45, 47/47). The `[data-attrid$=":works"]` container also survives every capture, so it could serve as an optional sanity guard. We did not add it, because it is paintings-specific and the structural rule alone already isolates the carousel exactly. The selector logic is isolated in `Parser`, so per-layout tweaks stay in one place.

## 5. Test plan (RSpec)

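- **Schema/shape:** every `Artwork` exposes the four fields, and `extensions` is always an array on the object (`[]` when there is no date); only the rendered output omits the `extensions` key for dateless items (Section 1).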
- **Golden file:** parsing `files/van-gogh-paintings.html` equals `files/van-gogh-expected-array.json` (count = 47, and field-by-field equality).
- **Image coverage:** every page is exactly 8 base64 images plus the rest as `gstatic` URLs, with no `nil` — van-gogh 8 + 39 (47), picasso 8 + 37 (45), leonardo 8 + 39 (47).
- **Link normalization:** links are absolute `https://www.google.com/...`.
- **Second layout (`pablo-picasso-paintings.html`):** a more recent capture with rotated class names. No SerpApi example, so we use a generated, spot-verified snapshot (`pablo-picasso-expected-array.json`) for exact-match regression, plus independent inspection-derived counts (45 items, all named with absolute links, 39 dated) and a _Guernica_ spot-check.
- **Third page (`leonardo-da-vinci-paintings.html`):** a different artist/item set, parsed with no code changes. Snapshot `leonardo-da-vinci-expected-array.json` for exact-match regression, plus counts (47 items, 34 dated) and a _Salvator Mundi_ spot-check; _Vitruvian Man_ exercises the omitted-extensions (dateless) path.
- **Class independence:** a fixture with arbitrary/unknown class names still parses.
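
The image-coverage check can be expressed as a small tally over the rendered output. This is a sketch with hypothetical helper names and made-up sample data; the actual specs in the PR are not shown in this diff.

```ruby
# Classify a rendered image value as :base64, :url, :missing, or :unknown.
def image_kind(image)
  case image
  when nil then :missing
  when %r{\Adata:image/} then :base64
  when %r{\Ahttps://} then :url
  else :unknown
  end
end

# Count artworks per image kind, e.g. {base64: 8, url: 39} for van-gogh.
def image_coverage(artworks)
  artworks.map { |artwork| image_kind(artwork["image"]) }.tally
end

# Illustrative sample, not taken from the fixtures.
sample = [
  {"name" => "A", "image" => "data:image/jpeg;base64,AAAA"},
  {"name" => "B", "image" => "https://encrypted-tbn0.gstatic.com/images?q=tbn:EXAMPLE"},
  {"name" => "C", "image" => nil}
]

p image_coverage(sample)
```

A spec could then compare the tally for each fixture against the expected split and assert that no `:missing` or `:unknown` key appears.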

### Test coverage

SimpleCov (started in `spec/spec_helper.rb`, report written to `coverage/`):

- **Line coverage: 100.0%** (73 / 73)
- **Branch coverage: 100.0%** (6 / 6)

Branch coverage drove a cleanup: SimpleCov flagged four conditionals never met by any of the three captures — a nil-`img` guard, a `\u`/`\<char>` escape fallback (the pages only use `\x`), an empty-`alt` name fallback, and an absolute-`href` passthrough (every real `href` is relative). Each was speculative defensiveness unsupported by the evidence (139/139 items), so all four were removed, which also simplified the code.

## 6. Tech choices

**Ruby + RSpec + Nokogiri**, per the README's suggestion. Pure offline parsing — no network at runtime.

### Resources

- [README.md](./README.md) — challenge instructions
- `files/van-gogh-paintings.html` → input · `van-gogh-expected-array.json` → example (SerpApi-provided)
- `files/pablo-picasso-paintings.html` → input · `pablo-picasso-expected-array.json` → snapshot (generated, spot-verified)
- `files/leonardo-da-vinci-paintings.html` → input · `leonardo-da-vinci-expected-array.json` → snapshot (generated, spot-verified)

## 7. Versions

Developed and tested against:

| Tool | Version | Notes |
| --------- | ------- | -------------------------------------- |
| Ruby | 4.0.0 | language runtime |
| Bundler | 4.0.6 | dependency management (`BUNDLED WITH`) |
| Nokogiri | 1.19.3 | HTML parsing |
| RSpec | 3.13.2 | test framework |
| SimpleCov | 0.22.0 | test coverage |
| Standard | 1.54.0 | linter / formatter (`standardrb`) |

Gem versions are pinned in `Gemfile.lock`; `Gemfile` constrains Nokogiri `~> 1.19`, Rake `~> 13.0`, RSpec `~> 3.13`, SimpleCov `~> 0.22`, and Standard `~> 1.0`. Run `bundle install`, then `bundle exec rake` to run the specs and the linter together (the default task), or invoke them individually with `bundle exec rspec` and `bundle exec standardrb`.
11 changes: 11 additions & 0 deletions Gemfile
@@ -0,0 +1,11 @@
source "https://rubygems.org"

gem "nokogiri", "~> 1.19"

group :development, :test do
gem "irb"
gem "rake", "~> 13.0"
gem "rspec", "~> 3.13"
gem "simplecov", "~> 0.22", require: false
gem "standard", "~> 1.0"
end
181 changes: 181 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,181 @@
GEM
remote: https://rubygems.org/
specs:
ast (2.4.3)
date (3.5.1)
diff-lcs (1.6.2)
docile (1.4.1)
erb (6.0.4)
io-console (0.8.2)
irb (1.18.0)
pp (>= 0.6.0)
prism (>= 1.3.0)
rdoc (>= 4.0.0)
reline (>= 0.4.2)
json (2.19.8)
language_server-protocol (3.17.0.5)
lint_roller (1.1.0)
nokogiri (1.19.3-aarch64-linux-gnu)
racc (~> 1.4)
nokogiri (1.19.3-aarch64-linux-musl)
racc (~> 1.4)
nokogiri (1.19.3-arm-linux-gnu)
racc (~> 1.4)
nokogiri (1.19.3-arm-linux-musl)
racc (~> 1.4)
nokogiri (1.19.3-arm64-darwin)
racc (~> 1.4)
nokogiri (1.19.3-x86_64-darwin)
racc (~> 1.4)
nokogiri (1.19.3-x86_64-linux-gnu)
racc (~> 1.4)
nokogiri (1.19.3-x86_64-linux-musl)
racc (~> 1.4)
parallel (1.28.0)
parser (3.3.11.1)
ast (~> 2.4.1)
racc
pp (0.6.3)
prettyprint
prettyprint (0.2.0)
prism (1.9.0)
psych (5.4.0)
date
stringio
racc (1.8.1)
rainbow (3.1.1)
rake (13.4.2)
rdoc (7.2.0)
erb
psych (>= 4.0.0)
tsort
regexp_parser (2.12.0)
reline (0.6.3)
io-console (~> 0.5)
rspec (3.13.2)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.6)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.5)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.8)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.7)
rubocop (1.84.2)
json (~> 2.3)
language_server-protocol (~> 3.17.0.2)
lint_roller (~> 1.1.0)
parallel (~> 1.10)
parser (>= 3.3.0.2)
rainbow (>= 2.2.2, < 4.0)
regexp_parser (>= 2.9.3, < 3.0)
rubocop-ast (>= 1.49.0, < 2.0)
ruby-progressbar (~> 1.7)
unicode-display_width (>= 2.4.0, < 4.0)
rubocop-ast (1.49.1)
parser (>= 3.3.7.2)
prism (~> 1.7)
rubocop-performance (1.26.1)
lint_roller (~> 1.1)
rubocop (>= 1.75.0, < 2.0)
rubocop-ast (>= 1.47.1, < 2.0)
ruby-progressbar (1.13.0)
simplecov (0.22.0)
docile (~> 1.1)
simplecov-html (~> 0.11)
simplecov_json_formatter (~> 0.1)
simplecov-html (0.13.2)
simplecov_json_formatter (0.1.4)
standard (1.54.0)
language_server-protocol (~> 3.17.0.2)
lint_roller (~> 1.0)
rubocop (~> 1.84.0)
standard-custom (~> 1.0.0)
standard-performance (~> 1.8)
standard-custom (1.0.2)
lint_roller (~> 1.0)
rubocop (~> 1.50)
standard-performance (1.9.0)
lint_roller (~> 1.1)
rubocop-performance (~> 1.26.0)
stringio (3.2.0)
tsort (0.2.0)
unicode-display_width (3.2.0)
unicode-emoji (~> 4.1)
unicode-emoji (4.2.0)

PLATFORMS
aarch64-linux-gnu
aarch64-linux-musl
arm-linux-gnu
arm-linux-musl
arm64-darwin
x86_64-darwin
x86_64-linux-gnu
x86_64-linux-musl

DEPENDENCIES
irb
nokogiri (~> 1.19)
rake (~> 13.0)
rspec (~> 3.13)
simplecov (~> 0.22)
standard (~> 1.0)

CHECKSUMS
ast (2.4.3) sha256=954615157c1d6a382bc27d690d973195e79db7f55e9765ac7c481c60bdb4d383
date (3.5.1) sha256=750d06384d7b9c15d562c76291407d89e368dda4d4fff957eb94962d325a0dc0
diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
docile (1.4.1) sha256=96159be799bfa73cdb721b840e9802126e4e03dfc26863db73647204c727f21e
erb (6.0.4) sha256=38e3803694be357fe2bfe312487c74beaf9fb4e5beb3e22498952fe1645b95d9
io-console (0.8.2) sha256=d6e3ae7a7cc7574f4b8893b4fca2162e57a825b223a177b7afa236c5ef9814cc
irb (1.18.0) sha256=de9454a0703a54704b9811a5ef31a60c86949fbf4013fcf244fabc7c775248e3
json (2.19.8) sha256=6354310fd76ef69b87d5bd1f38b40d730613baf90b6803d2d0a48f618d32dfaa
language_server-protocol (3.17.0.5) sha256=fd1e39a51a28bf3eec959379985a72e296e9f9acfce46f6a79d31ca8760803cc
lint_roller (1.1.0) sha256=2c0c845b632a7d172cb849cc90c1bce937a28c5c8ccccb50dfd46a485003cc87
nokogiri (1.19.3-aarch64-linux-gnu) sha256=46b89e5d7b9e844c2ee360794240c6ea2a4e6fa0c5892a4ed487db621224b639
nokogiri (1.19.3-aarch64-linux-musl) sha256=8392dfdcd21be7a94dbbe9ccc138dea01b97b24cb2dc02a114ca98bfb1d9a0b7
nokogiri (1.19.3-arm-linux-gnu) sha256=3919d5ffc334ad778a4a9eb88fda7dcb8b1fb58c8a52ac640c6dcd2f038e774f
nokogiri (1.19.3-arm-linux-musl) sha256=9ce1cb6346bb9c67b1550eb537aa183ead91e4b6eadb2f36ade02d8dd2a79fb6
nokogiri (1.19.3-arm64-darwin) sha256=71b9bd424b1b7abc18b05052a1a3cfd3627abdca62be280854cc411791357e42
nokogiri (1.19.3-x86_64-darwin) sha256=77f3fba57d46c53ab31e62fc6c28f705109d1bf6264356c76f132b2be5728d4d
nokogiri (1.19.3-x86_64-linux-gnu) sha256=2f5078620fe12e83669b5b17311b32532a8153d02eee7ad06948b926d6080976
nokogiri (1.19.3-x86_64-linux-musl) sha256=248c906d2166eca5efb56d52fdee5f9a1f51d69a72e2b64fdac647b4ce39ea3f
parallel (1.28.0) sha256=33e6de1484baf2524792d178b0913fc8eb94c628d6cfe45599ad4458c638c970
parser (3.3.11.1) sha256=d17ace7aabe3e72c3cc94043714be27cc6f852f104d81aa284c2281aecc65d54
pp (0.6.3) sha256=2951d514450b93ccfeb1df7d021cae0da16e0a7f95ee1e2273719669d0ab9df6
prettyprint (0.2.0) sha256=2bc9e15581a94742064a3cc8b0fb9d45aae3d03a1baa6ef80922627a0766f193
prism (1.9.0) sha256=7b530c6a9f92c24300014919c9dcbc055bf4cdf51ec30aed099b06cd6674ef85
psych (5.4.0) sha256=14f72d69a611af663d7d70e4a7b67d9eb1f3ae9f8d916b478961d5a0075ba5b7
racc (1.8.1) sha256=4a7f6929691dbec8b5209a0b373bc2614882b55fc5d2e447a21aaa691303d62f
rainbow (3.1.1) sha256=039491aa3a89f42efa1d6dec2fc4e62ede96eb6acd95e52f1ad581182b79bc6a
rake (13.4.2) sha256=cb825b2bd5f1f8e91ca37bddb4b9aaf345551b4731da62949be002fa89283701
rdoc (7.2.0) sha256=8650f76cd4009c3b54955eb5d7e3a075c60a57276766ebf36f9085e8c9f23192
regexp_parser (2.12.0) sha256=35a916a1d63190ab5c9009457136ae5f3c0c7512d60291d0d1378ba18ce08ebb
reline (0.6.3) sha256=1198b04973565b36ec0f11542ab3f5cfeeec34823f4e54cebde90968092b1835
rspec (3.13.2) sha256=206284a08ad798e61f86d7ca3e376718d52c0bc944626b2349266f239f820587
rspec-core (3.13.6) sha256=a8823c6411667b60a8bca135364351dda34cd55e44ff94c4be4633b37d828b2d
rspec-expectations (3.13.5) sha256=33a4d3a1d95060aea4c94e9f237030a8f9eae5615e9bd85718fe3a09e4b58836
rspec-mocks (3.13.8) sha256=086ad3d3d17533f4237643de0b5c42f04b66348c28bf6b9c2d3f4a3b01af1d47
rspec-support (3.13.7) sha256=0640e5570872aafefd79867901deeeeb40b0c9875a36b983d85f54fb7381c47c
rubocop (1.84.2) sha256=5692cea54168f3dc8cb79a6fe95c5424b7ea893c707ad7a4307b0585e88dbf5f
rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
rubocop-performance (1.26.1) sha256=cd19b936ff196df85829d264b522fd4f98b6c89ad271fa52744a8c11b8f71834
ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
simplecov (0.22.0) sha256=fe2622c7834ff23b98066bb0a854284b2729a569ac659f82621fc22ef36213a5
simplecov-html (0.13.2) sha256=bd0b8e54e7c2d7685927e8d6286466359b6f16b18cb0df47b508e8d73c777246
simplecov_json_formatter (0.1.4) sha256=529418fbe8de1713ac2b2d612aa3daa56d316975d307244399fa4838c601b428
standard (1.54.0) sha256=7a4b08f83d9893083c8f03bc486f0feeb6a84d48233b40829c03ef4767ea0100
standard-custom (1.0.2) sha256=424adc84179a074f1a2a309bb9cf7cd6bfdb2b6541f20c6bf9436c0ba22a652b
standard-performance (1.9.0) sha256=49483d31be448292951d80e5e67cdcb576c2502103c7b40aec6f1b6e9c88e3f2
stringio (3.2.0) sha256=c37cb2e58b4ffbd33fe5cd948c05934af997b36e0b6ca6fdf43afa234cf222e1
tsort (0.2.0) sha256=9650a793f6859a43b6641671278f79cfead60ac714148aabe4e3f0060480089f
unicode-display_width (3.2.0) sha256=0cdd96b5681a5949cdbc2c55e7b420facae74c4aaf9a9815eee1087cb1853c42
unicode-emoji (4.2.0) sha256=519e69150f75652e40bf736106cfbc8f0f73aa3fb6a65afe62fefa7f80b0f80f

BUNDLED WITH
4.0.6