feat: v2 SDK rewrite with Pydantic + httpx by FrancescoSaverioZuppichini · Pull Request #84 · ScrapeGraphAI/scrapegraph-py

FrancescoSaverioZuppichini · 2026-04-14T20:54:24Z

Summary

Complete SDK rewrite matching the JS SDK 1:1:

Pydantic v2 for all request/response models with automatic camelCase serialization
httpx for sync and async HTTP clients
ApiResult[T] wrapper pattern - no exceptions, just status: "success" | "error"
Nested resources - sgai.crawl.start(), sgai.monitor.create(), sgai.history.list()
uv as package manager with modern src/ layout

Changes

Restructured from nested scrapegraph-py/ to root-level uv library
All Pydantic models in single schemas.py with CamelModel base class
Sync client (ScrapeGraphAI) and async client (AsyncScrapeGraphAI)
32 examples (16 sync + 16 async) for all endpoints
28 unit tests with mocked httpx
Simplified CI workflows for uv

API Surface

from scrapegraph_py import ScrapeGraphAI, ScrapeRequest

sgai = ScrapeGraphAI()  # reads SGAI_API_KEY from env
result = sgai.scrape(ScrapeRequest(url="https://example.com"))

if result.status == "success":
    print(result.data["results"]["markdown"]["data"])

Test plan

All 28 unit tests pass (uv run pytest tests/test_client.py -v)
Integration tests pass with real API key
Lint passes (uv run ruff check .)
CI workflow runs successfully

🤖 Generated with Claude Code

- Delete .agent/ documentation folder (unused) - Simplify CLAUDE.md from 370 to ~90 lines - Remove stale docs (HEALTHCHECK.md, IMPLEMENTATION_SUMMARY.md, TOON_INTEGRATION_SUMMARY.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

BREAKING CHANGE: Complete project restructure - Remove nested scrapegraph-py/ folder - Initialize as uv library with src/ layout - Clean slate for v2 API rewrite Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add ScrapeGraphAI sync client with httpx - Add AsyncScrapeGraphAI async client - Add Pydantic models for all request/response types - Add nested resources: crawl, monitor, history - Return ApiResult wrapper (never raises) - Support SGAI_API_KEY, SGAI_DEBUG, SGAI_TIMEOUT_S env vars API surface: - client.scrape(ScrapeRequest) - client.extract(ExtractRequest) - client.search(SearchRequest) - client.credits() - client.health() - client.crawl.start/get/stop/resume/delete - client.monitor.create/list/get/update/delete/pause/resume - client.history.list/get Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- scrape: basic, json extraction, pdf, multi-format, fetchconfig - extract: basic, with schema - search: basic, with extraction - crawl: basic, with formats - monitor: basic, with webhook - utilities: credits, health, history Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Delete types.py, everything in schemas.py - Remove Api prefix from response models - Pre-compile server timing regex - Fix json field shadowing with aliases Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Follows Pydantic v2 best practices for type safety Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Test credits, scrape, extract, search, history, crawl - Fix HttpUrl serialization (mode="json" in model_dump) - Add python-dotenv for loading .env Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace manual _to_camel with Pydantic's built-in alias_generator - CamelModel base class handles snake_case -> camelCase conversion - Simplify _serialize to single model_dump call - Add async versions of all 16 examples - Update README with expanded async client docs and examples table - Add banner from JS SDK Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add .pytest_cache/, .ruff_cache/, .mypy_cache/ to gitignore - Add common Python build/test artifacts - Remove obsolete update-requirements.yml workflow Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove obsolete pylint.yml and test.yml (referenced old structure) - Add ci.yml with simple lint + test jobs using uv - Update release.yml for root-level project - Update python-publish.yml for uv build Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Test request construction, response parsing, error handling - Mock httpx.Client.request instead of hitting real API - Test all endpoints: scrape, extract, search, crawl, monitor, history - Test HTTP errors (401, 402, 429), timeouts - Test camelCase serialization - Update CI to run test_client.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-04-14T20:54:39Z

Dependency Review

The following issues were found:

✅ 0 vulnerable package(s)
✅ 0 package(s) with incompatible licenses
✅ 0 package(s) with invalid SPDX license definitions
⚠️ 3 package(s) with unknown licenses.

See the Details below.

License Issues

uv.lock

Package	Version	License	Issue Type
pydantic	2.13.0	Null	Unknown License
pytest	9.0.3	Null	Unknown License
ruff	0.15.10	Null	Unknown License

OpenSSF Scorecard

Scorecard details

Package

Version

Score

Details

pip/annotated-types

0.7.0

Unknown

pip/anyio

4.13.0

Unknown

pip/certifi

2026.2.25

🟢 6.6

Details

Check	Score	Reason
Code-Review	🟢 5	Found 1/2 approved changesets -- score normalized to 5
Maintained	🟢 10	10 commit(s) and 2 issue activity found in the last 90 days -- score normalized to 10
Binary-Artifacts	🟢 10	no binaries found in the repo
Security-Policy	🟢 10	security policy file detected
Dangerous-Workflow	🟢 10	no dangerous workflow patterns detected
Token-Permissions	🟢 10	GitHub workflow tokens follow principle of least privilege
Pinned-Dependencies	🟢 5	dependency not pinned by hash detected -- score normalized to 5
CII-Best-Practices	⚠️ 0	no effort to earn an OpenSSF best practices badge detected
Fuzzing	⚠️ 0	project is not fuzzed
License	🟢 9	license file detected
Signed-Releases	⚠️ -1	no releases found
Packaging	🟢 10	packaging workflow detected
Branch-Protection	⚠️ 0	branch protection not enabled on development/release branches
SAST	⚠️ 0	SAST tool is not run on all commits -- score normalized to 0

pip/colorama

0.4.6

Unknown

pip/h11

0.16.0

🟢 4.4

Details

Check	Score	Reason
Token-Permissions	⚠️ 0	detected GitHub workflow tokens with excessive permissions
Maintained	⚠️ 0	0 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 0
Binary-Artifacts	🟢 10	no binaries found in the repo
Code-Review	🟢 5	Found 9/18 approved changesets -- score normalized to 5
Dangerous-Workflow	🟢 10	no dangerous workflow patterns detected
Packaging	⚠️ -1	packaging workflow not detected
Pinned-Dependencies	⚠️ 0	dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices	⚠️ 0	no effort to earn an OpenSSF best practices badge detected
Security-Policy	⚠️ 0	security policy file not detected
Fuzzing	🟢 10	project is fuzzed
License	🟢 10	license file detected
Signed-Releases	⚠️ -1	no releases found
Branch-Protection	⚠️ -1	internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
SAST	⚠️ 0	SAST tool is not run on all commits -- score normalized to 0

pip/httpcore

1.0.9

Unknown

pip/httpx

0.28.1

Unknown

pip/idna

3.11

Unknown

pip/iniconfig

2.3.0

Unknown

pip/packaging

26.0

Unknown

pip/pluggy

1.6.0

Unknown

pip/pydantic

2.13.0

Unknown

pip/pydantic-core

2.46.0

🟢 6.7

Details

Check	Score	Reason
Code-Review	🟢 10	all changesets reviewed
Maintained	🟢 10	30 commit(s) and 16 issue activity found in the last 90 days -- score normalized to 10
CII-Best-Practices	⚠️ 0	no effort to earn an OpenSSF best practices badge detected
Dangerous-Workflow	🟢 10	no dangerous workflow patterns detected
Token-Permissions	⚠️ 0	detected GitHub workflow tokens with excessive permissions
Binary-Artifacts	🟢 10	no binaries found in the repo
License	🟢 10	license file detected
Pinned-Dependencies	🟢 8	dependency not pinned by hash detected -- score normalized to 8
Fuzzing	🟢 10	project is fuzzed
Signed-Releases	⚠️ 0	Project has not signed or included provenance with any releases.
Branch-Protection	🟢 4	branch protection is not maximal on development and all release branches
Security-Policy	🟢 10	security policy file detected
Packaging	🟢 10	packaging workflow detected
SAST	⚠️ 0	SAST tool is not run on all commits -- score normalized to 0

pip/pygments

2.20.0

Unknown

pip/pytest

9.0.3

Unknown

pip/pytest-asyncio

1.3.0

Unknown

pip/python-dotenv

1.2.2

Unknown

pip/ruff

0.15.10

Unknown

pip/typing-extensions

4.15.0

Unknown

pip/typing-inspection

0.4.2

Unknown

Scanned Files

.github/workflows/test.yml
scrapegraph-py/requirements-test.txt
scrapegraph-py/uv.lock
uv.lock

- Run ruff format on src/ - Add ruff config to pyproject.toml (line-length=100, ignore E501) - Fix import ordering Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Format test files with ruff - Add per-file ignores for tests (F841, E402) - Update CI to check src/ tests/ only Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add ResponseModel base class with camelCase alias generator - Change all response models to inherit from ResponseModel - Use TypeAdapter for proper generic type parsing - Update all examples to use attribute access (res.data.results) - Fix all test mocks with complete required fields This follows industry standard SDK patterns where typed objects are returned for IDE autocompletion and type safety. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Test minimum supported version and latest stable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Convert dict-style access to Pydantic attribute access in all examples - Add polling loop to crawl examples (matches JS SDK behavior) - Add dotenv loading to all examples for easier local testing - Fix health endpoint to use /health instead of /healthz - Update CLAUDE.md with pre-commit checklist using ruff Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Default URL: https://api.scrapegraphai.com/api/v2 - Env var: SGAI_TIMEOUT_S -> SGAI_TIMEOUT Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Rebase base URL, env vars, and auth header onto the new scrapegraph-py v2 SDK contract (ScrapeGraphAI/scrapegraph-py#84): - Base URL: /api/v2 -> /v2 (default https://api.scrapegraphai.com/v2) - Env: SGAI_API_URL (SCRAPEGRAPH_API_BASE_URL kept as legacy alias) - Env: SGAI_TIMEOUT_S for httpx timeout (default 120s) - Drop Authorization: Bearer; keep SGAI-APIKEY only (matches SDK) - Update docstrings, resources, README, server.json, .agent docs to reference #84 and the /v2 base URL. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add MonitorActivityRequest, MonitorActivityResponse, MonitorTickEntry schemas - Add activity() method to MonitorResource (sync and async) - Update monitor examples to use activity() and show diffs nicely - Delete monitor on Ctrl+C cleanup Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add seen_ids deduplication - Cleanup in signal handler directly - Show "(no diffs data)" when changed but no diffs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…Store Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolved conflicts: - scrapegraph-py/CHANGELOG.md, pyproject.toml, client.py, async_client.py: accepted PR's delete — old subdir is being removed in v2 restructure to root - package.json: auto-merged repo URL update from main

VinciGit00 · 2026-04-18T14:00:34Z

Test results (updated)

Conflict resolution

Merged origin/main into feat/v2-migration (commit 7eca9cc):

scrapegraph-py/CHANGELOG.md, scrapegraph-py/pyproject.toml, scrapegraph-py/scrapegraph_py/client.py, scrapegraph-py/scrapegraph_py/async_client.py — accepted the PR's delete. Main only added a v1.12.3 deprecation notice to the old nested package; this PR restructures to a root-level src/ layout, so the old subdir is gone by design.
package.json — auto-merged (repo URL update from main)

Local validation (all green)

uv sync ✅
uv run ruff check src/ tests/ ✅ — all checks passed
uv run pytest tests/ -v ✅ — 28/28 pass (1 skipped) in 0.34s

Note: uv run ruff check examples/ reports 127 style errors (mostly I001 import sort + E402 because load_dotenv() is intentionally called before imports). Cosmetic, scoped to examples, not blocking.

Live integration tests — ⚠️ 2 bugs found

Ran against https://sgai-api-v2.onrender.com/api/v2 (production api.scrapegraphai.com rejects this key). See the follow-up comment for full details — the short version:

✅ credits, health(), scrape (default + fetch_config), extract, search, full crawl lifecycle all pass
❌ history.list crashes with a Pydantic validation error — HistoryEntry.result is typed as dict but the API returns null for pending/errored entries. Fix: result: dict | None = None at src/scrapegraph_py/schemas.py:442
❌ Method is sgai.health() but JS SDK uses sgai.healthy() — breaks the "1:1 parity" claim in the PR description. Pick one name and align both SDKs.

VinciGit00 · 2026-04-18T14:02:41Z

Live integration results — 2 bugs found ⚠️

Ran against https://sgai-api-v2.onrender.com/api/v2 (staging — production api.scrapegraphai.com rejects this key as invalid).

✅ Passing

Test	Status	Elapsed
`credits`	✅	—
`health()`	✅	71 ms
`scrape` (default markdown)	✅	177 ms
`scrape` with `fetch_config={mode:fast, timeout:15000}`	✅	232 ms
`extract`	✅	645 ms
`search` (num_results: 2)	✅	1252 ms
`crawl.start` → `crawl.get` → `crawl.stop`	✅	—

❌ Bugs

1. history.list crashes on entries with null result

1 validation error for HistoryPage
data.4.result
  Input should be a valid dictionary [type=dict_type, input_value=None, input_type=NoneType]

schemas.py:442 declares result: dict on HistoryEntry, but the API returns null for pending/errored/in-flight entries. Fix: make it optional.

# src/scrapegraph_py/schemas.py:442
result: dict | None = None

2. Method named health(), but JS SDK uses healthy() — breaks 1:1 parity claim

PR description says "SDK matching the JS SDK 1:1". JS exposes sgai.healthy(); Python exposes sgai.health(). Pick one and align both SDKs (the JS PR #13 uses healthy). Affects:

src/scrapegraph_py/client.py:242
src/scrapegraph_py/async_client.py (same method)
Examples under examples/utilities/health*.py

Credits consumed during Python run: ~12 (475 → 463). SDK core works; only the history deserialization is a real functional bug, the health vs healthy is a consistency issue.

FrancescoSaverioZuppichini and others added 11 commits April 14, 2026 22:10

feat!: restructure as uv library project

731d70d

BREAKING CHANGE: Complete project restructure - Remove nested scrapegraph-py/ folder - Initialize as uv library with src/ layout - Clean slate for v2 API rewrite Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

refactor: merge types into schemas, all Pydantic

e1d8033

- Delete types.py, everything in schemas.py - Remove Api prefix from response models - Pre-compile server timing regex - Fix json field shadowing with aliases Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: use ConfigDict for Pydantic model_config

45846ba

Follows Pydantic v2 best practices for type safety Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: add integration tests matching JS SDK

d1faf57

- Test credits, scrape, extract, search, history, crawl - Fix HttpUrl serialization (mode="json" in model_dump) - Add python-dotenv for loading .env Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

chore: update .gitignore, remove update-requirements.yml

22dbab5

- Add .pytest_cache/, .ruff_cache/, .mypy_cache/ to gitignore - Add common Python build/test artifacts - Remove obsolete update-requirements.yml workflow Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

FrancescoSaverioZuppichini and others added 9 commits April 14, 2026 22:59

chore: format code with ruff, add ruff config

1637a05

- Run ruff format on src/ - Add ruff config to pyproject.toml (line-length=100, ignore E501) - Fix import ordering Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

chore: fix lint for tests, add per-file ignores

3b4990c

- Format test files with ruff - Add per-file ignores for tests (F841, E402) - Update CI to check src/ tests/ only Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: add CONTRIBUTING.md with lint/format instructions

e9fbc47

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: update CONTRIBUTING.md to match JS SDK structure

b190d03

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ci: test on Python 3.12 and 3.14

500b442

Test minimum supported version and latest stable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: remove mypy from CLAUDE.md, CI only uses ruff

f53c01e

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: update default API URL and rename SGAI_TIMEOUT_S to SGAI_TIMEOUT

13902a9

- Default URL: https://api.scrapegraphai.com/api/v2 - Env var: SGAI_TIMEOUT_S -> SGAI_TIMEOUT Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

FrancescoSaverioZuppichini and others added 6 commits April 15, 2026 11:45

fix: update health test to mock httpx.Client.request

bac1751

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

style: apply ruff format to async_client.py

3ad0eee

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: match monitor examples to JS SDK style

d00efeb

- Add seen_ids deduplication - Cleanup in signal handler directly - Show "(no diffs data)" when changed but no diffs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

chore: link banner image to scrapegraphai.com and remove tracked .DS_…

6606a27

…Store Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: merge main into feat/v2-migration

7eca9cc

Resolved conflicts: - scrapegraph-py/CHANGELOG.md, pyproject.toml, client.py, async_client.py: accepted PR's delete — old subdir is being removed in v2 restructure to root - package.json: auto-merged repo URL update from main

chore: remove .python-version

f61c780

VinciGit00 approved these changes Apr 18, 2026

View reviewed changes

VinciGit00 merged commit 2dd5809 into main Apr 18, 2026
6 checks passed

VinciGit00 deleted the feat/v2-migration branch April 19, 2026 07:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: v2 SDK rewrite with Pydantic + httpx#84

feat: v2 SDK rewrite with Pydantic + httpx#84
VinciGit00 merged 27 commits intomainfrom
feat/v2-migration

FrancescoSaverioZuppichini commented Apr 14, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

VinciGit00 commented Apr 18, 2026 •

edited

Loading

Uh oh!

VinciGit00 commented Apr 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FrancescoSaverioZuppichini commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

API Surface

Test plan

Uh oh!

github-actions bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

License Issues

uv.lock

OpenSSF Scorecard

Scanned Files

Uh oh!

VinciGit00 commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test results (updated)

Conflict resolution

Local validation (all green)

Live integration tests — ⚠️ 2 bugs found

Uh oh!

VinciGit00 commented Apr 18, 2026

Live integration results — 2 bugs found ⚠️

✅ Passing

❌ Bugs

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FrancescoSaverioZuppichini commented Apr 14, 2026 •

edited

Loading

github-actions bot commented Apr 14, 2026 •

edited

Loading

VinciGit00 commented Apr 18, 2026 •

edited

Loading