feat: v2 SDK rewrite with Pydantic + httpx #84
- Delete .agent/ documentation folder (unused) - Simplify CLAUDE.md from 370 to ~90 lines - Remove stale docs (HEALTHCHECK.md, IMPLEMENTATION_SUMMARY.md, TOON_INTEGRATION_SUMMARY.md) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
BREAKING CHANGE: Complete project restructure - Remove nested scrapegraph-py/ folder - Initialize as uv library with src/ layout - Clean slate for v2 API rewrite Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ScrapeGraphAI sync client with httpx
- Add AsyncScrapeGraphAI async client
- Add Pydantic models for all request/response types
- Add nested resources: crawl, monitor, history
- Return ApiResult wrapper (never raises)
- Support SGAI_API_KEY, SGAI_DEBUG, SGAI_TIMEOUT_S env vars

API surface:
- client.scrape(ScrapeRequest)
- client.extract(ExtractRequest)
- client.search(SearchRequest)
- client.credits()
- client.health()
- client.crawl.start/get/stop/resume/delete
- client.monitor.create/list/get/update/delete/pause/resume
- client.history.list/get

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
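To illustrate the "never raises" contract of the ApiResult wrapper, here is a minimal standard-library sketch. The field names `data` and `error` and the helper `safe_call` are assumptions made for illustration; only the `status` values "success" and "error" come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Literal, Optional, TypeVar

T = TypeVar("T")


@dataclass
class ApiResult(Generic[T]):
    """Hypothetical result wrapper; `data` and `error` are assumed field names."""

    status: Literal["success", "error"]
    data: Optional[T] = None
    error: Optional[str] = None


def safe_call(fn: Callable[[], T]) -> ApiResult[T]:
    """Run fn and fold any exception into an error result instead of raising."""
    try:
        return ApiResult(status="success", data=fn())
    except Exception as exc:  # deliberately broad: the wrapper never raises
        return ApiResult(status="error", error=f"{type(exc).__name__}: {exc}")


ok = safe_call(lambda: {"remaining": 463})
bad = safe_call(lambda: 1 / 0)
```

Callers then branch on `status` rather than wrapping every call in try/except.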
- scrape: basic, json extraction, pdf, multi-format, fetchconfig - extract: basic, with schema - search: basic, with extraction - crawl: basic, with formats - monitor: basic, with webhook - utilities: credits, health, history Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Delete types.py, everything in schemas.py - Remove Api prefix from response models - Pre-compile server timing regex - Fix json field shadowing with aliases Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
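For context on the pre-compiled regex, a rough sketch of the idea follows. The actual pattern in the SDK is not shown in this PR; this version assumes the standard `Server-Timing` header shape (`name;dur=<milliseconds>`), and `parse_elapsed_ms` is a hypothetical helper name.

```python
import re
from typing import Optional

# Compiled once at import time instead of on every response.
_SERVER_TIMING_DUR = re.compile(r"dur=(\d+(?:\.\d+)?)")


def parse_elapsed_ms(header_value: Optional[str]) -> Optional[float]:
    """Return the first dur= value from a Server-Timing header, if any."""
    if not header_value:
        return None
    match = _SERVER_TIMING_DUR.search(header_value)
    return float(match.group(1)) if match else None


elapsed = parse_elapsed_ms("total;dur=177.5")
```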
Follows Pydantic v2 best practices for type safety Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Test credits, scrape, extract, search, history, crawl - Fix HttpUrl serialization (mode="json" in model_dump) - Add python-dotenv for loading .env Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
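The HttpUrl fix can be seen in isolation with a small Pydantic v2 sketch. The single-field `ScrapeRequest` below is a stand-in, not the SDK's real model. In the default Python mode `model_dump` keeps the URL as a URL object, which `json.dumps` cannot encode; `mode="json"` converts it to a plain string.

```python
import json

from pydantic import BaseModel, HttpUrl


class ScrapeRequest(BaseModel):
    """Stand-in request model; the real SDK model has more fields."""

    url: HttpUrl


req = ScrapeRequest(url="https://example.com/page")

python_dump = req.model_dump()           # url stays a URL object
json_dump = req.model_dump(mode="json")  # url becomes a str

body = json.dumps(json_dump)  # safe to send as a request body
```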
- Replace manual _to_camel with Pydantic's built-in alias_generator - CamelModel base class handles snake_case -> camelCase conversion - Simplify _serialize to single model_dump call - Add async versions of all 16 examples - Update README with expanded async client docs and examples table - Add banner from JS SDK Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
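A minimal sketch of the alias-generator approach, assuming Pydantic v2's `pydantic.alias_generators.to_camel`. The `SearchRequest` fields are illustrative (`num_results` appears in the integration table further down; `query` is an assumption), and the SDK's real `CamelModel` config may differ.

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel


class CamelModel(BaseModel):
    """Base class: snake_case attributes, camelCase on the wire."""

    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)


class SearchRequest(CamelModel):
    # Illustrative fields only.
    query: str
    num_results: int = 3


request = SearchRequest(query="pydantic", num_results=2)

# A single model_dump call produces the camelCase payload.
payload = request.model_dump(by_alias=True, mode="json")
```

With `populate_by_name=True` the model can still be constructed using the snake_case names, so user code never has to spell `numResults`.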
- Add .pytest_cache/, .ruff_cache/, .mypy_cache/ to gitignore - Add common Python build/test artifacts - Remove obsolete update-requirements.yml workflow Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove obsolete pylint.yml and test.yml (referenced old structure) - Add ci.yml with simple lint + test jobs using uv - Update release.yml for root-level project - Update python-publish.yml for uv build Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Test request construction, response parsing, error handling - Mock httpx.Client.request instead of hitting real API - Test all endpoints: scrape, extract, search, crawl, monitor, history - Test HTTP errors (401, 402, 429), timeouts - Test camelCase serialization - Update CI to run test_client.py Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Run ruff format on src/ - Add ruff config to pyproject.toml (line-length=100, ignore E501) - Fix import ordering Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Format test files with ruff - Add per-file ignores for tests (F841, E402) - Update CI to check src/ tests/ only Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ResponseModel base class with camelCase alias generator - Change all response models to inherit from ResponseModel - Use TypeAdapter for proper generic type parsing - Update all examples to use attribute access (res.data.results) - Fix all test mocks with complete required fields This follows industry standard SDK patterns where typed objects are returned for IDE autocompletion and type safety. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
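As a sketch of how `TypeAdapter` turns raw camelCase JSON into typed objects with attribute access, consider the following. The `SearchResult` fields are hypothetical, and the SDK's actual `ResponseModel` config is assumed to resemble the request-side alias setup.

```python
from typing import List

from pydantic import BaseModel, ConfigDict, TypeAdapter
from pydantic.alias_generators import to_camel


class ResponseModel(BaseModel):
    """Assumed base: accepts camelCase keys from the API."""

    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)


class SearchResult(ResponseModel):
    # Hypothetical fields for illustration.
    page_title: str
    url: str


# TypeAdapter validates arbitrary types, including parametrized generics
# such as List[SearchResult], not just BaseModel subclasses.
results_adapter = TypeAdapter(List[SearchResult])

raw = [{"pageTitle": "Example", "url": "https://example.com"}]
results = results_adapter.validate_python(raw)

first_title = results[0].page_title  # attribute access, not results[0]["pageTitle"]
```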
Test minimum supported version and latest stable. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert dict-style access to Pydantic attribute access in all examples - Add polling loop to crawl examples (matches JS SDK behavior) - Add dotenv loading to all examples for easier local testing - Fix health endpoint to use /health instead of /healthz - Update CLAUDE.md with pre-commit checklist using ruff Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
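The polling loop in the crawl examples could look roughly like this. The terminal status names, interval, and attempt cap are assumptions; `get_status` stands in for a call such as `client.crawl.get(...)` followed by reading the job status.

```python
import time
from typing import Callable

# Assumed terminal states; the real API may use different names.
TERMINAL_STATUSES = {"completed", "failed", "stopped"}


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it returns a terminal state or attempts run out."""
    for _ in range(max_attempts):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"crawl still running after {max_attempts} polls")


# Simulated status sequence standing in for repeated crawl.get calls.
statuses = iter(["running", "running", "completed"])
final_status = poll_until_done(lambda: next(statuses), sleep=lambda _: None)
```

Injecting `sleep` keeps the loop testable without real delays.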
- Default URL: https://api.scrapegraphai.com/api/v2 - Env var: SGAI_TIMEOUT_S -> SGAI_TIMEOUT Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rebase base URL, env vars, and auth header onto the new scrapegraph-py v2 SDK contract (ScrapeGraphAI/scrapegraph-py#84):
- Base URL: /api/v2 -> /v2 (default https://api.scrapegraphai.com/v2)
- Env: SGAI_API_URL (SCRAPEGRAPH_API_BASE_URL kept as legacy alias)
- Env: SGAI_TIMEOUT_S for httpx timeout (default 120s)
- Drop Authorization: Bearer; keep SGAI-APIKEY only (matches SDK)
- Update docstrings, resources, README, server.json, .agent docs to reference #84 and the /v2 base URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
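The configuration contract described in this commit can be sketched with the standard library alone. The function name `resolve_config` and the precedence of `SGAI_API_URL` over the legacy alias are assumptions; the env var names, default URL, default timeout, and header name are taken from the commit message.

```python
import os
from typing import Dict, Mapping, Tuple

DEFAULT_BASE_URL = "https://api.scrapegraphai.com/v2"
DEFAULT_TIMEOUT_S = 120.0


def resolve_config(env: Mapping[str, str] = os.environ) -> Tuple[str, float, Dict[str, str]]:
    """Return (base_url, timeout_s, headers) from environment variables."""
    # Assumed precedence: new name first, then the legacy alias, then the default.
    base_url = (
        env.get("SGAI_API_URL")
        or env.get("SCRAPEGRAPH_API_BASE_URL")
        or DEFAULT_BASE_URL
    )
    timeout_s = float(env.get("SGAI_TIMEOUT_S") or DEFAULT_TIMEOUT_S)
    # Only the SGAI-APIKEY header is sent; no Authorization: Bearer.
    headers = {"SGAI-APIKEY": env.get("SGAI_API_KEY", "")}
    return base_url.rstrip("/"), timeout_s, headers


defaults = resolve_config({})
legacy = resolve_config({"SCRAPEGRAPH_API_BASE_URL": "http://localhost:8080/v2/"})
```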
- Add MonitorActivityRequest, MonitorActivityResponse, MonitorTickEntry schemas - Add activity() method to MonitorResource (sync and async) - Update monitor examples to use activity() and show diffs nicely - Delete monitor on Ctrl+C cleanup Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add seen_ids deduplication - Cleanup in signal handler directly - Show "(no diffs data)" when changed but no diffs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
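A rough sketch of the deduplication and the "(no diffs data)" fallback, using plain dicts. The keys `id`, `changed`, and `diffs` are assumed shapes for a monitor tick entry, not confirmed fields of `MonitorTickEntry`.

```python
from typing import Any, Dict, Iterable, List, Set

Tick = Dict[str, Any]


def new_ticks(ticks: Iterable[Tick], seen_ids: Set[str]) -> List[Tick]:
    """Return only ticks not seen before, recording their ids as a side effect."""
    fresh = []
    for tick in ticks:
        tick_id = tick["id"]
        if tick_id in seen_ids:
            continue
        seen_ids.add(tick_id)
        fresh.append(tick)
    return fresh


def describe(tick: Tick) -> str:
    """Render one tick for the console."""
    if not tick.get("changed"):
        return "no change"
    return str(tick["diffs"]) if tick.get("diffs") else "(no diffs data)"


seen: Set[str] = set()
first_poll = new_ticks([{"id": "t1", "changed": False}], seen)
second_poll = new_ticks(
    [{"id": "t1", "changed": False}, {"id": "t2", "changed": True, "diffs": None}],
    seen,
)
```

Because each poll of the activity endpoint may return overlapping entries, keeping `seen_ids` across polls prevents the same tick from being printed twice.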
…Store Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts: - scrapegraph-py/CHANGELOG.md, pyproject.toml, client.py, async_client.py: accepted PR's delete — old subdir is being removed in v2 restructure to root - package.json: auto-merged repo URL update from main
Test results (updated)
Conflict resolution
Merged
Local validation (all green)
Note: Live integration tests —
Live integration results — 2 bugs found
| Test | Status | Elapsed |
|---|---|---|
| credits | ✅ | - |
| health() | ✅ | 71 ms |
| scrape (default markdown) | ✅ | 177 ms |
| scrape with fetch_config={mode:fast, timeout:15000} | ✅ | 232 ms |
| extract | ✅ | 645 ms |
| search (num_results: 2) | ✅ | 1252 ms |
| crawl.start → crawl.get → crawl.stop | ✅ | - |
❌ Bugs
1. history.list crashes on entries with null result
1 validation error for HistoryPage
data.4.result
Input should be a valid dictionary [type=dict_type, input_value=None, input_type=NoneType]
schemas.py:442 declares result: dict on HistoryEntry, but the API returns null for pending/errored/in-flight entries. Fix: make it optional.
# src/scrapegraph_py/schemas.py:442
result: dict | None = None

2. Method named health(), but JS SDK uses healthy(): breaks 1:1 parity claim
PR description says "SDK matching the JS SDK 1:1". JS exposes sgai.healthy(); Python exposes sgai.health(). Pick one and align both SDKs (the JS PR #13 uses healthy). Affects:
- src/scrapegraph_py/client.py:242
- src/scrapegraph_py/async_client.py (same method)
- Examples under examples/utilities/health*.py
Credits consumed during Python run: ~12 (475 → 463). SDK core works; only the history deserialization is a real functional bug, while health vs healthy is a consistency issue.
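The history.list failure from bug 1 can be reproduced in isolation with Pydantic v2. The two models below are stripped-down stand-ins for HistoryEntry (the `id` field is an assumption); `Optional[dict]` is equivalent to the `dict | None` in the proposed fix and also runs on older Python versions.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class StrictEntry(BaseModel):
    """Mirrors the current declaration: result must be a dict."""

    id: str
    result: dict


class HistoryEntry(BaseModel):
    """Mirrors the proposed fix: result may be null or absent."""

    id: str
    result: Optional[dict] = None


# Shape of a pending/errored entry as described in the bug report.
payload = {"id": "abc", "result": None}

try:
    StrictEntry.model_validate(payload)
    strict_failed = False
except ValidationError:
    strict_failed = True  # same dict_type error as in the traceback

entry = HistoryEntry.model_validate(payload)
```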
Summary
Complete SDK rewrite matching the JS SDK 1:1:
- status: "success" | "error"
- sgai.crawl.start(), sgai.monitor.create(), sgai.history.list()
- src/ layout
Changes
- scrapegraph-py/ to root-level uv library
- schemas.py with CamelModel base class
- Sync client (ScrapeGraphAI) and async client (AsyncScrapeGraphAI)
API Surface
Test plan
- uv run pytest tests/test_client.py -v
- uv run ruff check .
🤖 Generated with Claude Code