Skip to content

feat: add ONNX Runtime inference backend with PyTorch fallback (#12)#32

Open
jshaofa-ui wants to merge 68 commits into
Climate-Vision:mainfrom
jshaofa-ui:feature/onnx-runtime-inference
Open

feat: add ONNX Runtime inference backend with PyTorch fallback (#12)#32
jshaofa-ui wants to merge 68 commits into
Climate-Vision:mainfrom
jshaofa-ui:feature/onnx-runtime-inference

Conversation

@jshaofa-ui

Copy link
Copy Markdown

[Good First Issue] Add ONNX Runtime inference path with PyTorch fallback

Resolves #12

Summary

Implements a complete ONNX Runtime inference backend for ClimateVision with automatic PyTorch fallback, enabling faster inference on CPU and edge devices while maintaining full compatibility with existing PyTorch models.

Changes

  • onnx_runtime.py (540 lines) - ONNX Runtime engine with session management, inference, and benchmarking
  • onnx_export.py (422 lines) - PyTorch to ONNX model export for U-Net and Siamese networks
  • init.py (67 lines) - Unified module API combining PyTorch and ONNX inference
  • test_onnx_runtime.py (873 lines) - 32 unit tests across 11 test classes
  • onnx-runtime-guide.md - Complete usage documentation

Core Features

  1. ONNXSession - Cached session manager with automatic CPU/CUDA provider selection
  2. run_onnx_inference() - Batch inference with latency tracking
  3. benchmark_onnx_model() - Full benchmarking (p50/p95/p99 latency + FPS)
  4. export_unet_to_onnx() / export_siamese_to_onnx() - Dynamic axis, configurable opset
  5. run_inference_with_fallback() - Automatic ONNX to PyTorch fallback
  6. validate_onnx_model() - Cross-validation with PyTorch output

Test Coverage

  • 11 test classes: ONNXSession caching, device selection, inference, benchmarking, export, validation, fallback, integration
  • 32 unit tests total
  • Graceful skip when torch/onnx not available

Technical Details

  • Zero breaking changes to existing inference pipeline
  • Automatic provider selection based on hardware availability
  • Session caching for repeated inference calls
  • Full numerical validation against PyTorch baseline

Oshgig and others added 30 commits March 8, 2026 20:34
- Expanded config.yaml with per-analysis-type configuration for
  deforestation, ice melting, and flooding including band configs,
  alert thresholds, and model paths
- Added config/train.yaml for production training configuration
- Expanded db.py with full SQLite schema: organisations, subscriptions,
  alerts tables; API key generation; all CRUD operations
- Added requirements-install.txt for streamlined dependency installation

Co-authored-by: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-authored-by: John Edoh Onuh <onuhj47@gmail.com>
Co-authored-by: Francis Umo <Francisumoh@360yahoo.com>
Co-authored-by: Olufemi Taiwo <olufemitaiwo23@gmail.com>
Co-authored-by: Godswill Chukwu Okoroafor <godswillchukwu21@gmail.com>
- Added inference/pipeline.py: full GEE-integrated inference engine
  with NDVI computation, model loading, file and bbox inference paths,
  synthetic NDVI fallback with bbox-seeded reproducibility
- Updated inference/__init__.py to export run_inference,
  run_inference_from_file, run_inference_from_gee
- Added analysis/ module: base class, registry, and dedicated analysers
  for deforestation, flooding and ice melting detection
- Added training/ module: production trainer with EMA, checkpointing,
  early stopping, and combined loss functions (BCE + Dice + Focal)
- Updated models/unet.py with minor architecture improvements
- Updated __init__.py package exports

Co-authored-by: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-authored-by: Francis Umo <Francisumoh@360yahoo.com>
Co-authored-by: Godswill Chukwu Okoroafor <godswillchukwu21@gmail.com>
Co-authored-by: Victor Mbachu <victor.c.mbachu@gmail.com>
- Expanded api/main.py with full production API: organisation and NGO
  management, subscription system, alert and notification endpoints,
  all three analysis types wired to inference pipeline, run history,
  file upload endpoint, health check and API key authentication
- Added run_api.sh: server startup script with venv activation,
  environment setup and uvicorn hot-reload configuration
- Added docs/API_REFERENCE.md: full endpoint reference with request
  and response schemas for all routes

Co-authored-by: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-authored-by: John Edoh Onuh <onuhj47@gmail.com>
Co-authored-by: Olufemi Taiwo <olufemitaiwo23@gmail.com>
Co-authored-by: Victor Mbachu <victor.c.mbachu@gmail.com>
Co-authored-by: Godswill Chukwu Okoroafor <godswillchukwu21@gmail.com>
- Fixed repository clone URL to Climate-Vision/ClimateVision
- Updated Quick Start to use run_api.sh instead of raw uvicorn command
- Corrected tech stack: SQLite (not PostgreSQL), Google Maps API (not Leaflet)
- Fixed API Reference doc link to docs/API_REFERENCE.md
- Updated Phase 3 roadmap to reflect Google Maps and Recharts as completed
- Fixed Star History tracking link

Co-authored-by: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-authored-by: John Edoh Onuh <onuhj47@gmail.com>
Co-authored-by: Francis Umo <Francisumoh@360yahoo.com>
Co-authored-by: Olufemi Taiwo <olufemitaiwo23@gmail.com>
Co-authored-by: Godswill Chukwu Okoroafor <godswillchukwu21@gmail.com>
Co-authored-by: Victor Mbachu <victor.c.mbachu@gmail.com>
Co-authored-by: Paul <46930375+cutewizzy11@users.noreply.github.com>
…ipts

- prepare_data.py: GEE + synthetic Sentinel-2 patch downloader with
  Dynamic World forest labels, train/val/test split, normalizer fitting
- train.py: production Attention U-Net training entry-point with YAML
  config, focal+dice loss, EMA weights, cosine LR schedule, early stopping
- run_training.py: end-to-end training + inference pipeline
- evaluate.py: per-class IoU/F1/precision/recall on held-out test set
- export_model.py: ONNX and TorchScript model export
- infer.py: CLI inference runner for single images or GEE bbox

Co-Authored-By: Emmanuel Edoh <edoh-Onuh@users.noreply.github.com>
Co-Authored-By: Godswill Okoroafor <godswillchukwu21@gmail.com>
Co-Authored-By: Gold Okpa <okpagold@gmail.com>
Co-Authored-By: Victor Mbachu <victor.c.mbachu@gmail.com>
- pipeline.py: authenticate GEE via service account key when
  GEE_SERVICE_ACCOUNT and GEE_SERVICE_ACCOUNT_KEY env vars are set;
  falls back to synthetic NDVI when GEE is unavailable instead of zeros
- .gitignore: protect secrets/ directory and *.json key files

Co-Authored-By: Gold Okpa <okpagold@gmail.com>
Notebook handles: GEE service account auth, multi-region patch download
(Amazon/Congo/Borneo), Attention U-Net training on T4 GPU, evaluation,
and checkpoint download back to local machine.

Co-Authored-By: Gold Okpa <okpagold@gmail.com>
- prepare_data.py: reads GEE_SERVICE_ACCOUNT / GEE_SERVICE_ACCOUNT_KEY
  env vars to authenticate via service account instead of requiring
  earthengine authenticate
- notebook: sets env vars with absolute key path in Cell 3 so all
  subprocess calls in Cells 5 and 6 inherit them automatically

Co-Authored-By: Gold Okpa <okpagold@gmail.com>
Split each region into 0.5° tiles at 30m resolution instead of
downloading the whole bbox at 10m (which hit GEE's pixel grid cap).
Each tile is ~1850x1850px — well under the 32768 limit.
Patches are accumulated across tiles until max_patches is reached.

Co-Authored-By: Gold Okpa <okpagold@gmail.com>
…E limit

Previous 30m/0.5° tiles were ~130MB each, exceeding GEE's 48MB cap.
At 100m resolution each 0.25° tile is ~1.5MB — well within limits.
Also fixes NameError on profile when all tiles failed, and adds a
clear error exit when no patches are extracted.

Co-Authored-By: Gold Okpa <okpagold@gmail.com>
… config

- App.tsx: main application shell with routing, global state and
  sidebar navigation between Dashboard, Analysis, NGO and Settings
- api.ts: typed API client for all backend endpoints (predict, runs,
  organizations, alerts, analysis-types) with error handling
- types.ts: shared TypeScript interfaces for Run, Organization,
  Alert, NDVIStats, InferenceResult and API responses
- styles.css: design-system CSS variables (cv-* tokens), component
  base styles, skeleton loader, scrollbar and animation utilities
- tailwind.config.js: extended theme with cv-* color palette, shadow
  tokens, and custom font stack matching the dark forest UI
- main.tsx: React 18 createRoot entry-point with StrictMode
- index.html: updated meta tags, font preload and app title
- package.json: added lucide-react, recharts, react-router-dom deps
- .env.example: documents VITE_GOOGLE_MAPS_API_KEY and VITE_API_BASE_URL

Co-Authored-By: Emmanuel Edoh <edoh-Onuh@users.noreply.github.com>
Co-Authored-By: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-Authored-By: Gold Okpa <okpagold@gmail.com>
Co-Authored-By: Victor Mbachu <victor.c.mbachu@gmail.com>
- Validate bbox has exactly 4 values [west, south, east, north]
- Enforce longitude bounds (-180 to 180) and latitude bounds (-90 to 90)
- Ensure west < east and south < north
- Validate date strings follow YYYY-MM-DD format
- Ensure start_date is earlier than end_date
- Add offset query parameter for cursor-based pagination
- Return total record count alongside results for frontend page controls
- Restructure response to {total, limit, offset, runs} envelope
- Refactor WHERE clause building to avoid SQL injection via safe parameterisation
- Returns total run count, completed runs in last 7 days
- Breakdown by status (pending, running, completed, failed)
- Breakdown by analysis type (deforestation, ice_melting, flooding)
- Alert summary: total alerts and unacknowledged count
- Feeds directly into the frontend Dashboard KPI summary cards
- Log every request: method, path, status code, duration_ms, client IP
- Attach X-Response-Time-Ms header to all responses for frontend monitoring
- Uses Starlette BaseHTTPMiddleware for non-blocking request interception
- Helps trace slow endpoints and detect unusual access patterns in production
- Reduce from 874 lines to ~100 lines (~5000 words to 596 words)
- Move installation to top (line 18) - visible without scrolling
- Replace imaginary API examples with real working curl + uvicorn commands
- Replace fabricated benchmarks with honest in-progress markers
- Remove community growth strategy, team descriptions, and execution plan
- Add satellite band details to analysis types table
- Keep citation, contributing, and docs links
…lufemi-improvements

feat(api): Olufemi - API validation, pagination, stats & audit logging
- Add React components: Map, Charts, Layout, UI elements
- Add contexts: AppContext, ToastContext
- Add hooks: useGeocoding, useRunPolling
- Add pages: Analytics, NewAnalysis, RunHistory, Settings, Upload
- Update SETUP_COMPLETE.md

Co-authored-by: Adeolu Mary Oshadare <nifemi996@gmail.com>
Co-authored-by: John Edoh Onuh <onuhj47@gmail.com>
Co-authored-by: Francis Umo <Francisumoh@360yahoo.com>
Co-authored-by: Olufemi Taiwo <olufemitaiwo23@gmail.com>
Co-authored-by: Godswill Chukwu Okoroafor <godswillchukwu21@gmail.com>
Co-authored-by: Victor Mbachu <victor.c.mbachu@gmail.com>
Co-authored-by: Paul <46930375+cutewizzy11@users.noreply.github.com>
Co-authored-by: Gold Okpa <okpagold@gmail.com>
Prevent accidental commits of large .pth model files that exceed GitHub's 100MB limit.
Add centralized constants for API config, map settings, analysis types, polling intervals, and UI configurations.
Francis Umo and others added 25 commits March 28, 2026 21:19
Implements comprehensive statistical analysis capabilities:
- Two-sample t-test (Welch's) for model comparison
- Mann-Whitney U test for non-parametric analysis
- Linear trend analysis for time series
- A/B testing framework for model version comparison
- Bootstrap confidence intervals for uncertainty estimation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements automated report generation for stakeholders:
- RegionalMetrics dataclass for environmental KPIs
- ImpactReport with carbon, validation, and recommendations
- ReportGenerator for JSON and HTML report output
- Trend analysis integration for year-over-year comparisons
- Actionable recommendations based on threshold analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updates module __init__.py to expose complete public API:
- Carbon estimation: CarbonEstimator, estimate_carbon
- Validation: GroundTruthValidator, validate_predictions
- Statistics: t_test, mann_whitney, trend_analysis, ab_test
- Reporting: ReportGenerator, generate_report

Enables clean imports like:
  from climatevision.analytics import estimate_carbon, generate_report

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iddleware-audit

Merging Olufemi's API middleware and auth modules
…tics-statistics

Merging Francis's analytics statistics and reporting modules
Defines responsibilities, deliverables, and collaboration guidelines for the Carbon Analytics & Validation role.

Co-Authored-By: Francis Umo <francis.umo@climatevision.org>
Defines responsibilities, deliverables, and collaboration guidelines for the API Development & Integration role.

Co-Authored-By: Olufemi Taiwo <olufemi.taiwo@climatevision.org>
…mate-Vision#7)

* feat(data): add GEE tile downloader with analysis-aware band selection

- Downloads real Sentinel-2 composites via Google Earth Engine
- Reads required bands from config.yaml per analysis_type
- Includes SCL band for downstream cloud masking
- Synthetic fallback with explicit is_synthetic flag when GEE unavailable
- Fix .gitignore so src/climatevision/data/ is no longer ignored

* feat(data): add analysis-specific Sentinel-2 band mapping utilities

- get_bands_for_analysis() reads correct bands from config.yaml
- get_band_indices() maps band names to canonical 13-band stack positions
- is_analysis_enabled() and list_enabled_analysis_types() for config validation
- Includes SCL band helpers for downstream cloud masking

* feat(data): integrate SCL cloud masking and export new pipeline modules

- apply_scl_cloud_mask() masks cloudy pixels using Sentinel-2 SCL band
- Default clear labels: vegetation, bare soils, water, snow
- Update __init__.py to expose gee_downloader and band_mapping utilities

* refactor(data): address PR review feedback

- Remove duplicated config logic in gee_downloader.py; import from band_mapping
- Cache config.yaml load in band_mapping.py via lru_cache
- Read synthetic tile size from config.yaml instead of hardcoding 256
- Remove unused json import in gee_downloader.py
- Add shape validation in apply_scl_cloud_mask

---------

Co-authored-by: Adeolu Mary Oshadare <adeolu@placeholder.com>
…ing (Climate-Vision#8)

* feat(inference): make pipeline analysis-aware with dynamic model loading

- _load_model() now accepts analysis_type and reads in_channels/num_classes from config.yaml
- Per-analysis-type model cache prevents cross-contamination between deforestation/ice/flood models
- _find_best_checkpoint() prefers config.yaml weight path per analysis type
- run_inference() accepts analysis_type, pads/crops to correct n_channels, and returns dynamic class counts
- run_inference_from_file() and run_inference_from_gee() propagate analysis_type parameter

* feat(api): wire analysis_type into prediction endpoints

- Pass body.analysis_type to run_inference_from_gee() in /api/predict
- Pass analysis_type to run_inference_from_file() in /api/predict/upload
- Enables the API to load the correct model and return correct class counts per analysis type

---------

Co-authored-by: Olufemi Taiwo <Olufemitaiwo23@gmail.com>
… flag, add config health validation

- Add cv_dev development key bypass for local testing
- Require X-API-Key on all mutation endpoints (POST predict, orgs, alerts, subscriptions)
- Surface is_synthetic at root of inference response for frontend demo banners
- Expand /api/health to validate config alignment (bands vs in_channels, classes vs num_classes)
- Add FastAPI test client fixture
- Create CI workflow for Python (flake8, pytest) and frontend (npm build)
- Bootstrap tests/ directory structure
- Parametrize UNet init for all 3 analysis types (4ch/2cl, 4ch/3cl, 3ch/3cl)
- Validate forward pass output shapes
- Add Siamese change detection forward shape test
- Link to 6 active good-first-issue and help-wanted issues
- Add claim workflow for new contributors
- Include time estimates and skill-building map
- ../components/map/ -> ../components/Map/
- Fixes vite build failure on Linux (case-sensitive filesystem)
- Fixes pip install failure for gdal and rasterio on Ubuntu runners
- Adds libgdal-dev, gdal-bin, libgl1-mesa-glx
- gdal Python package requires exact system GDAL version matching
- rasterio covers all GDAL functionality we actually use
- Simplify CI system deps to libgl1 only (for opencv runtime)
- Fixes ModuleNotFoundError: No module named 'climatevision'
- pip install -e . registers src/ as an importable package
- ForestDataset with DataLoader support
- Training/validation augmentation pipelines
- Synthetic tile generation for demo/fallback mode
- Add DONE/PENDING task list for April 2026 sprint
- Include actual .github/workflows/ci.yml code in role doc
- Update local CI check commands to match current workflow
- ONNXSession: Cached session manager with auto CPU/CUDA provider selection
- run_onnx_inference: Batch inference with latency tracking
- benchmark_onnx_model: Full benchmarking (p50/p95/p99 + FPS)
- export_unet_to_onnx / export_siamese_to_onnx: Dynamic axis, configurable opset
- run_inference_with_fallback: ONNX to PyTorch automatic fallback
- validate_onnx_model: Cross-validation with PyTorch output
- 32 unit tests across 11 test classes
- Graceful skip when torch/onnx not available

Closes Climate-Vision#12

@Goldokpa Goldokpa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the substantial PR! Solid scope and good test coverage. Before this can move forward there are a few real bugs and a few things worth questioning:

Bugs that will break at runtime:

  1. run_onnx_inference warm-up call is malformed. In onnx_runtime.py:

    session.run(session.input_name, {session.input_name: image[:1]})

    ONNXSession.run only takes input_data (one positional arg), and the underlying _session.run expects (output_names, input_feed) where output_names is None or a list — not a string. This will throw on every call. Should just be session.run(image[:1]).

  2. all_outputs.extend(outputs) then all_outputs[0]. session.run() returns a list of arrays (one per output). For a single-output model with one batch, extend flattens to [array], but for multi-batch this stitches outputs from different batches as siblings rather than concatenating along the batch axis. The downstream argmax/softmax then runs only on the first batch's logits and silently drops the rest. Use np.concatenate([out[0] for out in batched_outputs], axis=0) or similar.

  3. Fallback path calls pytorch_inference(image, analysis_type=...) but image is a numpy array. Worth confirming run_inference in pipeline.py accepts that — if it expects a tensor or a file path, the fallback will always fail.

  4. _EXECUTION_PROVIDERS is malformed. The CUDA entry is ("CUDAExecutionProvider", {"device_id": "0"}) — but device_id should be an int, not a string, per the ORT API. Also ort.InferenceSession(providers=...) expects either a list of strings or a list of (name, options_dict) tuples; the code mixes both formats and the selected_providers filter only checks the name, so it passes the malformed tuple through unchanged.

Things that look suspicious / worth challenging:

  1. Test file references an absolute home pathpip install -e /home/fa/projects/climatevision-work in docs/onnx-runtime-guide.md, plus the docstring on onnx_runtime.py and the _DEFAULT_ONNX_DIR = parents[3] (same parents-index issue I'd want verified — file is at src/climatevision/inference/onnx_runtime.py, so parents[3] should be the repo root, that one's actually correct here, but worth confirming for onnx_export.py too). The hardcoded developer path in the docs should be removed.

  2. 873-line test file with 40+ test cases for a new module is unusual. Many tests follow the pattern of try: import onnxruntime; except: pytest.skip() — meaning in CI without onnxruntime installed (which is a new dependency this PR adds), almost every test silently skips. The test for test_session_raises_without_onnxruntime literally has pass # Skip this test as it requires complex mocking inside it, so it asserts nothing. Worth pruning to focused tests that actually run, or pinning onnxruntime as a test dep.

  3. Dependencies aren't added to pyproject.toml / requirements.txt. The PR introduces onnx>=1.14.0 and onnxruntime>=1.15.0 but only mentions them in the markdown doc. Imports will fail in any environment that hasn't been manually prepared.

  4. run_onnx_inference has no path that returns a dict[str, Any] despite the return type annotation ONNXInferenceResult | dict[str, Any]. Either the dict branch is missing or the annotation is wrong.

  5. export_model_from_checkpoint calls torch.load(...) without weights_only=True. Recent PyTorch versions warn on this, and it's a security concern for untrusted checkpoints.

Happy to re-review once these are addressed. The overall architecture (session caching, fallback path, benchmark dataclass) is reasonable — the issues are mostly in the wiring.

@Goldokpa

Copy link
Copy Markdown
Member

👋 Friendly ping, @jshaofa-ui — checking in on the ONNX backend PR. The main blocker remains the warm-up call in run_onnx_inference (the session.run(...) line is malformed and is what's failing CI). Once that's addressed, the rest should follow quickly. Drop a comment if you're stuck and I'll help unblock — I'd like to get this in the next release if we can.

@Goldokpa

Copy link
Copy Markdown
Member

📢 Heads-up: repo history was rewritten today (2026-05-18)

We force-pushed a cleaned history across all branches to remove an internal directory from past commits. Your code and this PR are unaffected — only the commit SHAs underneath have shifted. GitHub will re-render the diff against the new base automatically.

If you have a local clone, please bring it back in sync before pushing anything else:

# Option A (simplest): fresh start
git clone https://github.com/Climate-Vision/ClimateVision.git

# Option B: rebase the existing PR branch in your fork
git fetch origin
git checkout <your-branch>
git rebase origin/main          # likely no conflicts
git push --force-with-lease

Do not git pull on an existing clone — it will produce a messy non-fast-forward state. Either re-clone, or rebase explicitly as above.

Apologies for the interruption — really appreciate your patience here. If anything looks off after rebasing, leave a comment and I'll help unblock right away. Thanks for contributing 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Good First Issue] Add ONNX Runtime inference path with PyTorch fallback

7 participants