
Add waterdata.get_nearest_continuous helper #239

Merged
thodson-usgs merged 5 commits into DOI-USGS:main from thodson-usgs:add-get-nearest-continuous on Apr 23, 2026

Conversation

@thodson-usgs (Collaborator) commented Apr 23, 2026

Summary

Adds waterdata.get_nearest_continuous(targets, ...) — for each target timestamp, returns the single continuous observation closest to that timestamp, fetched in one HTTP round-trip (auto-chunked when the CQL filter gets long).

Why

The Water Data API's time= parameter treats a single instant as an exact match, not a nearest-match — time=2023-06-15T10:30:31Z on a 15-minute gauge returns 0 rows. The advertised sortby parameter would make "nearest" expressible as filter=time <= 'target' & sortby=-time & limit=1, but sortby is per-query, so N targets would mean N HTTP round-trips. There is no T_NEAREST CQL function either.

The narrow-window + client-side reduction implemented here is the one pattern that folds N targets into a single request today.
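The narrow-window pattern above can be sketched as follows. This is an illustrative reconstruction, not the helper's actual internals: the function name `build_window_filter` is hypothetical, but the clause shape matches the decoded filter shown later in this PR.

```python
import pandas as pd

def build_window_filter(targets, window="PT7M30S"):
    """Build one CQL filter OR-ing a bracketed time window per target.

    Hypothetical sketch: N targets fold into a single filter string,
    so the whole lookup fits in one request.
    """
    half = pd.Timedelta(window)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    clauses = [
        f"(time >= '{(t - half).strftime(fmt)}'"
        f" AND time <= '{(t + half).strftime(fmt)}')"
        for t in pd.to_datetime(targets, utc=True)
    ]
    return " OR ".join(clauses)

targets = pd.to_datetime(["2023-06-15T10:30:31Z", "2023-06-15T14:07:12Z"])
print(build_window_filter(targets))
```

Each clause brackets ±7min30s around one target; the OR-chain grows linearly with the target count, which is why the underlying getter auto-chunks long filters.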

Usage

import pandas as pd
from dataretrieval import waterdata

targets = pd.to_datetime([
    "2023-06-15T10:30:31Z",
    "2023-06-15T14:07:12Z",
    "2023-06-16T03:45:19Z",
])

df, md = waterdata.get_nearest_continuous(
    targets,
    monitoring_location_id="USGS-02238500",
    parameter_code="00060",
)

# df: one row per target, augmented with a `target_time` column
# identifying which target each row corresponds to.

Knobs:

  • window="7min30s" — half-window around each target. The default matches the 15-minute continuous cadence so most windows contain exactly one observation. Widen (e.g. "15min", "30min") for irregular cadences or resilience to data gaps.
  • on_tie="first" — how to resolve ties when a target falls at the midpoint between two grid points (rare but possible). Alternatives: "last" (keep the later observation), "mean" (average numeric columns; set time to the target).

Multi-site calls return one row per (target, monitoring_location_id) pair. Targets with no observations in their window are silently dropped. Passing time=, filter=, or filter_lang= raises TypeError — the helper builds those itself.
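The client-side reduction and the three `on_tie` modes can be sketched like this. The function `pick_nearest` is hypothetical (the helper's real internals may differ), but the tie semantics match the knobs described above:

```python
import pandas as pd

def pick_nearest(df, target, on_tie="first"):
    """From rows inside one target's window, keep the closest row.

    Illustrative sketch of the reduction; assumes a tz-aware ``time``
    column as in the usage example above.
    """
    delta = (df["time"] - target).abs()
    nearest = df[delta == delta.min()]  # >1 row only on an exact tie
    if len(nearest) == 1 or on_tie == "first":
        return nearest.iloc[[0]]
    if on_tie == "last":
        return nearest.iloc[[-1]]
    # "mean": average numeric columns, stamp time with the target itself
    row = nearest.mean(numeric_only=True).to_frame().T
    row["time"] = target
    return row
```

With a target exactly midway between two 15-minute grid points, `"first"` keeps the earlier row, `"last"` the later, and `"mean"` averages the numeric columns and sets `time` to the target.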

Reproduce the live result in one shot

The exact URL the helper generated from the 5-target example in the test plan below — paste it into a browser or curl (no API key is required; a key only raises rate limits):

https://api.waterdata.usgs.gov/ogcapi/v0/collections/continuous/items?monitoring_location_id=USGS-02238500&parameter_code=00060&filter=%28time+%3E%3D+%272023-06-15T10%3A23%3A01Z%27+AND+time+%3C%3D+%272023-06-15T10%3A38%3A01Z%27%29+OR+%28time+%3E%3D+%272023-06-15T13%3A59%3A42Z%27+AND+time+%3C%3D+%272023-06-15T14%3A14%3A42Z%27%29+OR+%28time+%3E%3D+%272023-06-16T03%3A37%3A49Z%27+AND+time+%3C%3D+%272023-06-16T03%3A52%3A49Z%27%29+OR+%28time+%3E%3D+%272023-06-16T18%3A52%3A15Z%27+AND+time+%3C%3D+%272023-06-16T19%3A07%3A15Z%27%29+OR+%28time+%3E%3D+%272023-06-17T06%3A06%3A32Z%27+AND+time+%3C%3D+%272023-06-17T06%3A21%3A32Z%27%29&skipGeometry=False&limit=50000&filter-lang=cql-text

Decoded, the filter is five OR'd ±7min30s windows, one around each target timestamp. The response should contain one feature per target, all from gauge USGS-02238500 with parameter_code 00060, with values ≈ 22.4 ft³/s.
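To read the filter in plain text, percent-decode the query parameter with the stdlib; the string below is a shortened fragment of the URL above:

```python
from urllib.parse import unquote_plus

# First window clause from the URL above, still percent-encoded
encoded = (
    "%28time+%3E%3D+%272023-06-15T10%3A23%3A01Z%27+AND+"
    "time+%3C%3D+%272023-06-15T10%3A38%3A01Z%27%29"
)
print(unquote_plus(encoded))
# → (time >= '2023-06-15T10:23:01Z' AND time <= '2023-06-15T10:38:01Z')
```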

Relationship to #238

This PR is built on top of #238 (Add CQL filter passthrough to OGC waterdata getters) and will look lighter once that lands. The helper's core trick — fanning N targets into one request — is only possible because #238 adds filter= support + automatic URL-length-safe chunking to get_continuous. The branch add-get-nearest-continuous is stacked on add-ogc-cql-filter-passthrough, so until #238 merges the diff here will include both changesets; after #238 merges the commits on its branch become common ancestors and this PR's diff reduces to the one commit introducing get_nearest_continuous and its tests.

Please merge #238 first.

Test plan

  • ruff check / format pass.
  • pytest tests/waterdata_nearest_test.py — 14/14 pass. Covers basic reduction, CQL filter shape, all three tie modes, missing-window drop, multi-site fan-out, empty targets, kwarg validation, and **kwargs forwarding.
  • Full non-live suite — 111/111 pass.
  • Live end-to-end against USGS-02238500 00060 with 5 off-grid targets (10:30:31, 14:07:12, 03:45:19, 18:59:45, 06:14:02) — one request, five ±7min30s windows OR'd together, five rows returned with deltas −432s to +58s (all inside the half-window). See the URL above.

🤖 Generated with Claude Code

@thodson-usgs (Collaborator, Author) commented

@mikemahoney218-usgs,

I vibe coded this function to return the nearest continuous data given a list of timestamps; it just constructs the filter and passes it through the Water Data API. @ldecicco-USGS suggested this feature could be upstreamed as a new endpoint. How do you feel about that?

thodson-usgs and others added 4 commits April 23, 2026 15:23
For each target timestamp, fetch the nearest continuous observation in
a single round trip. Builds a CQL OR-chain of per-target bracketed
windows, pipes it through ``get_continuous`` (which already auto-chunks
long filters across multiple sub-requests), then selects the single
observation closest to each target client-side.

Exists because the Water Data API matches a single-instant ``time=``
parameter exactly (10:30:31 returns zero rows on a 15-minute gauge),
does not implement ``sortby`` for arbitrary queryables, and does not
expose a ``T_NEAREST`` CQL function. The narrow-window + client-side
reduction is the one pattern that works today for multi-target
nearest lookups in one API call.

Tie handling is configurable via ``on_tie``:
  - "first" (default): keep the earlier observation
  - "last":  keep the later observation
  - "mean":  average numeric columns; set ``time`` to the target

Default ``window="7min30s"`` matches the 15-minute gauge cadence so
most targets' windows contain exactly one observation. Users with
irregular-cadence gauges or known data gaps can widen to "15min" or
"30min" at the cost of more bytes per response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``pandas.Timedelta`` already accepts ``"00:07:30"`` identically to
``"7min30s"``, so no behaviour change is needed — just switch the
default and the docstring examples to the more readable form. Added a
regression test that asserts the two spellings produce the same CQL
filter so future refactors can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the default from "00:07:30" to "PT7M30S" so the user-visible
contract points at an actual international standard (ISO 8601
duration) rather than a pandas-specific colon form.

``pandas.Timedelta`` still accepts all the other forms users may
already have typed — ISO 8601, HH:MM:SS, shorthand ("7min30s",
"450s"), or a ``pd.Timedelta`` directly — and a parametrized test
now exercises each shape to lock in the "whatever ``pd.Timedelta``
takes" contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
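The equivalence the two commits above describe can be spot-checked directly. This is an illustrative check, not the package's own regression test:

```python
import pandas as pd

# All four window spellings mentioned in the commits parse to the same
# 450-second Timedelta, so the default change is purely cosmetic.
spellings = ["PT7M30S", "00:07:30", "7min30s", "450s"]
for s in spellings:
    assert pd.Timedelta(s) == pd.Timedelta(seconds=450), s
print("all equivalent")
```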
Split the helper's body into four private functions so the top-level
flow reads as a short recipe:

  - ``_check_nearest_kwargs``      reject kwargs the helper owns
                                   (``time``/``filter``/``filter_lang``);
                                   validate ``on_tie``
  - ``_build_window_or_filter``    CQL ``OR``-chain of bracketed time
                                   windows, one per target
  - ``_pick_nearest_row``          window → nearest row, with the three
                                   tie-resolution branches isolated
  - ``_empty_nearest_result``      empty frame with a ``target_time``
                                   column, used wherever no match lands

Drops the nested ``for site → for target → mask → tie-branch`` loop in
favor of a flat list-comprehension + walrus against the new helper.
Fixes a fragile ``pd.to_datetime(list(targets), utc=True)`` (a numpy
``datetime64`` array would round-trip through ``list`` as tz-stripped
scalars) — now passes the input directly to ``pd.to_datetime`` and
wraps in ``pd.DatetimeIndex``. Swaps ``df = df.copy(); df["time"] =
...`` for ``df.assign(time=...)`` to avoid the full-frame copy.

Also NEWS.md: add a short entry describing the new helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the add-get-nearest-continuous branch from b20d08e to 2613792 on April 23, 2026 20:25
Calling ``get_nearest_continuous`` with an empty ``targets`` is
almost always a caller bug (an unfiltered frame, a typo, a mis-named
column). The previous code papered over it by firing a trivial-range
HTTP request (``time=1900-01-01/1900-01-01``) purely so the caller
received a real ``BaseMetadata`` object. That pattern wastes a
round-trip on a nonsensical input and hides the bug.

Raise ``ValueError`` on empty ``targets`` instead. Shrinks the body
and makes a caller mistake loud.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs merged commit 33e928d into DOI-USGS:main Apr 23, 2026
8 checks passed
@thodson-usgs thodson-usgs deleted the add-get-nearest-continuous branch April 23, 2026 20:42
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request Apr 24, 2026