Check for silent lexicographic comparison against string-typed value#240
Draft
thodson-usgs wants to merge 5 commits intoDOI-USGS:mainfrom
Draft
Check for silent lexicographic comparison against string-typed value#240thodson-usgs wants to merge 5 commits intoDOI-USGS:mainfrom
value#240thodson-usgs wants to merge 5 commits intoDOI-USGS:mainfrom
Conversation
Add ``filter`` and ``filter_lang`` kwargs to the OGC ``waterdata`` getters (``get_continuous``, ``get_daily``, ``get_monitoring_locations``, ``get_time_series_metadata``, ``get_latest_continuous``, ``get_latest_daily``, ``get_field_measurements``, ``get_channel``, and the other collections built on the same plumbing). The value is forwarded verbatim to the server's OGC API ``filter`` query parameter, letting callers use server-side CQL expressions that aren't expressible via the other kwargs — most commonly, OR'ing several disjoint time ranges into a single request. - A new ``FILTER_LANG = Literal["cql-text", "cql-json"]`` type alias is added in ``waterdata.types`` and re-exported from the package. The server accepts ``cql-text`` (default) and ``cql-json`` today; the ``cql2-*`` dialects are not yet supported. - A long top-level ``OR`` chain is transparently split into multiple sub-requests that each fit under the server's URI length limit, and the results are concatenated. Filters without a top-level ``OR`` are sent as a single request unchanged. - ``get_continuous`` gains docstring examples showing both the simple two-window case and the programmatically-built many-window case that exercises the auto-chunking path. - NEWS.md gains a v1.1.0 highlights block covering this change along with the other user-visible additions since release. - ``tests/waterdata_utils_test.py`` grows coverage for filter forwarding, the OR-chunking paths, and error handling.
For each target timestamp, fetch the nearest continuous observation in a single round trip. Builds a CQL OR-chain of per-target bracketed windows, pipes it through ``get_continuous`` (which already auto-chunks long filters across multiple sub-requests), then selects the single observation closest to each target client-side. Exists because the Water Data API matches a single-instant ``time=`` parameter exactly (10:30:31 returns zero rows on a 15-minute gauge), does not implement ``sortby`` for arbitrary queryables, and does not expose a ``T_NEAREST`` CQL function. The narrow-window + client-side reduction is the one pattern that works today for multi-target nearest lookups in one API call. Tie handling is configurable via ``on_tie``: - "first" (default): keep the earlier observation - "last": keep the later observation - "mean": average numeric columns; set ``time`` to the target
559b466 to
68d1a81
Compare
``get_nearest_continuous`` is built entirely on top of the CQL
``filter`` passthrough (it constructs an OR-chain and calls
``get_continuous(..., filter=..., filter_lang='cql-text')``), so it
has no meaning without the filter feature. If the filter feature is
ever rolled back, this function rolls back with it.
Putting the function + its four private helpers in a dedicated
``dataretrieval/waterdata/nearest.py`` sibling of ``filters.py``
makes that linked rollback clean:
rm filters.py nearest.py
rm tests/waterdata_filters_test.py tests/waterdata_nearest_test.py
plus trimming two re-export lines in ``__init__.py`` and the filter
kwargs in ``api.py``. Versus the previous layout, that replaces
"find and excise a 232-line cluster across two blocks inside
``api.py``" with "delete a file."
Same public surface (``waterdata.get_nearest_continuous`` still
imported from the package root); mock paths in
``tests/waterdata_nearest_test.py`` updated to ``dataretrieval
.waterdata.nearest.get_continuous`` since ``nearest.py`` does
``from .api import get_continuous`` and mocks need to patch where
the name is looked up.
All 176 non-live tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running the rollback proposal literally against a disposable working tree showed the documented "rollback surface" in ``filters.py`` understated the real cost by ~4× — specifically, api.py was carrying ~120 lines of filter-specific docstring text that would also have to be removed. Three changes to close the gap: - Replace the 8 × 11-line ``filter`` / ``filter_lang`` kwarg docstring stanzas with 5-line pointers to ``dataretrieval.waterdata.filters`` (one "see there for grammar / chunking / pitfall"). Users clicking through their IDE still get a useful reference; the authoritative prose lives in one place (the filters module docstring) rather than being duplicated eight times. Saves ~48 lines of api.py and makes each stanza disposable in 5 lines flat. - Trim ``get_continuous``'s inline filter examples from two extended blocks (34 lines, showing both a hand-rolled OR pair and a programmatic chain) to a single compact example (14 lines, the two-window case). The second example was pedagogically redundant with what the filters module docstring covers. Saves ~20 lines. - Update the "Isolation contract" in ``filters.py``'s module docstring to accurately enumerate the rollback surface: include ``nearest.py`` + its test file, the ``__all__`` entries in ``__init__.py``, the docstring pointers (now trimmed), and the one remaining filter example. Acknowledges the two-line ``filter_lang`` → ``filter-lang`` URL translation as optional dead-code removal. No behaviour change. All 176 non-live tests still pass. Measured rollback surface after this commit: ~70 lines of surgery across 3 source files + 4 file deletions, vs. ~150 lines before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary (draft — for discussion)
Every queryable field on every OGC collection in the Water Data API is
type=stringserver-side. So any unquoted numeric comparison in a CQL-text filter —value >= 1000,parameter_code = 60,district_code = 1— isn't a valid numeric comparison: empirically the server returns HTTP 500 Internal Server Error. Even if the user does quote the literal, the comparison is lexicographic (e.g.value > '10'returns rows wherevalue='34.52'because first-char'3'>'1'), and zero-padded codes likeparameter_code = '60'silently match nothing because the real values are'00060'-shaped.Either way, the user's intent was numeric, they get either 500 or silent wrong rows, and the failure is opaque. This PR catches the pattern client-side and raises a clear
ValueErrorbefore the request fires.Motivation
Flagged during review of DOI-USGS/dataRetrieval#880 (R side), where
ldecicco-USGSpushed back on exposing a genericfilterkwarg:This is the smallest change that addresses her concern without removing the
filterfeature.What the check does
Runs once per cql-text filter, inside
_plan_filter_chunks, before any HTTP traffic:Scope: universal, not a watchlist
Every queryable property across every OGC endpoint is
type=string— confirmed empirically:continuousvalue,parameter_code,statistic_iddailyvalue,parameter_code,statistic_idfield-measurementsvalue,parameter_codelatest-continuousvalue,parameter_code,statistic_idlatest-dailyvalue,parameter_code,statistic_idtime-series-metadataparameter_code,statistic_id,hydrologic_unit_codemonitoring-locationsmonitoring_location_number,district_code,state_code,county_codechannel-measurementsmeasurement_number,channel_flow,channel_width,channel_area,channel_velocitySince there's no such thing as a legitimate numeric comparison on this API, the regex flags any
<identifier> <op> <unquoted numeric literal>(or the reverse), regardless of field. Quoted literals (value >= '1000') are not flagged — the caller has signalled they want sort-order semantics.Live evidence
Test plan
ruff check/ruff format --checkpass.pytest tests/waterdata_utils_test.py— 66/66 pass (32 prior + 34 new). The new tests cover:>=,>,<=,<,=,!=) × both orderings (x OP NandN OP x), floats, negatives, multiple real-world fields (value, parameter_code, statistic_id, district_code, county_code, hydrologic_unit_code, channel_flow, channel_velocity), and nested in AND/OR expressions.INlists, and false-positive guards (identifiers appearing only inside quoted string literals likename = 'see district_code = 1 in docs').get_continuousbefore_construct_api_requestsis ever called (mock-verified).USGS-02238500/continuous— unquoted numeric RHS consistently returns 500; quoted literal returns 200 with lex-sorted results.Open for discussion
_plan_filter_chunksalongside the other filter validation (chunkability, URL-budget). Could hoist to theget_ogc_dataentry.CAST(value AS FLOAT) > 1000): the regex is scoped to simple\b<ident>\b \s* <op> \s* <num>patterns and — empirically — the server doesn't support CAST or CQL2 functions on these endpoints anyway, so the false-positive rate should be near zero in practice.Marked draft so it can ride along with the R-side discussion before landing.
🤖 Generated with Claude Code