
Add CQL filter passthrough to OGC waterdata functions #880

Closed
thodson-usgs wants to merge 3 commits into DOI-USGS:develop from thodson-usgs:feat/cql-filter-passthrough

Conversation

@thodson-usgs

Summary

Every OGC read_waterdata_* function (read_waterdata_continuous, read_waterdata_daily, read_waterdata_field_measurements, read_waterdata_monitoring_location, read_waterdata_ts_meta, read_waterdata_latest_continuous, read_waterdata_latest_daily, read_waterdata_channel) now accepts filter and filter_lang arguments that are forwarded as the OGC filter / filter-lang query parameters. The R argument filter_lang is translated to the hyphenated filter-lang URL parameter that the service expects (hyphens aren't valid in R argument names).
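For illustration, a minimal sketch of that rename (not the package's actual code; the helper here is hypothetical):

# Hypothetical helper: collect the query parameters under the R-safe
# argument name, emitting the hyphenated key the service expects.
build_filter_params <- function(filter = NULL, filter_lang = NULL) {
  params <- list(filter = filter, `filter-lang` = filter_lang)
  params[!vapply(params, is.null, logical(1))]  # drop unset parameters
}

build_filter_params(filter = "time > '2023-01-01T00:00:00Z'",
                    filter_lang = "cql-text")
#> $filter
#> [1] "time > '2023-01-01T00:00:00Z'"
#>
#> $`filter-lang`
#> [1] "cql-text"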

When a filter is a top-level OR chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates the results, deduplicated by id. This keeps the common multi-interval use case out of the caller's way — they don't need to know about the server's 414 boundary.

This mirrors the Python companion PR: DOI-USGS/dataretrieval-python#238.

Motivation

The OGC time parameter accepts a single instant, a single bounded interval, or a half-bounded interval — it does not accept a list of intervals. For workflows that need to pull short windows of continuous data around many field-measurement timestamps (e.g., pairing discrete discharge measurements with the index velocity at the time of each measurement), the existing client requires one HTTP round-trip per window.

The waterdata OGC API already supports a filter query parameter with CQL OR-expressions, but this isn't currently exposed through the R client's signatures. This PR threads the passthrough through:

df <- read_waterdata_continuous(
  monitoring_location_id = "USGS-07374525",
  parameter_code = "72255",
  filter = paste0(
    "(time >= '2023-01-06T16:00:00Z' AND time <= '2023-01-06T18:00:00Z') ",
    "OR (time >= '2023-01-10T18:00:00Z' AND time <= '2023-01-10T20:00:00Z')"
  ),
  filter_lang = "cql-text"
)

Long OR chains are handled for the caller:

# 200 windows, ~14 KB of filter text — would be HTTP 414 as a single GET
df <- read_waterdata_continuous(..., filter = paste(many_between_clauses, collapse = " OR "))
# → splits into sub-requests under the hood; results are concatenated
#   and deduplicated by id
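For reference, one hypothetical way to build many_between_clauses from paired window bounds (the variable name comes from the snippet above; the construction is illustrative):

# Illustrative: 2-hour windows around each field-measurement time
mtimes <- as.POSIXct(c("2023-01-06 17:00", "2023-01-10 19:00"), tz = "UTC")
starts <- mtimes - 3600
ends   <- mtimes + 3600
fmt <- function(x) format(x, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
many_between_clauses <- sprintf("(time >= '%s' AND time <= '%s')",
                                fmt(starts), fmt(ends))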

Chunking behavior

  • Only top-level OR chains are split. The splitter is paren- and quote-aware, so OR inside sub-expressions like (A OR B) or string literals like 'foo OR bar' is preserved (see the sketch after this list).
  • If the expression has no top-level OR, or any single clause already exceeds the budget, the filter is sent as-is (server decides) rather than being mangled.
  • Per-chunk results are concatenated and deduplicated by the service's output id (continuous_id, daily_id, etc.) so overlapping user-supplied OR clauses combine losslessly.
  • The budget constant (.CQL_FILTER_CHUNK_LEN = 5000) is private and conservative; the continuous endpoint has been observed to return HTTP 414 around ~7 KB of filter text.
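A rough sketch of those semantics (the helper names match the Changes list below; the bodies are illustrative, not the PR's code):

# Split a cql-text expression on top-level " OR ", tracking paren depth
# and single-quoted string literals so nested ORs are preserved.
split_top_level_or <- function(expr) {
  depth <- 0L; in_str <- FALSE
  clauses <- character(0); start <- 1L; i <- 1L
  while (i <= nchar(expr)) {
    ch <- substr(expr, i, i)
    if (in_str) {
      if (ch == "'") in_str <- FALSE   # a doubled '' simply toggles twice
    } else if (ch == "'") {
      in_str <- TRUE
    } else if (ch == "(") {
      depth <- depth + 1L
    } else if (ch == ")") {
      depth <- depth - 1L
    } else if (depth == 0L && toupper(substr(expr, i, i + 3)) == " OR ") {
      clauses <- c(clauses, trimws(substr(expr, start, i - 1)))
      start <- i + 4L
      i <- i + 3L
    }
    i <- i + 1L
  }
  c(clauses, trimws(substr(expr, start, nchar(expr))))
}

# Greedily pack clauses into chunks whose re-joined text stays under budget.
chunk_cql_or <- function(clauses, budget = 5000L) {
  chunks <- character(0); current <- character(0)
  for (cl in clauses) {
    candidate <- paste(c(current, cl), collapse = " OR ")
    if (length(current) > 0 && nchar(candidate) > budget) {
      chunks <- c(chunks, paste(current, collapse = " OR "))
      current <- cl
    } else {
      current <- c(current, cl)
    }
  }
  c(chunks, paste(current, collapse = " OR "))
}

split_top_level_or("(A OR B) OR name = 'foo OR bar' OR C")
#> [1] "(A OR B)"            "name = 'foo OR bar'" "C"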

Caveats

  • The server currently accepts cql-text (default) and cql-json; cql2-text / cql2-json return 400 Invalid filter language.

Changes

  • R/construct_api_requests.R — translates filter_lang to the hyphenated filter-lang URL key and adds filter / filter-lang to the single_params list.
  • R/get_ogc_data.R — adds private split_top_level_or and chunk_cql_or helpers, and fans a long filter into per-chunk sub-requests when needed, concatenating and deduping results by output id.
  • R/read_waterdata_{continuous,daily,field_measurements,monitoring_location,ts_meta,latest_continuous,latest_daily,channel}.R — add filter and filter_lang arguments with documentation.
  • tests/testthat/tests_userFriendly_fxns.R — adds non-network unit tests for the passthrough, hyphenation, splitter/chunker semantics.
  • NEWS — short announcement.

Test plan

  • NOT_CRAN=true API_USGS_PAT=… Rscript -e 'devtools::test()' — 303/303 pass (includes ~9 new tests for filter/filter_lang/split/chunk).
  • Rscript -e 'devtools::check(vignettes = FALSE, args = c("--no-tests", "--no-examples", "--no-manual"))' — 0 errors, 0 warnings, 0 notes related to these changes.
  • Live end-to-end smoke test against a real site with a long OR filter is easy to run but was not rerun for this PR — reviewer should verify against their workflow of interest.

Marked as draft pending maintainer review.

🤖 Generated with Claude Code

Every OGC read_waterdata_* function (continuous, daily, field_measurements,
monitoring_location, ts_meta, latest_continuous, latest_daily, channel) now
accepts `filter` and `filter_lang` arguments that are forwarded as the
OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang`
is translated to the hyphenated `filter-lang` URL parameter that the
service expects.

When a filter is a top-level OR chain that exceeds a conservative
URI-length budget (5 KB), the library transparently splits it into
multiple sub-requests and concatenates (and deduplicates) the results.
This keeps the common multi-interval use case out of the caller's way --
they don't need to know about the server's 414 boundary.

Mirrors dataretrieval-python PR DOI-USGS#238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Author

@ldecicco-USGS, would you give some high-level feedback on how we could expose waterdata filters through the dataretrieval API? Feel free to review more of the code. It's AI generated, so you might start with a quick pass and we'll use your feedback to steer the bot. Feel free to ask questions here or iterate with your own bot locally. A couple of iterations of this might get it to a shippable state. Then I'll take that feedback back to the Python implementation.

…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):

- Skip chunking when `filter_lang` is not `cql-text`. The splitter
  assumes cql-text syntax (including single-quote escaping) and would
  corrupt cql-json, so non-cql-text filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit
  (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414
  cliff of ~8,200 bytes) rather than a fixed raw filter length.
  `effective_filter_budget` probes the non-filter URL, subtracts, and
  converts back to raw CQL bytes using the max per-clause encoding
  ratio, with the " OR " joiner included: in R's percent-encoding the
  joiner inflates 2x, heavier than typical clause ratios, and the
  previous clause-only max let chunks overflow the URL cap. (A sketch
  follows this commit message.)
- When the non-filter URL already exceeds the byte limit, return a
  budget larger than the filter so it passes through unchanged — one
  clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and
  into the post-transform branch, so the probe sees the real request
  args. Collect raw frames, drop empty ones before `rbind` (a plain
  empty frame first would downgrade a later sf result and drop
  geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for doubled single-quote CQL escape, the URL
  byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on
  `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
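To make the budget arithmetic above concrete, a hedged sketch (the constant and helper name follow the commit text; the body is illustrative, not the PR's code):

# Raw-CQL byte budget: start from the server's URL byte cap, subtract
# the probed length of the URL without a filter, then deflate by the
# worst-case percent-encoding ratio across the clauses and the " OR "
# joiner (" OR " encodes to "%20OR%20", a 2x inflation).
.WATERDATA_URL_BYTE_LIMIT <- 8000L

effective_filter_budget <- function(base_url_bytes, clauses) {
  enc_ratio <- function(x) {
    nchar(utils::URLencode(x, reserved = TRUE)) / nchar(x)
  }
  worst <- max(vapply(c(clauses, " OR "), enc_ratio, numeric(1)))
  remaining <- .WATERDATA_URL_BYTE_LIMIT - base_url_bytes
  if (remaining <= 0) {
    return(Inf)  # URL already over the cap: pass the filter through,
                 # one clear 414 beats N failing sub-requests
  }
  floor(remaining / worst)
}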
Mirrors the helper organization in the merged Python PR
(DOI-USGS/dataretrieval-python#238) so the per-language
implementations stay easy to read alongside each other.

The single-vs-fanned distinction is now expressed once, in
`plan_filter_chunks`, which always returns a list of "chunk
overrides" -- `list(NULL)` for "send `args` as-is", or a list of
chunked cql-text expressions otherwise. `fetch_chunks` issues one
request per entry and returns the per-chunk frames plus the first
sub-request (for the `request` attribute). `combine_chunk_frames`
handles the empty-frame and dedup-by-`id` cases.

`get_ogc_data` is now a linear pipeline:

    chunks   <- plan_filter_chunks(args)
    fetched  <- fetch_chunks(args, chunks)
    return_list <- combine_chunk_frames(fetched$frames)
    req      <- fetched$req
    ... post-processing ...

Behavior unchanged: same chunk sizing (URL-byte-budget aware),
same cql-text-only guard, same empty-frame and id-dedup handling.
The only observable difference is that the `request` attribute
now points at the first sub-request instead of the last (matching
Python's choice of representative metadata), which is a
debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
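An illustrative version of the combine step described above (the helper name matches the commit text; the body is a sketch):

# Drop empty per-chunk frames before rbind (a plain empty data.frame
# first would downgrade a later sf result and drop geometry/CRS), then
# deduplicate rows on the pre-rename feature "id" column.
combine_chunk_frames <- function(frames) {
  frames <- frames[vapply(frames, nrow, integer(1)) > 0L]
  if (length(frames) == 0L) return(data.frame())
  combined <- do.call(rbind, frames)
  combined[!duplicated(combined$id), ]
}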
@ldecicco-USGS
Collaborator

There's a ton of overlap between this and #879.
For most of the data-centric functions (daily, continuous, field_measurements) I don't think the "filter" argument would add anything. For instance, all the time arguments are already really flexible. The rest of the properties are characters, so you can't do things like filter=value>=1000 (since value is a character).

I'll consider how the filter argument could be used in monitoring_locations, ts_meta, combine... but again, because the chunking is being added to the other PR we're waiting on, I'm going to close this one.

@thodson-usgs
Author

thodson-usgs commented Apr 24, 2026

I didn't think the time arguments would accept multi-window requests, which the filter allows (e.g., get_nearest_continuous sorts of requests). Is that wrong?

@ldecicco-USGS
Collaborator

So this was my thought process:
If I saw a generic "filter" argument, my first thought would be SWEET, I want to answer all these interesting questions about the data. So I'll set `filter=value > 1000`. The problem with that is it works, but since value is a character, it's filtering all the values that are alphabetically above "1000" (like "12"). Every other property (again... in the data functions; I'm still debating the metadata-type functions) is also a character and will have the same issue.

My gut says there would be way more people trying stuff like that and then either getting the wrong results unknowingly or complaining that dataRetrieval is broken, versus those who would use the filter for time windows.
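The pitfall is easy to demonstrate at the R console (editor's illustration):

# Character comparison is lexicographic, not numeric:
"12" > "1000"
#> [1] TRUE
as.numeric("12") > as.numeric("1000")
#> [1] FALSE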

Users can pass custom CQL2 into the read_waterdata function like this example:

# A wildcard in CQL2 is %
# Here's how to get HUCs that fall within 02070010
cql_huc_wildcard <- '{
"op": "like",
"args": [
  { "property": "hydrologic_unit_code" },
  "02070010%"
]
}'

what_huc_sites <- read_waterdata(service = "monitoring-locations",
                                 CQL = cql_huc_wildcard)

So we're not prohibiting a complex time window.

I'm more inclined to set up an article, or expand this one:
https://doi-usgs.github.io/dataRetrieval/articles/join_by_closest.html
that shows techniques users have for doing the same work you are proposing... but I'll mull it over more, probably by updating that article and seeing where the most pain lies first.

