
Add CQL filter passthrough to OGC waterdata functions #880

Closed
thodson-usgs wants to merge 3 commits into DOI-USGS:develop from thodson-usgs:feat/cql-filter-passthrough

Conversation

@thodson-usgs

Summary

Every OGC read_waterdata_* function (read_waterdata_continuous, read_waterdata_daily, read_waterdata_field_measurements, read_waterdata_monitoring_location, read_waterdata_ts_meta, read_waterdata_latest_continuous, read_waterdata_latest_daily, read_waterdata_channel) now accepts filter and filter_lang arguments that are forwarded as the OGC filter / filter-lang query parameters. The R argument filter_lang is translated to the hyphenated filter-lang URL parameter that the service expects (hyphens aren't valid in R argument names).
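For illustration, a minimal sketch of that rename (not the package's actual code; the helper here is hypothetical):

# Hypothetical helper: collect the query parameters under the R-safe
# argument name, emitting the hyphenated key the service expects.
build_filter_params <- function(filter = NULL, filter_lang = NULL) {
  params <- list(filter = filter, `filter-lang` = filter_lang)
  params[!vapply(params, is.null, logical(1))]  # drop unset parameters
}

build_filter_params(filter = "time > '2023-01-01T00:00:00Z'",
                    filter_lang = "cql-text")
#> $filter
#> [1] "time > '2023-01-01T00:00:00Z'"
#>
#> $`filter-lang`
#> [1] "cql-text"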

When a filter is a top-level OR chain that exceeds a conservative URI-length budget (5 KB), the library transparently splits it into multiple sub-requests and concatenates the results, deduplicated by id. This keeps the common multi-interval use case out of the caller's way — they don't need to know about the server's 414 boundary.

This mirrors the Python companion PR: DOI-USGS/dataretrieval-python#238.

Motivation

The OGC time parameter accepts a single instant, a single bounded interval, or a half-bounded interval — it does not accept a list of intervals. For workflows that need to pull short windows of continuous data around many field-measurement timestamps (e.g., pairing discrete discharge measurements with the index velocity at the time of each measurement), the existing client requires one HTTP round-trip per window.

The waterdata OGC API already supports a filter query parameter with CQL OR-expressions, but this isn't currently exposed through the R client's signatures. This PR threads the passthrough through:

df <- read_waterdata_continuous(
  monitoring_location_id = "USGS-07374525",
  parameter_code = "72255",
  filter = paste0(
    "(time >= '2023-01-06T16:00:00Z' AND time <= '2023-01-06T18:00:00Z') ",
    "OR (time >= '2023-01-10T18:00:00Z' AND time <= '2023-01-10T20:00:00Z')"
  ),
  filter_lang = "cql-text"
)

Long OR chains are handled for the caller:

# 200 windows, ~14 KB of filter text — would be HTTP 414 as a single GET
df <- read_waterdata_continuous(..., filter = paste(many_between_clauses, collapse = " OR "))
# → splits into sub-requests under the hood; results are concatenated
#   and deduplicated by id
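For reference, one hypothetical way to build many_between_clauses from paired window bounds (the variable name comes from the snippet above; the construction is illustrative):

# Illustrative: 2-hour windows around each field-measurement time
mtimes <- as.POSIXct(c("2023-01-06 17:00", "2023-01-10 19:00"), tz = "UTC")
starts <- mtimes - 3600
ends   <- mtimes + 3600
fmt <- function(x) format(x, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
many_between_clauses <- sprintf("(time >= '%s' AND time <= '%s')",
                                fmt(starts), fmt(ends))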

Chunking behavior

  • Only top-level OR chains are split. The splitter is paren- and quote-aware, so OR inside sub-expressions like (A OR B) or string literals like 'foo OR bar' is preserved (see the sketch after this list).
  • If the expression has no top-level OR, or any single clause already exceeds the budget, the filter is sent as-is (server decides) rather than being mangled.
  • Per-chunk results are concatenated and deduplicated by the service's output id (continuous_id, daily_id, etc.) so overlapping user-supplied OR clauses combine losslessly.
  • The budget constant (.CQL_FILTER_CHUNK_LEN = 5000) is private and conservative; the continuous endpoint has been observed to return HTTP 414 around ~7 KB of filter text.
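A rough sketch of those semantics (the helper names match the Changes list below; the bodies are illustrative, not the PR's code):

# Split a cql-text expression on top-level " OR ", tracking paren depth
# and single-quoted string literals so nested ORs are preserved.
split_top_level_or <- function(expr) {
  depth <- 0L; in_str <- FALSE
  clauses <- character(0); start <- 1L; i <- 1L
  while (i <= nchar(expr)) {
    ch <- substr(expr, i, i)
    if (in_str) {
      if (ch == "'") in_str <- FALSE   # a doubled '' simply toggles twice
    } else if (ch == "'") {
      in_str <- TRUE
    } else if (ch == "(") {
      depth <- depth + 1L
    } else if (ch == ")") {
      depth <- depth - 1L
    } else if (depth == 0L && toupper(substr(expr, i, i + 3)) == " OR ") {
      clauses <- c(clauses, trimws(substr(expr, start, i - 1)))
      start <- i + 4L
      i <- i + 3L
    }
    i <- i + 1L
  }
  c(clauses, trimws(substr(expr, start, nchar(expr))))
}

# Greedily pack clauses into chunks whose re-joined text stays under budget.
chunk_cql_or <- function(clauses, budget = 5000L) {
  chunks <- character(0); current <- character(0)
  for (cl in clauses) {
    candidate <- paste(c(current, cl), collapse = " OR ")
    if (length(current) > 0 && nchar(candidate) > budget) {
      chunks <- c(chunks, paste(current, collapse = " OR "))
      current <- cl
    } else {
      current <- c(current, cl)
    }
  }
  c(chunks, paste(current, collapse = " OR "))
}

split_top_level_or("(A OR B) OR name = 'foo OR bar' OR C")
#> [1] "(A OR B)"            "name = 'foo OR bar'" "C"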

Caveats

  • The server currently accepts cql-text (default) and cql-json; cql2-text / cql2-json return 400 Invalid filter language.

Changes

  • R/construct_api_requests.R — translates filter_lang to the hyphenated filter-lang URL key and adds filter / filter-lang to the single_params list.
  • R/get_ogc_data.R — adds private split_top_level_or and chunk_cql_or helpers, and fans a long filter into per-chunk sub-requests when needed, concatenating and deduping results by output id.
  • R/read_waterdata_{continuous,daily,field_measurements,monitoring_location,ts_meta,latest_continuous,latest_daily,channel}.R — add filter and filter_lang arguments with documentation.
  • tests/testthat/tests_userFriendly_fxns.R — adds non-network unit tests for the passthrough, hyphenation, splitter/chunker semantics.
  • NEWS — short announcement.

Test plan

  • NOT_CRAN=true API_USGS_PAT=… Rscript -e 'devtools::test()' — 303/303 pass (includes ~9 new tests for filter/filter_lang/split/chunk).
  • Rscript -e 'devtools::check(vignettes = FALSE, args = c("--no-tests", "--no-examples", "--no-manual"))' — 0 errors, 0 warnings, 0 notes related to these changes.
  • Live end-to-end smoke test against a real site with a long OR filter is easy to run but was not rerun for this PR — reviewer should verify against their workflow of interest.

Marked as draft pending maintainer review.

🤖 Generated with Claude Code

Every OGC read_waterdata_* function (continuous, daily, field_measurements,
monitoring_location, ts_meta, latest_continuous, latest_daily, channel) now
accepts `filter` and `filter_lang` arguments that are forwarded as the
OGC `filter` / `filter-lang` query parameters. The R argument `filter_lang`
is translated to the hyphenated `filter-lang` URL parameter that the
service expects.

When a filter is a top-level OR chain that exceeds a conservative
URI-length budget (5 KB), the library transparently splits it into
multiple sub-requests and concatenates (and deduplicates) the results.
This keeps the common multi-interval use case out of the caller's way --
they don't need to know about the server's 414 boundary.

Mirrors dataretrieval-python PR DOI-USGS#238.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Author

@ldecicco-USGS, would you give some high-level feedback on how we could expose waterdata filters through the dataretrieval API? Feel free to review more of the code. It's AI generated, so you might start with a quick pass and we'll use your feedback to steer the bot. Feel free to ask questions here or iterate with your own bot locally. A couple of iterations of this might get it to a shippable state. Then I'll take that feedback back to the Python implementation.

…rame handling

Addresses feedback on the companion Python PR (DOI-USGS/dataretrieval-python#238):

- Skip chunking when `filter_lang` is not `cql-text`. The splitter
  assumes cql-text syntax (including single-quote escaping) and would
  corrupt cql-json, so non-cql-text filters are now forwarded as-is.
- Budget each chunk against the server's URL byte limit
  (`.WATERDATA_URL_BYTE_LIMIT = 8000`, matching the observed HTTP 414
  cliff of ~8,200 bytes) rather than a fixed raw filter length.
  `effective_filter_budget` probes the non-filter URL, subtracts, and
  converts back to raw CQL bytes using the max per-clause encoding
  ratio, with the " OR " joiner included: in R's percent-encoding the
  joiner inflates 2x, heavier than typical clause ratios, and the
  previous clause-only max let chunks overflow the URL cap. (A sketch
  follows this commit message.)
- When the non-filter URL already exceeds the byte limit, return a
  budget larger than the filter so it passes through unchanged — one
  clear 414 is better feedback than N failing sub-requests.
- Move filter chunking out of the recursive `get_ogc_data` path and
  into the post-transform branch, so the probe sees the real request
  args. Collect raw frames, drop empty ones before `rbind` (a plain
  empty frame first would downgrade a later sf result and drop
  geometry/CRS), and dedup on the pre-rename feature `id`.
- Add regression tests for doubled single-quote CQL escape, the URL
  byte budget guarantee, and non-cql-text pass-through.
- Document CQL filter usage with two examples on
  `read_waterdata_continuous`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
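To make the budget arithmetic above concrete, a hedged sketch (the constant and helper name follow the commit text; the body is illustrative, not the PR's code):

# Raw-CQL byte budget: start from the server's URL byte cap, subtract
# the probed length of the URL without a filter, then deflate by the
# worst-case percent-encoding ratio across the clauses and the " OR "
# joiner (" OR " encodes to "%20OR%20", a 2x inflation).
.WATERDATA_URL_BYTE_LIMIT <- 8000L

effective_filter_budget <- function(base_url_bytes, clauses) {
  enc_ratio <- function(x) {
    nchar(utils::URLencode(x, reserved = TRUE)) / nchar(x)
  }
  worst <- max(vapply(c(clauses, " OR "), enc_ratio, numeric(1)))
  remaining <- .WATERDATA_URL_BYTE_LIMIT - base_url_bytes
  if (remaining <= 0) {
    return(Inf)  # URL already over the cap: pass the filter through,
                 # one clear 414 beats N failing sub-requests
  }
  floor(remaining / worst)
}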
Mirrors the helper organization in the merged Python PR
(DOI-USGS/dataretrieval-python#238) so the per-language
implementations stay easy to read alongside each other.

The single-vs-fanned distinction is now expressed once, in
`plan_filter_chunks`, which always returns a list of "chunk
overrides" -- `list(NULL)` for "send `args` as-is", or a list of
chunked cql-text expressions otherwise. `fetch_chunks` issues one
request per entry and returns the per-chunk frames plus the first
sub-request (for the `request` attribute). `combine_chunk_frames`
handles the empty-frame and dedup-by-`id` cases.

`get_ogc_data` is now a linear pipeline:

    chunks   <- plan_filter_chunks(args)
    fetched  <- fetch_chunks(args, chunks)
    return_list <- combine_chunk_frames(fetched$frames)
    req      <- fetched$req
    ... post-processing ...

Behavior unchanged: same chunk sizing (URL-byte-budget aware),
same cql-text-only guard, same empty-frame and id-dedup handling.
The only observable difference is that the `request` attribute
now points at the first sub-request instead of the last (matching
Python's choice of representative metadata), which is a
debugging-only change for the chunked path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
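An illustrative version of the combine step described above (the helper name matches the commit text; the body is a sketch):

# Drop empty per-chunk frames before rbind (a plain empty data.frame
# first would downgrade a later sf result and drop geometry/CRS), then
# deduplicate rows on the pre-rename feature "id" column.
combine_chunk_frames <- function(frames) {
  frames <- frames[vapply(frames, nrow, integer(1)) > 0L]
  if (length(frames) == 0L) return(data.frame())
  combined <- do.call(rbind, frames)
  combined[!duplicated(combined$id), ]
}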
@ldecicco-USGS
Collaborator

There's a ton of overlap between this and #879.
For most of the data-centric functions (daily, continuous, field_measurements) I don't think the "filter" argument would add anything. For instance, all the time arguments are already really flexible. The rest of the properties are characters, so you can't do things like filter=value>=1000 (since value is a character).

I'll consider how the filter argument could be used in monitoring_locations, ts_meta, combine... but again, because the chunking is being added to the other PR we're waiting on, I'm going to close this one.

@thodson-usgs
Author

thodson-usgs commented Apr 24, 2026

I didn't think the time arguments would accept multi-window requests, which the filter allows (e.g., get_nearest_continuous sorts of requests). Is that wrong?

@ldecicco-USGS
Collaborator

So this was my thought process:
If I saw a generic "filter" argument, my first thought would be SWEET, I want to answer all these interesting questions about the data. So I'll set `filter=value > 1000`. The problem with that is it works, but since value is a character, it's filtering all the values that are alphabetically above "1000" (like "12"). Every other property (again... in the data functions; I'm still debating the metadata-type functions) is also a character and will have the same issue.

My gut says there would be way more people trying stuff like that and then either getting the wrong results unknowingly or complaining that dataRetrieval is broken, versus those who would use the filter for time windows.
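The pitfall is easy to demonstrate at the R console (editor's illustration):

# Character comparison is lexicographic, not numeric:
"12" > "1000"
#> [1] TRUE
as.numeric("12") > as.numeric("1000")
#> [1] FALSE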

Users can pass custom CQL2 into the read_waterdata function like this example:

# A wildcard in CQL2 is %
# Here's how to get HUCs that fall within 02070010
cql_huc_wildcard <- '{
"op": "like",
"args": [
  { "property": "hydrologic_unit_code" },
  "02070010%"
]
}'

what_huc_sites <- read_waterdata(service = "monitoring-locations",
                                 CQL = cql_huc_wildcard)

So we're not prohibiting a complex time window.

I'm more inclined to set up an article, or expand this one:
https://doi-usgs.github.io/dataRetrieval/articles/join_by_closest.html
that shows techniques users have for doing the same work you are proposing... but I'll mull it over more, probably by updating that article and seeing where the most pain lies first.

