Support BigQuery nested STRUCT fields in anomaly tests #1012

Open
tlangton3 wants to merge 2 commits into elementary-data:master from tlangton3:bigquery-nested-struct-support
@tlangton3 tlangton3 commented May 22, 2026

Allows column_anomalies and dimension_anomalies to reference nested STRUCT leaves on BigQuery (e.g. user.address.city) instead of only top-level columns.

A single column-discovery wrapper segment-quotes nested references (`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters are byte-equivalent to today's behaviour. REPEATED ancestors are out of scope (would require UNNEST). test_all_columns_anomalies is unchanged — users opt in by passing column_name=user.address.city explicitly to avoid ballooning the test surface on wide STRUCT schemas.

What changes

  • get_column_obj_and_monitors flattens BigQuery STRUCT columns via BigQueryColumn.flatten() and wraps each discovered column in a dict carrying .name (dotted display form), .quoted (segment-quoted SQL ref), and .safe_alias (dot-free identifier). Top-level STRUCTs are kept alongside their leaves so existing column_name=user behaviour is preserved.
  • column_monitoring_query projects the monitored column as <quoted> as <safe_alias> and references the alias in metric aggregates. select_dimensions_columns applies the same pattern to nested dimensions.
  • dimension_monitoring_query segment-quotes dimension expressions before they are concatenated into dimension_value.

Why two representations

BigQueryColumn.quoted wraps the whole string in one set of backticks, so a flattened nested column's .quoted is `user.address.city` — which BigQuery treats as a single column literally named user.address.city. Even with correct segment-quoting, projecting select user.address.city from t into a CTE without an alias names the resulting column city, losing the path. The wrapper exposes both .quoted (segment-quoted source ref) and .safe_alias (dot-free CTE alias) so the projection-alias pattern composes cleanly and downstream macros stay nesting-agnostic.
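
As a rough illustration of the two representations, here is a Python sketch of the string transformations only. The function names (`whole_quote`, `segment_quote`, `safe_alias`, `wrap_column`) are hypothetical stand-ins, not the Jinja macros in this PR, and the `__` separator is inferred from the `user__address__city` alias shown in the description.

```python
def whole_quote(path: str) -> str:
    """One pair of backticks around the whole string (the problematic form)."""
    return f"`{path}`"


def segment_quote(path: str) -> str:
    """Quote each dot-separated segment on its own: a.b.c -> `a`.`b`.`c`."""
    return ".".join(f"`{segment}`" for segment in path.split("."))


def safe_alias(path: str) -> str:
    """Dot-free identifier usable as a CTE column alias (assumed '__' separator)."""
    return path.replace(".", "__")


def wrap_column(path: str) -> dict:
    """Hypothetical wrapper exposing the display name, SQL ref, and alias."""
    return {
        "name": path,                   # dotted display form, kept for alerts
        "quoted": segment_quote(path),  # segment-quoted source reference
        "safe_alias": safe_alias(path), # dot-free alias for downstream CTEs
    }


wrapped = wrap_column("user.address.city")
print(whole_quote("user.address.city"))  # single identifier containing dots
print(wrapped["quoted"], "as", wrapped["safe_alias"])
```

For a column name without dots, all three fields in this sketch differ only by the backticks in `quoted`; how the real wrapper treats that case is not shown here.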

Testing

Local validation via dbt parse and a run-operation harness confirmed every SQL fingerprint:

  • Segment-quoting: user.address.city → `user`.`address`.`city`
  • Projection: select `user`.`address`.`city` as user__address__city from t
  • Downstream aggregate: coalesce(sum(case when user__address__city is null then 1 else 0 end), 0) as null_count
  • Stored column_name: user.address.city (dotted display preserved for alerts)
  • get_column_data_type BigQuery dispatch works on the wrapped dict via subscript access
  • Non-nested columns / non-BigQuery: byte-equivalent compiled SQL to current behaviour
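
The projection and aggregate fingerprints listed above can be reproduced with plain string formatting. This is only a sketch: the helper names are hypothetical, the table name `t` comes from the example in the description, and the real SQL is produced by the Jinja macros, not by Python.

```python
def segment_quote(path: str) -> str:
    """Quote each dot-separated segment separately."""
    return ".".join(f"`{segment}`" for segment in path.split("."))


def safe_alias(path: str) -> str:
    """Replace dots with double underscores to get a dot-free alias."""
    return path.replace(".", "__")


def projection_sql(path: str, table: str) -> str:
    """Project the nested column under its dot-free alias."""
    return f"select {segment_quote(path)} as {safe_alias(path)} from {table}"


def null_count_sql(path: str) -> str:
    """Downstream aggregate that refers only to the alias, never to the path."""
    alias = safe_alias(path)
    return (
        f"coalesce(sum(case when {alias} is null then 1 else 0 end), 0) "
        "as null_count"
    )


print(projection_sql("user.address.city", "t"))
print(null_count_sql("user.address.city"))
```

Because the aggregate sees only the alias, it does not need to know whether the monitored column was nested.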

End-to-end execution against BigQuery to follow.

Summary by CodeRabbit

  • Bug Fixes

    • Better handling of nested/struct fields in BigQuery so monitors correctly detect and report on dotted/nested column leaf values.
    • Safer column and dimension aliasing to avoid invalid identifiers in monitoring outputs.
  • Refactor

    • Reworked monitor selection and dimension concatenation logic for more reliable results with structured data types and complex naming.

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fef8e3dc-7d12-4c67-a11f-6a460ae68a44

📥 Commits

Reviewing files that changed from the base of the PR and between d45a775 and 8c7b36e.

📒 Files selected for processing (2)
  • macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql
  • macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql

📝 Walkthrough

Adds BigQuery-safe segment quoting, dot-free aliasing, and struct-wrapping helpers, then applies them across column monitor selection, the column monitoring query (projections and metric expressions), and dimension concatenation/bucketing logic.

Changes

BigQuery Nested Field Support via Safe Aliasing

| Layer / File(s) | Summary |
| --- | --- |
| Helper macros for safe BigQuery column handling (macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql) | Adds bq_segment_quote, bq_safe_alias, wrap_column_for_struct_support, plus bq_safe_leaf_names and _bq_walk_collect for STRUCT leaf discovery; overhauls select_dimensions_columns to segment-quote sources and generate dot-free alias suffixes for nested fields. |
| Column monitoring query integration (macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql) | column_monitoring_query now projects monitored columns using column_obj.safe_alias and uses that alias for metric expressions; prefixed_dimensions builds "dimension_*" aliases with bq_safe_alias(). |
| Column monitor configuration wrapping (macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql) | get_column_obj_and_monitors and get_all_column_obj_and_monitors wrap column_obj via wrap_column_for_struct_support before deriving data types and selecting monitors; returned column values are the wrapped objects. |
| Dimension monitoring query updates (macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql) | Builds concatenated dimension expressions using bq_segment_quote per segment; enforces having sum(metric_value) > 0 for training_set_dimensions; adjusts dimensions_buckets join and row-count hydration to the new structure and removes several inline comments. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through dotted fields with glee,
Quoted each segment so queries run free,
Dots turned to underscores, tidy and bright,
Wrapped structs now yield metrics just right,
A small rabbit cheer for safer SQL tonight!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Support BigQuery nested STRUCT fields in anomaly tests' clearly and directly summarizes the main change: enabling nested STRUCT field support in anomaly tests. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions
Contributor

👋 @tlangton3
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

@tlangton3 tlangton3 requested a deployment to elementary_test_env May 22, 2026 10:36 — with GitHub Actions Waiting
@tlangton3
Author

End-to-end validated against a real BigQuery dataset.

  • column_anomalies on a three-level nested STRUCT field (<parent>.<intermediate>.<leaf>) compiles with segment-quoted SQL, executes against real data, and writes a row to data_monitoring_metrics with the dotted column_name preserved.
  • Discovery layer correctly flattens parent STRUCTs via BigQueryColumn.flatten(); the wrapper exposes .name (dotted display), .quoted (segment-quoted SQL ref), and .safe_alias (dot-free CTE alias) as designed.
  • Ran the new nested test alongside 10+ existing non-nested column_anomalies tests in a single dbt test invocation — all 15 PASS with no interference, confirming the projection-alias pattern is backwards-compatible.
  • Re-ran with --defer --favor-state against a prod manifest so the non-nested tests had data and history; metrics for nested and non-nested columns land in data_monitoring_metrics and elementary_test_results with identical schema. The dotted column_name is just a longer string in an otherwise unchanged structure.
  • elementary.on_run_end upload hook works unchanged with the override — metric history persists correctly.

Tested against:

  • dbt-core 1.11.8 / dbt-bigquery 1.11.1
  • elementary package version 0.23.x (this branch)

@tlangton3 tlangton3 marked this pull request as ready for review May 22, 2026 13:51
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql`:
- Around line 10-13: The loop currently excludes only leaves whose own leaf.mode
== 'REPEATED', but needs to exclude any leaf that has a REPEATED ancestor so
downstream UNNESTs aren't missed; change the logic around the col.flatten()
iteration to skip a leaf if any ancestor in its flattened path is REPEATED
(e.g., inspect the leaf's ancestry/path metadata returned by col.flatten() or
augment flatten to return ancestor modes), and only do expanded.append(leaf)
when no ancestor mode == 'REPEATED' (retain the existing reference to
col.flatten(), leaf.mode, and expanded.append in your change).

In `@macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql`:
- Around line 402-423: The macro wrap_column_for_struct_support currently always
includes 'fields': column_obj.fields which breaks non-BigQuery adapters because
dbt's base Column lacks a fields attribute; update the macro to only set the
'fields' key when the attribute exists (e.g. when target.type == 'bigquery' and
column_obj.fields is defined) or use a defined-check (column_obj.fields is
defined) and otherwise omit or set fields to null/empty, ensuring all references
inside the returned dict (name, column, quoted, safe_alias, dtype, data_type,
fields) remain valid for non-BigQuery Column objects.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 281a98fa-e3f9-47ef-b12d-ec7d113d1681

📥 Commits

Reviewing files that changed from the base of the PR and between ab1a10b and d45a775.

📒 Files selected for processing (3)
  • macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql
  • macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql
  • macros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql

Address CodeRabbit findings:

1. `BigQueryColumn.flatten()` discards ancestor modes, so a NULLABLE leaf
   under a REPEATED ancestor still satisfied the previous `leaf.mode !=
   'REPEATED'` filter. Add `bq_safe_leaf_names` + `_bq_walk_collect`, an
   ancestor-aware walker that returns only leaves with no REPEATED
   ancestor in their path. Filter `flatten()` output against this set.

2. `wrap_column_for_struct_support` unconditionally read `column_obj.fields`,
   which raised on non-BigQuery adapters (base `Column` lacks `fields`).
   Guard with `column_obj.fields is defined` and default to an empty list,
   so the wrapper is safe on Snowflake, Postgres, Redshift, etc.
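
The two fixes above can be pictured with a small Python sketch. It is not the Jinja implementation of `bq_safe_leaf_names` / `_bq_walk_collect`: the `Field` class is a hypothetical stand-in for a BigQuery schema field, and `safe_leaf_names` and `fields_or_empty` are illustrative names. Only the `mode` and `fields` attribute names are taken from the review discussion.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical stand-in for a BigQuery schema field."""
    name: str
    mode: str = "NULLABLE"
    fields: list[Field] = field(default_factory=list)


def safe_leaf_names(root: Field) -> set[str]:
    """Dotted paths of leaves with no REPEATED node anywhere on their path."""
    collected: set[str] = set()

    def walk(node: Field, prefix: str) -> None:
        if node.mode == "REPEATED":
            return  # this node and everything below it would need UNNEST
        path = f"{prefix}.{node.name}" if prefix else node.name
        if not node.fields:
            collected.add(path)  # a leaf reached without crossing REPEATED
            return
        for child in node.fields:
            walk(child, path)

    walk(root, "")
    return collected


def fields_or_empty(column_obj: object) -> list:
    """Guarded read: objects without a `fields` attribute yield an empty list."""
    return getattr(column_obj, "fields", None) or []


user = Field("user", fields=[
    Field("address", fields=[Field("city"), Field("zip")]),
    Field("orders", mode="REPEATED", fields=[Field("total")]),  # NULLABLE leaf, REPEATED parent
    Field("tags", mode="REPEATED"),
])
print(sorted(safe_leaf_names(user)))
```

In this sketch `user.orders.total` is excluded even though the leaf itself is NULLABLE, which is the case a leaf-only `mode` check would miss.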
@tlangton3 tlangton3 requested a deployment to elementary_test_env May 22, 2026 14:14 — with GitHub Actions Waiting