Support BigQuery nested STRUCT fields in anomaly tests#1012
Conversation
Allows column_anomalies and dimension_anomalies to reference nested STRUCT leaves on BigQuery (e.g. user.address.city) instead of only top-level columns. A single column-discovery wrapper segment-quotes nested references (`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters are byte-equivalent to today's behaviour. REPEATED ancestors are out of scope (would require UNNEST). test_all_columns_anomalies is unchanged - users opt in by passing column_name=user.address.city explicitly to avoid ballooning the test surface on wide STRUCT schemas.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds BigQuery-safe segment quoting, dot-free aliasing, and struct-wrapping helpers, then applies them across column monitor selection, the column monitoring query (projections and metric expressions), and dimension concatenation/bucketing logic. ChangesBigQuery Nested Field Support via Safe Aliasing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 @tlangton3 |
|
End-to-end validated against a real BigQuery dataset.
Tested against:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sql`:
- Around line 10-13: The loop currently excludes only leaves whose own leaf.mode
== 'REPEATED', but needs to exclude any leaf that has a REPEATED ancestor so
downstream UNNESTs aren't missed; change the logic around the col.flatten()
iteration to skip a leaf if any ancestor in its flattened path is REPEATED
(e.g., inspect the leaf's ancestry/path metadata returned by col.flatten() or
augment flatten to return ancestor modes), and only do expanded.append(leaf)
when no ancestor mode == 'REPEATED' (retain the existing reference to
col.flatten(), leaf.mode, and expanded.append in your change).
In `@macros/edr/data_monitoring/monitors_query/column_monitoring_query.sql`:
- Around line 402-423: The macro wrap_column_for_struct_support currently always
includes 'fields': column_obj.fields which breaks non-BigQuery adapters because
dbt's base Column lacks a fields attribute; update the macro to only set the
'fields' key when the attribute exists (e.g. when target.type == 'bigquery' and
column_obj.fields is defined) or use a defined-check (column_obj.fields is
defined) and otherwise omit or set fields to null/empty, ensuring all references
inside the returned dict (name, column, quoted, safe_alias, dtype, data_type,
fields) remain valid for non-BigQuery Column objects.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 281a98fa-e3f9-47ef-b12d-ec7d113d1681
📒 Files selected for processing (3)
macros/edr/data_monitoring/data_monitors_configuration/get_column_monitors.sqlmacros/edr/data_monitoring/monitors_query/column_monitoring_query.sqlmacros/edr/data_monitoring/monitors_query/dimension_monitoring_query.sql
Address CodeRabbit findings: 1. `BigQueryColumn.flatten()` discards ancestor modes, so a NULLABLE leaf under a REPEATED ancestor still satisfied the previous `leaf.mode != 'REPEATED'` filter. Add `bq_safe_leaf_names` + `_bq_walk_collect`, an ancestor-aware walker that returns only leaves with no REPEATED ancestor in their path. Filter `flatten()` output against this set. 2. `wrap_column_for_struct_support` unconditionally read `column_obj.fields`, which raised on non-BigQuery adapters (base `Column` lacks `fields`). Guard with `column_obj.fields is defined` and default to an empty list, so the wrapper is safe on Snowflake, Postgres, Redshift, etc.
Allows
column_anomaliesanddimension_anomaliesto reference nested STRUCT leaves on BigQuery (e.g.user.address.city) instead of only top-level columns.A single column-discovery wrapper segment-quotes nested references (
`a`.`b`.`c`) and projects the monitored column with a dot-free CTE alias so the path survives into downstream aggregates. Non-nested columns and non-BigQuery adapters are byte-equivalent to today's behaviour. REPEATED ancestors are out of scope (would requireUNNEST).test_all_columns_anomaliesis unchanged — users opt in by passingcolumn_name=user.address.cityexplicitly to avoid ballooning the test surface on wide STRUCT schemas.What changes
get_column_obj_and_monitorsflattens BigQuery STRUCT columns viaBigQueryColumn.flatten()and wraps each discovered column with a dict carrying.name(dotted display form),.quoted(segment-quoted SQL ref), and.safe_alias(dot-free identifier). Top-level STRUCTs are kept alongside their leaves so existingcolumn_name=userbehaviour is preserved.column_monitoring_queryprojects the monitored column as<quoted> as <safe_alias>and references the alias in metric aggregates.select_dimensions_columnsapplies the same pattern to nested dimensions.dimension_monitoring_querysegment-quotes dimension expressions before they are concatenated intodimension_value.Why two representations
BigQueryColumn.quotedwraps the whole string in one set of backticks, so a flattened nested column's.quotedis`user.address.city`— which BigQuery treats as a single column literally nameduser.address.city. Even with correct segment-quoting, projectingselect user.address.city from tinto a CTE without an alias names the resulting columncity, losing the path. The wrapper exposes both.quoted(segment-quoted source ref) and.safe_alias(dot-free CTE alias) so the projection-alias pattern composes cleanly and downstream macros stay nesting-agnostic.Testing
Local validation via
dbt parseand arun-operationharness confirmed every SQL fingerprint:user.address.city→`user`.`address`.`city`select `user`.`address`.`city` as user__address__city from tcoalesce(sum(case when user__address__city is null then 1 else 0 end), 0)asnull_countcolumn_name:user.address.city(dotted display preserved for alerts)get_column_data_typeBigQuery dispatch works on the wrapped dict via subscript accessEnd-to-end execution against BigQuery to follow.
Summary by CodeRabbit
Bug Fixes
Refactor