Skip to content

fix: replace default-allow cast unwrap with closed-by-default allowlist (timestamp + integer widening)#21908

Open
discord9 wants to merge 20 commits into
apache:mainfrom
discord9:fix/lossy-unwrap-cast-comparisons
Open

fix: replace default-allow cast unwrap with closed-by-default allowlist (timestamp + integer widening)#21908
discord9 wants to merge 20 commits into
apache:mainfrom
discord9:fix/lossy-unwrap-cast-comparisons

Conversation

@discord9
Copy link
Copy Markdown
Contributor

@discord9 discord9 commented Apr 29, 2026

Which issue does this PR close?

Closes #22142.

Rationale for this change

The previous unwrap_cast optimization was default-allow: it would move the cast from the
expression side to the literal side for virtually any CAST(column AS T) <op> literal_T pattern.
This is semantically unsafe for many casts — the rewritten predicate can match different rows
than the original.

This PR replaces the default-allow model with a closed-by-default allowlist. Only cast
patterns proven to preserve comparison semantics are eligible. The literal must also
round-trip exactly through the source type and back.

Allowlist (8 families)

Family Allowed Blocked
Timestamp Same-timezone precision widening Cross-timezone, precision narrowing 1
Integer Same-sign widening, unsigned→larger signed, identity Narrowing, signed→unsigned
Binary FixedSizeBinary→Binary, identity Narrowing
Int→String Eq/NotEq/IsDistinctFrom only String→Int, Lt/Gt (string ordering ≠ int ordering)
Dict/String Any direction within Utf8/LargeUtf8/Utf8View/Dictionary
Decimal Widening: p₂−s₂ ≥ p₁−s₁ ∧ s₂ ≥ s₁ Narrowing (fewer integer digits or smaller scale)
Int→Decimal p ≥ min_int_digits + s (e.g. Int32 needs Decimal(12,2)) p insufficient for full integer range
Date↔Int Date32↔Int32, Date64↔Int64 (same bit width)

Key design decisions

  1. is_integer_string_safe(from, to, op) — operator-gated. Only allows Int→String direction; blocks String→Int (multiple strings map to same integer). Inequality operators (Lt/Gt) blocked because string ordering differs from integer ordering.
  2. is_decimal_safe — widening-based. p₂−s₂ ≥ p₁−s₁ (integer digits not reduced) and s₂ ≥ s₁ (scale not reduced). Integer→Decimal requires p ≥ min_int_digits + s to prevent column-value overflow.
  3. is_dictionary_string_widening_safe — all directions allowed. String constants can't overflow; round-trip check catches actual truncation.
  4. is_integer_widening_safe — includes Date32↔Int32 and Date64↔Int64 (same bit width internally).
  5. Literal round-trip checklit → from_type → to_type must equal original, same type. Catches non-boundary timestamp literals, out-of-range integer literals, and fractional decimal literals.
  6. try_cast_string_literaltry_cast_string_like_literal — uses ScalarValue::cast_to for Int↔String consistency instead of manual parse/format.

Behavior comparison table

Expression Before (default-allow) After (allowlist) Notes
CAST(c1:Int32 AS Int64) < 10 unwrapped unwrapped Integer widening, literal fits
CAST(c1:Int32 AS UTF8) = '123' unwrapped unwrapped Int→String with Eq, round-trip safe
CAST(c1:Int32 AS UTF8) < '123' unwrapped kept String ordering ≠ int ordering for inequalities
CAST(c1:Int32 AS UTF8) = '0123' unwrapped kept "0123"→123→"123"≠"0123", round-trip fails
CAST(c1:Int32 AS Decimal(10,2)) = 123.00 unwrapped kept p−s=8 < 10 integer digits needed
CAST(c1:Int32 AS Decimal(12,2)) = 123.00 unwrapped unwrapped p−s=10 ≥ 10, safe
CAST(c6:UInt64 AS Decimal(20,0)) < 5 unwrapped unwrapped p=20 ≥ 20 integer digits needed
CAST(c3:Decimal(18,2) AS Decimal(18,1)) = 123 unwrapped kept Scale narrowing, truncation risk
CAST(c3:Decimal(10,2) AS Decimal(18,4)) = 123 unwrapped unwrapped Scale widening, safe
CAST(c3:Decimal(18,2) AS Int64) = 16 unwrapped kept Decimal→Int truncates fractional part
CAST(tag:Dict AS Utf8) = 'value' unwrapped unwrapped Dict↔String, any direction safe
CAST(str:Utf8 AS LargeUtf8) = 'value' unwrapped unwrapped String widening
CAST(ts_ns AS timestamp(ms)) = lit_ms unwrapped kept ⭐ Timestamp precision downcast is many→one
CAST(ts_ms AS timestamp(ns)) = lit_ns_boundary unwrapped unwrapped Literal round-trips exactly
CAST(ts_ms AS timestamp(ns)) = lit_ns_off_boundary unwrapped kept Off-boundary ns literal fails round-trip
CAST(f:Float64 AS Int32) < 1 unwrapped kept Float→int is lossy
CAST(Date32_col AS Int32) = 123 unwrapped unwrapped Date32 = Int32 internally

What changes are included in this PR?

  • datafusion/expr-common/src/casts.rs — Core allowlist + is_integer_string_safe(op-gated) + is_dictionary_string_widening_safe + is_decimal_safe(widen-based, Int→Dec precision check) + try_cast_string_like_literal (uses ScalarValue::cast_to)
  • datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs — Updated tests for Int→String, Dict/String, Decimal widening, Integer→Decimal
  • datafusion/physical-expr/src/simplifier/unwrap_cast.rs — Updated tests for Decimal narrowing blocked, Dict wrapping
  • datafusion/sqllogictest/test_files/unwrap_cast.slt — Updated plan expectations for all unwrap changes

Are these changes tested?

New test coverage added:

datafusion/expr-common/src/casts.rs — allowlist test (test_comparison_unwrap_cast_allowlist):

  • Integer→String with Eq (allowed) and Lt (blocked by operator gate)
  • String→Integer blocked (e.g. '0123' and '+123' map to same integer)
  • Dictionary/String any direction allowed
  • Decimal widening: Dec(10,2)→Dec(18,4) allowed; Dec(18,4)→Dec(18,2) scale narrowing blocked
  • Integer→Decimal: Int32→Dec(18,4) allowed (18≥10+4=14); Int32→Dec(9,2) blocked (9<10+2=12)
  • Decimal→Integer blocked
  • Date32↔Int32, Date64↔Int64 (same bit width)

datafusion/expr-common/src/casts.rs — literal round-trip test (test_comparison_unwrap_literal_cast_safety):

  • Int→String Eq/NotEq unwraps; Lt blocked; '0123' round-trip failure blocked
  • Decimal widening cross-type: Decimal64(10,2)→Decimal128(20,4) unwraps; narrowing blocked
  • Int32→Decimal(12,2) unwraps (p−s=10≥10); Int32→Decimal(10,2) blocked (p−s=8<10)
  • Int64→Decimal(38,0)→Int64: Decimal→Integer direction still blocked

datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs:

  • Int→String Eq/NotEq unwraps (cf. test_unwrap_cast_comparison, _unsigned)
  • Dictionary wrapping/unwrapping now unwraps (test_unwrap_cast_comparison_string)
  • Decimal InList widening unwraps (test_not_unwrap_list_cast_comparison)
  • Int32→Decimal(12,2) unwraps; Int32→Decimal(10,2) blocked (test_unwrap_cast_with_decimal_lit_comparison)

datafusion/physical-expr/src/simplifier/unwrap_cast.rs:

  • Decimal narrowing (18,2→18,1, 18,2→10,2, 18,2→12,4) blocked (test_unwrap_lossy_decimal_precision_casts)
  • Int64→Decimal(10,2) blocked (p=10<19) (test_no_unwrap_source_domain_reducing_casts)
  • Dictionary wrapping unwraps

datafusion/sqllogictest/test_files/unwrap_cast.slt:

  • Integer-to-decimal kept when precision insufficient
  • Plan expectations updated for all unwrap changes

Are there any user-facing changes?

Plan output differences:

  • Int→String: CAST(c1 AS Utf8) = '123'c1 = Int32(123) (Eq/NotEq only, unwrapped)
  • Dict/String: CAST(tag AS Utf8) = 'value'tag = Utf8("value") (unwrapped)
  • Decimal widening: CAST(c AS Dec(18,4)) = 123 where column is Dec(10,2) → unwrapped
  • Integer→Decimal: CAST(c AS Dec(12,2)) = 123 where c is Int32 → unwrapped (p-s ≥ 10)
  • Integer→Decimal: CAST(c AS Dec(10,2)) = 123 where c is Int32 → kept (p-s=8 < 10)
  • All other non-allowlisted casts: CAST kept in plan. Query semantics unchanged.

Footnotes

  1. Cross-timezone is blocked because cast_between_timestamp currently only adjusts precision units (ms→ns), not timezone offsets. Same i64 values in different timezones represent different instants but would falsely pass the round-trip check. A future fix would handle the timezone offset in cast_between_timestamp. A TODO is left in the code.

@github-actions github-actions Bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules labels Apr 29, 2026
Copy link
Copy Markdown
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an end-to-end example (e.g. SLT) that showcases the bug being fixed here?

@discord9
Copy link
Copy Markdown
Contributor Author

Is there an end-to-end example (e.g. SLT) that showcases the bug being fixed here?

will add now, althought I just realize that int 32->int16 could also be lossy cast so need fix too

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2026
@discord9
Copy link
Copy Markdown
Contributor Author

@Jefffrey added slt test now, this pr might seem a little large now, but it's mostly tests and simple conservative judgment criteria and should be safe

Copy link
Copy Markdown
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution!

As @Jefffrey suggests, this PR would be easier to review if you clarified the behavior you want to change. For example, "forbid lossy cast" is quite vague about specifically which casts were previously allowed, why those casts are lossy, and which casts will now be disallowed as a result of this PR. It would also be nice to create a separate GitHub issue, as the PR template requests.

I wonder whether we want closed-by-default or open-by-default behavior. That is, instead of this approach, we could start by assuming that all casts are unsafe and then enumerate the specific casts that we want to allow. That would be safer, e.g., in the event of new types or new casts being added. Implementation might be simpler as well.

Comment thread datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs Outdated
@discord9
Copy link
Copy Markdown
Contributor Author

Thanks for this contribution!

As @Jefffrey suggests, this PR would be easier to review if you clarified the behavior you want to change. For example, "forbid lossy cast" is quite vague about specifically which casts were previously allowed, why those casts are lossy, and which casts will now be disallowed as a result of this PR. It would also be nice to create a separate GitHub issue, as the PR template requests.

I wonder whether we want closed-by-default or open-by-default behavior. That is, instead of this approach, we could start by assuming that all casts are unsafe and then enumerate the specific casts that we want to allow. That would be safer, e.g., in the event of new types or new casts being added. Implementation might be simpler as well.

sounds reasonable, how about for now only allow downcast like optimize cast(ts_ms)=lit_ns to ts_ms==lit_ms, does sound much safer and feels like the original intend

@github-actions github-actions Bot added the core Core DataFusion crate label May 13, 2026
@discord9 discord9 changed the title fix: forbid lossy cast(for now) fix: replace default-allow cast unwrap with closed-by-default allowlist (timestamp + integer widening) May 13, 2026
discord9 added 14 commits May 14, 2026 15:21
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Replace default-allow cast unwrapping with an explicit allowlist in
is_supported_comparison_unwrap_cast. Only verified-safe type pairs are
allowed; the literal must also round-trip exactly.

Allowed families:
- Timestamp widening (same timezone, same-or-finer precision)
- Integer widening (same-sign, unsigned->larger signed, identity)
- Binary widening (FixedSizeBinary->Binary, identity)
- Integer->String (Eq/NotEq/IsDistinctFrom only; string ordering
  differs from integer ordering for inequalities)
- Dictionary/String family (any direction, round-trip catches
  truncation)
- Decimal widening (int-digits not reduced, scale not reduced)
- Integer->Decimal (when p-s >= integer digit count)
- Date32<->Int32, Date64<->Int64 (same bit width)

Explicitly blocked:
- String->Integer (multiple strings map to same int)
- Integer narrowing
- Signed->unsigned
- Timestamp cross-timezone
- Decimal narrowing
- Decimal->Integer (truncates fractional part)

try_cast_string_literal renamed to try_cast_string_like_literal
and uses ScalarValue::cast_to for int<->string consistency.
is_integer_string_safe takes op for inequality gate.

Tests updated in casts.rs, optimizer, physical simplifier, and SLT.
@discord9 discord9 force-pushed the fix/lossy-unwrap-cast-comparisons branch from b813951 to 0915c30 Compare May 15, 2026 04:33
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 15, 2026
@discord9 discord9 marked this pull request as draft May 15, 2026 05:01
@discord9 discord9 force-pushed the fix/lossy-unwrap-cast-comparisons branch from dccf390 to c5c4af7 Compare May 15, 2026 05:01
@github-actions github-actions Bot removed the documentation Improvements or additions to documentation label May 15, 2026
@discord9 discord9 force-pushed the fix/lossy-unwrap-cast-comparisons branch from c5c4af7 to a4447a1 Compare May 15, 2026 05:02
@discord9 discord9 force-pushed the fix/lossy-unwrap-cast-comparisons branch from 4a7272e to 2f065b2 Compare May 15, 2026 11:37
@discord9
Copy link
Copy Markdown
Contributor Author

add a lot of allow list to avoid too much regression

@discord9 discord9 marked this pull request as ready for review May 15, 2026 11:41
Signed-off-by: discord9 <discord9@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unsafe comparison cast rewriting in ExprSimplifier silently produces wrong query results

3 participants