fix: replace default-allow cast unwrap with closed-by-default allowlist (timestamp + integer widening)#21908
fix: replace default-allow cast unwrap with closed-by-default allowlist (timestamp + integer widening)#21908discord9 wants to merge 20 commits into
Conversation
Jefffrey
left a comment
There was a problem hiding this comment.
Is there an end-to-end example (e.g. SLT) that showcases the bug being fixed here?
will add now, althought I just realize that int 32->int16 could also be lossy cast so need fix too |
|
@Jefffrey added slt test now, this pr might seem a little large now, but it's mostly tests and simple conservative judgment criteria and should be safe |
neilconway
left a comment
There was a problem hiding this comment.
Thanks for this contribution!
As @Jefffrey suggests, this PR would be easier to review if you clarified the behavior you want to change. For example, "forbid lossy cast" is quite vague about specifically which casts were previously allowed, why those casts are lossy, and which casts will now be disallowed as a result of this PR. It would also be nice to create a separate GitHub issue, as the PR template requests.
I wonder whether we want closed-by-default or open-by-default behavior. That is, instead of this approach, we could start by assuming that all casts are unsafe and then enumerate the specific casts that we want to allow. That would be safer, e.g., in the event of new types or new casts being added. Implementation might be simpler as well.
sounds reasonable, how about for now only allow downcast like optimize |
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: discord9 <discord9@163.com>
Replace default-allow cast unwrapping with an explicit allowlist in is_supported_comparison_unwrap_cast. Only verified-safe type pairs are allowed; the literal must also round-trip exactly. Allowed families: - Timestamp widening (same timezone, same-or-finer precision) - Integer widening (same-sign, unsigned->larger signed, identity) - Binary widening (FixedSizeBinary->Binary, identity) - Integer->String (Eq/NotEq/IsDistinctFrom only; string ordering differs from integer ordering for inequalities) - Dictionary/String family (any direction, round-trip catches truncation) - Decimal widening (int-digits not reduced, scale not reduced) - Integer->Decimal (when p-s >= integer digit count) - Date32<->Int32, Date64<->Int64 (same bit width) Explicitly blocked: - String->Integer (multiple strings map to same int) - Integer narrowing - Signed->unsigned - Timestamp cross-timezone - Decimal narrowing - Decimal->Integer (truncates fractional part) try_cast_string_literal renamed to try_cast_string_like_literal and uses ScalarValue::cast_to for int<->string consistency. is_integer_string_safe takes op for inequality gate. Tests updated in casts.rs, optimizer, physical simplifier, and SLT.
b813951 to
0915c30
Compare
dccf390 to
c5c4af7
Compare
c5c4af7 to
a4447a1
Compare
Signed-off-by: discord9 <discord9@163.com>
4a7272e to
2f065b2
Compare
|
add a lot of allow list to avoid too much regression |
Signed-off-by: discord9 <discord9@163.com>
Which issue does this PR close?
Closes #22142.
Rationale for this change
The previous
unwrap_castoptimization was default-allow: it would move the cast from theexpression side to the literal side for virtually any
CAST(column AS T) <op> literal_Tpattern.This is semantically unsafe for many casts — the rewritten predicate can match different rows
than the original.
This PR replaces the default-allow model with a closed-by-default allowlist. Only cast
patterns proven to preserve comparison semantics are eligible. The literal must also
round-trip exactly through the source type and back.
Allowlist (8 families)
p₂−s₂ ≥ p₁−s₁ ∧ s₂ ≥ s₁Key design decisions
is_integer_string_safe(from, to, op)— operator-gated. Only allows Int→String direction; blocks String→Int (multiple strings map to same integer). Inequality operators (Lt/Gt) blocked because string ordering differs from integer ordering.is_decimal_safe— widening-based.p₂−s₂ ≥ p₁−s₁(integer digits not reduced) ands₂ ≥ s₁(scale not reduced). Integer→Decimal requiresp ≥ min_int_digits + sto prevent column-value overflow.is_dictionary_string_widening_safe— all directions allowed. String constants can't overflow; round-trip check catches actual truncation.is_integer_widening_safe— includes Date32↔Int32 and Date64↔Int64 (same bit width internally).lit → from_type → to_typemust equal original, same type. Catches non-boundary timestamp literals, out-of-range integer literals, and fractional decimal literals.try_cast_string_literal→try_cast_string_like_literal— usesScalarValue::cast_tofor Int↔String consistency instead of manual parse/format.Behavior comparison table
CAST(c1:Int32 AS Int64) < 10CAST(c1:Int32 AS UTF8) = '123'CAST(c1:Int32 AS UTF8) < '123'CAST(c1:Int32 AS UTF8) = '0123'CAST(c1:Int32 AS Decimal(10,2)) = 123.00CAST(c1:Int32 AS Decimal(12,2)) = 123.00CAST(c6:UInt64 AS Decimal(20,0)) < 5CAST(c3:Decimal(18,2) AS Decimal(18,1)) = 123CAST(c3:Decimal(10,2) AS Decimal(18,4)) = 123CAST(c3:Decimal(18,2) AS Int64) = 16CAST(tag:Dict AS Utf8) = 'value'CAST(str:Utf8 AS LargeUtf8) = 'value'CAST(ts_ns AS timestamp(ms)) = lit_msCAST(ts_ms AS timestamp(ns)) = lit_ns_boundaryCAST(ts_ms AS timestamp(ns)) = lit_ns_off_boundaryCAST(f:Float64 AS Int32) < 1CAST(Date32_col AS Int32) = 123What changes are included in this PR?
datafusion/expr-common/src/casts.rs— Core allowlist +is_integer_string_safe(op-gated) +is_dictionary_string_widening_safe+is_decimal_safe(widen-based, Int→Dec precision check) +try_cast_string_like_literal(usesScalarValue::cast_to)datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs— Updated tests for Int→String, Dict/String, Decimal widening, Integer→Decimaldatafusion/physical-expr/src/simplifier/unwrap_cast.rs— Updated tests for Decimal narrowing blocked, Dict wrappingdatafusion/sqllogictest/test_files/unwrap_cast.slt— Updated plan expectations for all unwrap changesAre these changes tested?
New test coverage added:
datafusion/expr-common/src/casts.rs— allowlist test (test_comparison_unwrap_cast_allowlist):Dec(10,2)→Dec(18,4)allowed;Dec(18,4)→Dec(18,2)scale narrowing blockedInt32→Dec(18,4)allowed (18≥10+4=14);Int32→Dec(9,2)blocked (9<10+2=12)datafusion/expr-common/src/casts.rs— literal round-trip test (test_comparison_unwrap_literal_cast_safety):Decimal64(10,2)→Decimal128(20,4)unwraps; narrowing blockeddatafusion/optimizer/src/simplify_expressions/unwrap_cast.rs:test_unwrap_cast_comparison,_unsigned)test_unwrap_cast_comparison_string)test_not_unwrap_list_cast_comparison)test_unwrap_cast_with_decimal_lit_comparison)datafusion/physical-expr/src/simplifier/unwrap_cast.rs:18,2→18,1,18,2→10,2,18,2→12,4) blocked (test_unwrap_lossy_decimal_precision_casts)test_no_unwrap_source_domain_reducing_casts)datafusion/sqllogictest/test_files/unwrap_cast.slt:Are there any user-facing changes?
Plan output differences:
CAST(c1 AS Utf8) = '123'→c1 = Int32(123)(Eq/NotEq only, unwrapped)CAST(tag AS Utf8) = 'value'→tag = Utf8("value")(unwrapped)CAST(c AS Dec(18,4)) = 123where column is Dec(10,2) → unwrappedCAST(c AS Dec(12,2)) = 123where c is Int32 → unwrapped (p-s ≥ 10)CAST(c AS Dec(10,2)) = 123where c is Int32 → kept (p-s=8 < 10)Footnotes
Cross-timezone is blocked because
cast_between_timestampcurrently only adjusts precision units (ms→ns), not timezone offsets. Same i64 values in different timezones represent different instants but would falsely pass the round-trip check. A future fix would handle the timezone offset incast_between_timestamp. A TODO is left in the code. ↩