Skip to content

Slow logical planning (>1s) for regexp_like with nested {N} quantifiers #22185

@Dandandan

Description

@Dandandan

Describe the bug

A SQL statement containing regexp_like(..., 'a{5}{5}{5}{5}{5}{5}{5}{5}') (8+ nested counted-repetition quantifiers) takes >1 second to plan (logical + physical planning, no execution required). At higher nesting, planning still takes ~1 s because the regex engine hits a built-in size limit and bails — but the limit itself takes ~1 s to reach.

This is a planning-time DoS: a 60-byte SQL string from an untrusted source can stall a query node for a second or more before any data is touched.

To Reproduce

Using the public SessionContext::sql API (no execution required):

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    let sql = "SELECT regexp_like('aaaaa', 'a{5}{5}{5}{5}{5}{5}{5}{5}');";
    let t0 = std::time::Instant::now();
    let df = ctx.sql(sql).await.unwrap();
    let _ = df.create_physical_plan().await;
    println!("planning took {:?}", t0.elapsed());
}

Output on M-series Mac, release build:

planning took ~1.0 s

Scaling with nesting count N (measured via the fuzz target's DATAFUSION_FUZZ_PLANNING_TIMEOUT_MS):

{5} nestings planning time
4 6 ms
6 17 ms
8 1008 ms (regex size limit hit)
10 1005 ms
12 757 ms
18 ~1 s (original fuzzer find)

Note that a{5}{5} is treated as (a{5}){5} (or similar) — the regex crate expands the inner counted repetition before noticing the size, so the cost is roughly exponential up to the crate's size_limit (~10 MB by default).

Expected behavior

Planning should either:

  • reject the pattern in <1 ms with a "regex too complex" error, or
  • defer regex compilation entirely to execution time, leaving planning fast.

A 60-byte SQL string should not be able to consume a CPU-second of planning work.

Additional context

Discovered by a cargo fuzz target (fuzz/fuzz_targets/sql_physical_plan.rs) seeded with SQL extracted from datafusion/sqllogictest/test_files/. The fuzz harness panics if planning exceeds 1 s, which is how this surfaced:

physical planning exceeded 1000 ms after 1041 ms for SQL:
SELECT regexp_like('aaaaa', 'a{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5}{5});

Probable root cause is the simplification pass constant-folding regexp_like literals via regex::Regex::new, which is O(product-of-quantifiers) on patterns like a{5}{5}…. The size limit in the regex crate eventually trips, but at a wall-time cost of ~1 s for the smallest triggering input.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions