DuckDBPyRelation::{order, limit, pl(lazy=True)} can cause panics and silently drop rows with polars #460

@OutSquareCapital

What happens?

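The interaction between polars and duckdb can silently drop rows and cause panics: on a relation backed by a registered polars DataFrame, order followed by limit returns 0 rows, and sorting and limiting the LazyFrame returned by pl(lazy=True) panics on execution.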

See the example below. The same issues occur on 1.5.2 and 1.5.3.dev24.

My current project aims to make polars <-> duckdb interop a first-class citizen, so I would be very keen to help with the polars_io module if documentation for it exists somewhere.

To Reproduce

Minimal repro:

import duckdb
import polars as pl


def main() -> None:
    from_df, from_python = _setup()
    print("---- Original data ----")
    print("From DataFrame:")
    from_df.show()
    print("From Python:")
    from_python.show()
    print("---- After transformation ----")
    print("From DataFrame sorted + limit -> 0 rows !")
    _test_rel(from_df)
    print("From Python sorted + limit -> OK !")
    _test_rel(from_python)
    print("Conversion to DataFrame and execution there -> OK !")
    _test_pl(from_df, lazy=False)
    print("Conversion to LazyFrame and execution there -> will crash on show !")
    _test_pl(from_df, lazy=True)


def _test_rel(rel: duckdb.DuckDBPyRelation) -> None:
    rel_sorted = rel.order("x ASC NULLS FIRST").limit(1)
    print(rel_sorted.explain())
    rel_sorted.show()
    print("-" * 50)


def _test_pl(rel: duckdb.DuckDBPyRelation, *, lazy: bool) -> None:
    lf = rel.pl(lazy=lazy).lazy().sort("x").limit(1)
    print(lf.explain(format="tree"))
    lf.show()
    print("-" * 50)


def _setup() -> tuple[duckdb.DuckDBPyRelation, duckdb.DuckDBPyRelation]:
    df = pl.DataFrame({"x": [3, 1, 2]})
    conn = duckdb.register("df_registered", df)
    return conn.table("df_registered"), conn.sql("SELECT unnest([3, 1, 2]) AS x")


if __name__ == "__main__":
    main()

Output:

PS C:\Users\tibo\python_codes\pql> uv run t.py
---- Original data ----
From DataFrame:
┌───────┐
│   x   │
│ int64 │
├───────┤
│     3 │
│     1 │
│     2 │
└───────┘

From Python:
┌───────┐
│   x   │
│ int32 │
├───────┤
│     3 │
│     1 │
│     2 │
└───────┘

---- After transformation ----
From DataFrame sorted + limit -> 0 rows !
┌───────────────────────────┐
│           TOP_N           │
│    ────────────────────   │
│           Top: 1          │
│                           │
│         Order By:         │
│    df_registered.x ASC    │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         ARROW_SCAN        │
│    ────────────────────   │
│    Function: ARROW_SCAN   │
│       Projections: x      │
│                           │
│          Filters:         │
│   optional: x IS NULL OR  │
│     Dynamic Filter (x)    │
│                           │
│           ~1 row          │
└───────────────────────────┘


┌────────┐
│   x    │
│ int64  │
└────────┘
  0 rows

--------------------------------------------------
From Python sorted + limit -> OK !
┌───────────────────────────┐
│           TOP_N           │
│    ────────────────────   │
│           Top: 1          │
│                           │
│         Order By:         │
│unnamed_relation_d474663ed5│
│        49ad5a.x ASC       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             x             │
│                           │
│           ~1 row          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           UNNEST          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         DUMMY_SCAN        │
└───────────────────────────┘


┌───────┐
│   x   │
│ int32 │
├───────┤
│     1 │
└───────┘

--------------------------------------------------
Conversion to DataFrame and execution there -> OK !
             0                        1                             2
   ┌────────────────────────────────────────────────────────────────────────────
   │
   │    ╭─────────╮
 0 │    │ SORT BY │
   │    ╰────┬┬───╯
   │         ││
   │         │╰───────────────────────╮
   │         │                        │
   │  ╭──────┴──────╮                 │
   │  │ expression: │             ╭───┴────╮
 1 │  │ col("x")    │             │ FILTER │
   │  ╰─────────────╯             ╰───┬┬───╯
   │                                  ││
   │                                  │╰────────────────────────────╮
   │                                  │                             │
   │                   ╭──────────────┴───────────────╮  ╭──────────┴──────────╮
   │                   │ predicate:                   │  │ FROM:               │
 2 │                   │ col("x").dynamic_predicate() │  │ DF ["x"]            │
   │                   ╰──────────────────────────────╯  │ PROJECT */1 COLUMNS │
   │                                                     ╰─────────────────────╯

shape: (1, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘
--------------------------------------------------
Conversion to LazyFrame and execution there -> will crash on show !
             0                                                     1
   ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   │
   │    ╭─────────╮
 0 │    │ SORT BY │
   │    ╰────┬┬───╯
   │         ││
   │         │╰────────────────────────────────────────────────────╮
   │         │                                                     │
   │         │         ╭───────────────────────────────────────────┴────────────────────────────────────────────╮
   │  ╭──────┴──────╮  │ SORT BY [slice: (0, 1, dynamic_pred: 66cb64c6-6ffe-4ca4-b3ef-e9b6ee8c96c1)] [col("x")] │
   │  │ expression: │  │   PYTHON SCAN []                                                                       │
 1 │  │ col("x")    │  │   PROJECT */1 COLUMNS                                                                  │
   │  ╰─────────────╯  │   SELECTION: col("x").dynamic_predicate()                                              │
   │                   ╰────────────────────────────────────────────────────────────────────────────────────────╯


thread 'tokio-runtime-worker' (17564) panicked at crates\polars-plan\src\plans\conversion\dsl_to_ir\expr_to_ir.rs:639:13:
internal error: entered unreachable code
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
Traceback (most recent call last):
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\duckdb\polars_io.py", line 307, in source_generator
    yield pl.from_arrow(record_batch).filter(predicate)  # type: ignore[arg-type,misc,unused-ignore]
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\dataframe\frame.py", line 5509, in filter
    .collect(optimizations=QueryOptFlags._eager())
     ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\_utils\deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\lazyframe\opt_flags.py", line 343, in wrapper
    return function(*args, **kwargs)
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\lazyframe\frame.py", line 2510, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: internal error: entered unreachable code

thread 'async-executor-1' (11736) panicked at crates\polars-stream\src\nodes\io_sources\batch.rs:107:18:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(9), "internal error: entered unreachable code", ...)
Traceback (most recent call last):
  File "C:\Users\tibo\python_codes\pql\t.py", line 44, in <module>
    main()
    ~~~~^^
  File "C:\Users\tibo\python_codes\pql\t.py", line 20, in main
    _test_pl(from_df, lazy=True)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tibo\python_codes\pql\t.py", line 33, in _test_pl
    lf.show()
    ~~~~~~~^^
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\lazyframe\frame.py", line 9680, in show
    self.head(limit).collect(engine="streaming").show(
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\_utils\deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\lazyframe\opt_flags.py", line 343, in wrapper
    return function(*args, **kwargs)
  File "C:\Users\tibo\python_codes\pql\.venv\Lib\site-packages\polars\lazyframe\frame.py", line 2510, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(9), "internal error: entered unreachable code", ...)

OS:

Windows

DuckDB Package Version:

v1.5.3.dev24

Python Version:

3.13.7

Full Name:

Stettler Thibaud

Affiliation:

None

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
