Remove Empty Chunks in Cloudfetch concatenation#814
Open
jprakash-db wants to merge 2 commits into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What type of PR is this?
Description
ThriftResultSet.fetchmany_arrowandfetchall_arrowpreviously appended every chunkreturned by the underlying queue — including the 0-row placeholder that
CloudFetchQueue._create_empty_table()emits whenself.table is None— intopartial_result_chunks, and then handed the list toconcat_table_chunks.When the placeholder's schema differs from the real downloaded chunks (which it can,
because it is built from
schema_bytesthat may be stale, or schemaless whenschema_bytes is None),pyarrow.concat_tables(..., promote_options="default")silently introduces phantom columns filled with NULLs, or — for type mismatches —
raises.
Fix: hold any 0-row chunk aside in a local
zero_row_tableinstead of appending it.The concat list now only ever contains real chunks, which share a consistent schema.
If every chunk turned out to be 0-row (genuinely empty result set), fall back to
appending the held-aside placeholder so the method still returns a valid
pyarrow.Table.CloudFetchQueueitself is unchanged. The columnar fetch paths are unchanged(they only run with
ColumnQueue, whose empty slices always carry the right schema).The SEA arrow methods are unchanged (single queue call per invocation, no concat).
How is this tested?
Added 4 regression tests in
tests/unit/test_fetches.py:test_fetchall_arrow_drops_mismatched_empty_placeholder— first queue returns a0-row placeholder with column
stale_col; second returns real data withcol0.Asserts the result has only
col0. Verified to fail against the pre-fix code with['stale_col', 'col0'] != ['col0'].test_fetchmany_arrow_drops_mismatched_empty_placeholder— same forfetchmany_arrow.test_fetchall_arrow_all_empty_returns_zero_row_table— every chunk is 0-row;asserts a
pyarrow.Tablewithnum_rows == 0is returned (the held-aside fallback fires).test_fetchmany_arrow_all_empty_returns_zero_row_table— same forfetchmany_arrow.Full unit suite (
pytest tests/unit -x): 765 passed, 4 skipped.Integration / e2e to be run separately.