Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799
Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799kongfanshen-0801 wants to merge 5 commits into
Conversation
|
Thanks for the back porting. Could we keep the original commit history, thanks |
a869ab3 to
4328d2f
Compare
|
Thanks for the review! I've restructured the branch to preserve the original commit history:
The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks! |
Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21 |
5dc8186 to
8ad6a81
Compare
Done — ran TPC-H Q21 before/after on the Setup: SF=100 (lineitem 600,037,902 rows), GPORCA, 3-segment single-host demo cluster, 3 timed runs each.
Q21's
So the right-semi shape shrinks the build side from ~200M to ~449K rows, cuts the semijoin hash spill from 512 → 2 batches, and drops requested End-to-end wall-clock is ~3% better here because on this small cluster the total is dominated by the three I also added an ORCA regression test ( |
|
Follow-up on the Q21 numbers above — clarifying why the end-to-end gain is only ~3%, and showing the isolated win. Where Q21's time actually goes (row executor)Q21 scans
Right-semi only changes the build side of the outer semijoin (
The 200M Isolated semijoin: ~4× when it's the dominant costRemoving the 3× full scans (small 10k-row LHS vs a unique 150M-row RHS,
→ ~3.9× faster (≈20 s vs ≈77 s) once the build-side choice is the bottleneck. Row vs vectorizedIn the row executor, steps 1 + 3 (scanning/deforming ~452M rows and row-at-a-time hash/probe) dominate and dilute the win. In a vectorized/columnar engine those steps get much cheaper (column-pruned, batched), so the relative weight of avoiding the 200M-row build+spill grows and the Q21 wall-clock gain becomes more pronounced — consistent with the larger Q21 improvement seen on the vectorized build this was ported from. Summary: the right-semi plan is genuinely chosen and the per-operator win is large (~4× isolated, 200M→449K build, 512→2 batches, 8.6 GB→14 MB); it just sits behind scan/deform-bound work in Q21 on the row executor, hence ~3% end-to-end here. |
34dfda2 to
b5aeefd
Compare
|
Re-ran Q21 (SF=100) at a production-level
At realistic memory the off plan spills the 200M-row semijoin build (8 batches), while on keeps it fully in memory (1 batch) → ~10% end-to-end and total Memory wanted 60 GB → 38 GB. (At the tiny default The end-to-end gain is bounded because Q21 is scan-bound in the row executor: the three On bumping the scale further: SF=500 isn't feasible on this box — the loaded SF=100 set is already ~216 GB, so SF=500 would be ~1 TB+ of table data alone and exceeds the disk. More importantly, in the row executor a larger SF mostly scales both plans together (still scan-bound), so it wouldn't move the ratio much; the spill-avoidance is best demonstrated by the build-side-vs- |
Larger-scale performance comparison (SF≈300, GPORCA)Loaded a larger TPC-H data set (lineitem 1.8B rows, orders 450M rows) to show the right-semi win at scale and the spill it avoids. Note on Q21 itself at this scale: with 1.8B-row To isolate the operator where right-semi is the right choice — a small probe-side driving table against a large, unique build key — I used select count(*) from skeys s where s.k in (select o_orderkey from orders); -- 10K vs 450M
→ ~4.6× faster (≈62 s vs ≈285 s). The off plan must de-duplicate the 450M-row side, spilling 4.1 GB across 1281 batches and asking for ~16.9 GB of work_mem; the right-semi plan builds the hash on the 3,385-row LHS, stays in memory (1 batch, 4 MB) and just streams The win grows with scale (≈3.9× at SF=100 → ≈4.6× here) and, more importantly, eliminates the large-side dedup spill entirely — which is exactly the case this backport targets. (SF=500 was attempted but the loaded heap tables exceed the 1 TB test disk before the load completes — physical size is ~2× the logical size — so this SF≈300 run is the largest that fits here.) |
Hash joins can support semijoin with the LHS input on the right, using the existing logic for inner join, combined with the assurance that only the first match for each inner tuple is considered, which can be achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag. This can be very useful in some cases since we may now have the option to hash the smaller table instead of the larger. Merge join could likely support "Right Semi Join" too. However, the benefit of swapping inputs tends to be small here, so we do not address that in this patch. Note that this patch also modifies a test query in join.sql to ensure it continues testing as intended. With this patch the original query would result in a right-semi-join rather than semi-join, compromising its original purpose of testing the fix for neqjoinsel's behavior for semi-joins. Author: Richard Guo Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com (cherry picked from commit aa86129)
When resetting a HashJoin node for rescans, if it is a single-batch join and there are no parameter changes for the inner subnode, we can just reuse the existing hash table without rebuilding it. However, for join types that depend on the inner-tuple match flags in the hash table, we need to reset these match flags to avoid incorrect results. This applies to right, right-anti, right-semi, and full joins. When I introduced "Right Semi Join" plan shapes in aa86129, I failed to reset the match flags in the hash table for right-semi joins in rescans. This oversight has been shown to produce incorrect results. This patch fixes it. Author: Richard Guo Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com (cherry picked from commit 5668a85)
This commit carries the Cloudberry/Greenplum-specific changes needed on top of the two cherry-picked upstream commits (aa86129, 5668a85), which only touch the upstream PostgreSQL planner/executor files. - nodes.h: move JOIN_RIGHT_SEMI to the END of the JoinType enum. Upstream places it next to JOIN_RIGHT_ANTI, but in the Cloudberry tree that shifts the integer values of the GPDB-specific JOIN_DEDUP_SEMI/REVERSE and JOIN_UNIQUE_* codes. Value-dependent code then corrupts MPP motion planning, producing a degenerate plan ("Gather Motion 0:1" / "Redistribute Motion 1:0") that crashes with SIGSEGV in setupCdbProcessList() during dispatch. Appending keeps every pre-existing enum value stable. - cdbpath.c (cdbpath_motion_for_join, both the serial and parallel switch): handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI. A right-semi join emits inner (build-side) rows, so the inner side must not be replicated, otherwise matched inner rows could be emitted more than once. Without this the new join type would hit the switch default and elog(ERROR, "unexpected join type") at plan time. Note: this feature is only exercised by the PostgreSQL planner (optimizer=off); GPORCA does not generate JOIN_RIGHT_SEMI.
ecd10e7 to
2fa5d61
Compare
Teach GPORCA to produce the PostgreSQL-18 Hash Right Semi / Right Anti Join plan shapes (build the hash table on the smaller left-hand side of a semi/anti join and probe with the larger right side), matching the Postgres planner backport. - New physical operators CPhysicalRightSemiHashJoin / CPhysicalRightAntiSemiHashJoin and xforms CXformLeftSemiJoin2RightSemiHashJoin / CXformLeftAntiSemiJoin2RightAntiSemiHashJoin (appended to the CXform enum to keep ABI stable), DXL operators/tokens (EdxljtRightSemijoin / EdxljtRightAntiSemijoin) and cost functions CostRightSemiHashJoin / CostRightAntiSemiHashJoin. - CTranslatorDXLToPlStmt swaps the build/probe side (fSwapBuildSide) so the LHS becomes the executor's inner/Hash child, mapping ORCA's plan onto the existing GPDB JOIN_RIGHT_SEMI/ANTI executor. - GUC optimizer_enable_right_semi_join (default on, registered in unsync_guc_name.h) gates the xforms via CConfigParamMapping; serves as a kill-switch and the on/off toggle for TPC-H Q21 before/after comparisons. Bug fixes folded in: - Avoid a SIGSEGV in ExecInitPartitionSelector: right-semi must not host a join-driven Partition Selector (dynamic partition elimination), since the build side is swapped; it now inherits CPhysical's base Ppps behaviour and a partitioned semijoin falls back to the regular Hash Semi Join. - Do not build a right semi/anti hash join for null-aware (IS NOT DISTINCT FROM) predicates (EXCEPT / NOT IN): those can reference types with a hash operator family but no usable hash function (e.g. money), which fails at plan time; fall back to the nested-loop path. - Relax the CostRightSemiHashJoin op-id assert (also reached via CostRightAntiSemiHashJoin) so GPOS_DEBUG builds do not abort. Add the ASF license header to the new GPORCA source files (Apache RAT).
7f06ddc to
661a48e
Compare
Add src/test/regress/sql/rightsemijoin.sql (registered in greenplum_schedule) with planner and GPORCA answer files asserting that optimizer_enable_right_semi_join flips the plan between Hash Right Semi/Anti Join and the regular Hash Semi/Anti Join, with identical results. Regenerate the answer files affected by the new Hash Right Semi/Anti Join plan shapes across the core, PAX (contrib/pax_storage) and singlenode regress suites (both planner .out and GPORCA _optimizer.out), including new partition_append_optimizer.out / partition_join_optimizer.out / rpt_joins_optimizer.out where GPORCA now diverges from the planner. Results are unchanged; only EXPLAIN plan shapes differ.
661a48e to
e0e62f4
Compare
What
Backports two upstream PostgreSQL commits that add Hash Right Semi Join:
aa86129e1— Support "Right Semi Join" plan shapes5668a857d— Fix right-semi-joins in HashJoin rescansThis lets the planner build the hash table on the smaller (LHS) side of an
IN/EXISTSsemijoin instead of always hashing the inner relation.Cloudberry/GPDB-specific adaptations
JOIN_RIGHT_SEMIis appended at the end of theJoinTypeenum rather than in upstream's mid-list position. Inserting it mid-list
shifts the integer values of the GPDB-only
JOIN_DEDUP_SEMI/_REVERSEandJOIN_UNIQUE_*codes, which corrupts MPP motion planning and crashes duringdispatch (SIGSEGV in
setupCdbProcessList). Appending keeps every existingvalue stable.
cdbpath_motion_for_join, serial + parallel switches):handle
JOIN_RIGHT_SEMIlikeJOIN_RIGHT/JOIN_RIGHT_ANTI— the inner(build) side must not be replicated, since a right-semi join emits
build-side rows.
JOIN_RIGHT_SEMIpath alongsideJOIN_SEMIwhilepreserving the existing GPDB
JOIN_DEDUP_SEMIhandling.Testing
Hash Right Semi Joinis chosen for small-build-side semijoins;results verified correct (dedup semantics, rescan correctness, MPP execution
across segments).
jointest expected output is still being reconciledagainst CI's canonical environment — hence this PR is opened as a draft.
Notes
optimizer=off); GPORCA does notgenerate
JOIN_RIGHT_SEMI.