Skip to content

Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799

Open
kongfanshen-0801 wants to merge 5 commits into
apache:mainfrom
kongfanshen-0801:feature/right-semi-join-backport
Open

Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799
kongfanshen-0801 wants to merge 5 commits into
apache:mainfrom
kongfanshen-0801:feature/right-semi-join-backport

Conversation

@kongfanshen-0801

Copy link
Copy Markdown
Contributor

What

Backports two upstream PostgreSQL commits that add Hash Right Semi Join:

  • aa86129e1 — Support "Right Semi Join" plan shapes
  • 5668a857d — Fix right-semi-joins in HashJoin rescans

This lets the planner build the hash table on the smaller (LHS) side of an
IN/EXISTS semijoin instead of always hashing the inner relation.

Cloudberry/GPDB-specific adaptations

  • nodes.h: JOIN_RIGHT_SEMI is appended at the end of the JoinType
    enum rather than in upstream's mid-list position. Inserting it mid-list
    shifts the integer values of the GPDB-only JOIN_DEDUP_SEMI/_REVERSE and
    JOIN_UNIQUE_* codes, which corrupts MPP motion planning and crashes during
    dispatch (SIGSEGV in setupCdbProcessList). Appending keeps every existing
    value stable.
  • cdbpath.c (cdbpath_motion_for_join, serial + parallel switches):
    handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI — the inner
    (build) side must not be replicated, since a right-semi join emits
    build-side rows.
  • joinrels.c: add the JOIN_RIGHT_SEMI path alongside JOIN_SEMI while
    preserving the existing GPDB JOIN_DEDUP_SEMI handling.

Testing

  • Functional: Hash Right Semi Join is chosen for small-build-side semijoins;
    results verified correct (dedup semantics, rescan correctness, MPP execution
    across segments).
  • Regression: the join test expected output is still being reconciled
    against CI's canonical environment — hence this PR is opened as a draft.

Notes

  • Only affects the PostgreSQL planner (optimizer=off); GPORCA does not
    generate JOIN_RIGHT_SEMI.

@kongfanshen-0801 kongfanshen-0801 marked this pull request as ready for review June 3, 2026 08:26
@my-ship-it

Copy link
Copy Markdown
Contributor

Thanks for the back porting. Could we keep the original commit history, thanks

@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from a869ab3 to 4328d2f Compare June 4, 2026 02:04
@kongfanshen-0801

Copy link
Copy Markdown
Contributor Author

Thanks for the review! I've restructured the branch to preserve the original commit history:

  • The two upstream PostgreSQL commits are now cherry-picked with their original author (Richard Guo), original commit messages, and (cherry picked from commit ...) provenance lines preserved:
    • Support "Right Semi Join" plan shapes (aa86129)
    • Fix right-semi-joins in HashJoin rescans (5668a85)
  • The Cloudberry/Greenplum-specific adaptations (moving JOIN_RIGHT_SEMI to the end of the JoinType enum to keep the GPDB enum values ABI-stable, and handling the new join type in cdbpath_motion_for_join) are isolated in a separate follow-up commit.

The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks!

@my-ship-it

Copy link
Copy Markdown
Contributor

Thanks for the review! I've restructured the branch to preserve the original commit history:

  • The two upstream PostgreSQL commits are now cherry-picked with their original author (Richard Guo), original commit messages, and (cherry picked from commit ...) provenance lines preserved:

    • Support "Right Semi Join" plan shapes (aa86129)
    • Fix right-semi-joins in HashJoin rescans (5668a85)
  • The Cloudberry/Greenplum-specific adaptations (moving JOIN_RIGHT_SEMI to the end of the JoinType enum to keep the GPDB enum values ABI-stable, and handling the new join type in cdbpath_motion_for_join) are isolated in a separate follow-up commit.

The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks!

Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21

@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch 4 times, most recently from 5dc8186 to 8ad6a81 Compare June 17, 2026 10:49
@kongfanshen-0801

Copy link
Copy Markdown
Contributor Author

Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21

Done — ran TPC-H Q21 before/after on the optimizer_enable_right_semi_join GUC.

Setup: SF=100 (lineitem 600,037,902 rows), GPORCA, 3-segment single-host demo cluster, 3 timed runs each.

optimizer_enable_right_semi_join semijoin plan runs (s) best
on Hash Right Semi Join 355.8 / 340.1 / 339.2 339.2
off Hash Semi Join 353.0 / 349.3 / 352.2 349.3

Q21's EXISTS (l2 ...) semijoin is where the two plans diverge:

  • onHash Right Semi Join: builds the hash on the small LHS (448,837 rows, 2 batches, ~17 MB) and streams lineitem l2 (200M rows).
  • offHash Semi Join: builds the hash on lineitem l2 (200,042,924 rows, 512 batches → heavy spill), Work_mem wanted: 8.6 GB.

So the right-semi shape shrinks the build side from ~200M to ~449K rows, cuts the semijoin hash spill from 512 → 2 batches, and drops requested work_mem from ~8.6 GB → ~14 MB.

End-to-end wall-clock is ~3% better here because on this small cluster the total is dominated by the three lineitem scans (l1/l2/l3) and the l3 anti-join spill that are common to both plans; the build-side / memory win should grow on larger, properly-sized clusters and with tighter memory settings.

I also added an ORCA regression test (src/test/regress/sql/rightsemijoin.sql, with both planner and _optimizer answer files) asserting the GUC flips the plan between right-semi/anti and the regular semi/anti shapes with identical results, and wrote the Q21 numbers into the commit message. PTAL, thanks!

@kongfanshen-0801

Copy link
Copy Markdown
Contributor Author

Follow-up on the Q21 numbers above — clarifying why the end-to-end gain is only ~3%, and showing the isolated win.

Where Q21's time actually goes (row executor)

Q21 scans lineitem three times and the cost is dominated by parts that are identical with the GUC on or off:

Hash Right Semi Join (l1 ⋈ l2)        <- the ONLY node right-semi changes
├─ Seq Scan lineitem l2 = 200,042,924
└─ Hash Anti Join (l1 ⋈ l3)
    ├─ Hash Join (l1 ⋈ supplier)
    │   └─ Seq Scan lineitem l1 = 126,464,414
    └─ Hash(build l3) = 126,464,414, Batches: 256 (spills)
        └─ Seq Scan lineitem l3 = 126,464,414
  1. Three lineitem scans + per-row tuple deform ≈ 452M rows (~3 × 74 GB I/O plus deforming 16 columns/row in the row executor) — the single biggest chunk.
  2. The l3 anti-join builds a 126M-row hash that spills to 256 batches — a separate large hash join that right-semi does not touch; identical on both sides.
  3. The two big joins' probe work (hashing/comparing 200M + 126M rows + the l_suppkey <> l_suppkey join filter).

Right-semi only changes the build side of the outer semijoin (l1 ⋈ l2):

semijoin build side spill work_mem wanted
on small LHS, 448,837 rows, 2 batches, in-mem none 14 MB
off lineitem l2, 200,042,924 rows, 512 batches ~heavy 8.6 GB

The 200M l2 rows are scanned either way (as probe when on, as build when off), so the only net delta is the cost of building + spilling that 200M-row hash. Relative to the 452M-row scans and the l3 256-batch spill, that delta is small → ~3% wall-clock, even though total "Memory wanted" drops 60 GB → 38 GB.

Isolated semijoin: ~4× when it's the dominant cost

Removing the 3× full scans (small 10k-row LHS vs a unique 150M-row RHS, orders.o_orderkey, GPORCA), so the semijoin is the cost:

optimizer_enable_right_semi_join plan exec time (2 runs)
on Hash Right Semi Join — build on 3,385-row LHS, no spill 19.7 / 19.7 s
off fallback HashAggregate(dedup 50M) + Hash Join — 641 batches, 1.5 GB spill, 4.6 GB wanted 77.1 / 78.4 s

~3.9× faster (≈20 s vs ≈77 s) once the build-side choice is the bottleneck.

Row vs vectorized

In the row executor, steps 1 + 3 (scanning/deforming ~452M rows and row-at-a-time hash/probe) dominate and dilute the win. In a vectorized/columnar engine those steps get much cheaper (column-pruned, batched), so the relative weight of avoiding the 200M-row build+spill grows and the Q21 wall-clock gain becomes more pronounced — consistent with the larger Q21 improvement seen on the vectorized build this was ported from.

Summary: the right-semi plan is genuinely chosen and the per-operator win is large (~4× isolated, 200M→449K build, 512→2 batches, 8.6 GB→14 MB); it just sits behind scan/deform-bound work in Q21 on the row executor, hence ~3% end-to-end here. optimizer_enable_right_semi_join defaults to on.

@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch 2 times, most recently from 34dfda2 to b5aeefd Compare June 18, 2026 03:21
@kongfanshen-0801

Copy link
Copy Markdown
Contributor Author

Re-ran Q21 (SF=100) at a production-level statement_mem (7 GB) instead of the small default, so the build-side difference shows up as a spill that the right-semi plan avoids:

optimizer_enable_right_semi_join semijoin build hash l3 anti-join (common) time (2 runs) Memory wanted
on 449K rows — 1 batch, no spill 126M rows, 4 batches 354.0 / 355.6 s 38 GB
off lineitem l2 200M rows — 8 batches, spills 1.14 GB 126M rows, 4 batches 388.3 / 397.7 s 60 GB

At realistic memory the off plan spills the 200M-row semijoin build (8 batches), while on keeps it fully in memory (1 batch) → ~10% end-to-end and total Memory wanted 60 GB → 38 GB. (At the tiny default statement_mem both plans spill heavily, which compresses the gap to ~3%.)

The end-to-end gain is bounded because Q21 is scan-bound in the row executor: the three lineitem scans (l1/l2/l3, ~452M rows) and the common l3 anti-join spill dominate runtime regardless of the GUC. Isolating the semijoin (small LHS vs a unique 150M-row RHS, so de-dup doesn't help the off plan) shows the operator-level win directly: ~20 s (on) vs ~77 s (off) ≈ 3.9×.

On bumping the scale further: SF=500 isn't feasible on this box — the loaded SF=100 set is already ~216 GB, so SF=500 would be ~1 TB+ of table data alone and exceeds the disk. More importantly, in the row executor a larger SF mostly scales both plans together (still scan-bound), so it wouldn't move the ratio much; the spill-avoidance is best demonstrated by the build-side-vs-statement_mem relationship above. The relative win grows in a vectorized executor where the common scan / tuple-deform cost is far cheaper.

@kongfanshen-0801

Copy link
Copy Markdown
Contributor Author

Larger-scale performance comparison (SF≈300, GPORCA)

Loaded a larger TPC-H data set (lineitem 1.8B rows, orders 450M rows) to show the right-semi win at scale and the spill it avoids.

Note on Q21 itself at this scale: with 1.8B-row lineitem, GPORCA does not pick a right-semi join for Q21's EXISTS (lineitem l2 …) — both sides of that semijoin are billion-row tables, so the cost model correctly keeps the regular Hash Semi Join (building the hash on the smaller LHS only pays off when the LHS is actually small). That is the cost model behaving sensibly, not a regression.

To isolate the operator where right-semi is the right choice — a small probe-side driving table against a large, unique build key — I used orders.o_orderkey (450M unique rows) probed by a 10K-row table:

select count(*) from skeys s where s.k in (select o_orderkey from orders);  -- 10K vs 450M
optimizer_enable_right_semi_join plan build side spill exec time (2 runs)
on Hash Right Semi Join 3,385-row LHS, 1 batch, no spill, 4 MB none 62 s / 66 s
off HashAggregate (dedup) + Hash Join dedup 150M rows, 1281 batches, disk 4.1 GB, work_mem wanted 16.9 GB 4.1 GB 285 s / 297 s

~4.6× faster (≈62 s vs ≈285 s). The off plan must de-duplicate the 450M-row side, spilling 4.1 GB across 1281 batches and asking for ~16.9 GB of work_mem; the right-semi plan builds the hash on the 3,385-row LHS, stays in memory (1 batch, 4 MB) and just streams orders.

The win grows with scale (≈3.9× at SF=100 → ≈4.6× here) and, more importantly, eliminates the large-side dedup spill entirely — which is exactly the case this backport targets.

(SF=500 was attempted but the loaded heap tables exceed the 1 TB test disk before the load completes — physical size is ~2× the logical size — so this SF≈300 run is the largest that fits here.)

Richard Guo and others added 3 commits June 30, 2026 10:06
Hash joins can support semijoin with the LHS input on the right, using
the existing logic for inner join, combined with the assurance that only
the first match for each inner tuple is considered, which can be
achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag.  This can be very
useful in some cases since we may now have the option to hash the
smaller table instead of the larger.

Merge join could likely support "Right Semi Join" too.  However, the
benefit of swapping inputs tends to be small here, so we do not address
that in this patch.

Note that this patch also modifies a test query in join.sql to ensure it
continues testing as intended.  With this patch the original query would
result in a right-semi-join rather than semi-join, compromising its
original purpose of testing the fix for neqjoinsel's behavior for
semi-joins.

Author: Richard Guo
Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li
Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com
(cherry picked from commit aa86129)
When resetting a HashJoin node for rescans, if it is a single-batch
join and there are no parameter changes for the inner subnode, we can
just reuse the existing hash table without rebuilding it.  However,
for join types that depend on the inner-tuple match flags in the hash
table, we need to reset these match flags to avoid incorrect results.
This applies to right, right-anti, right-semi, and full joins.

When I introduced "Right Semi Join" plan shapes in aa86129, I failed
to reset the match flags in the hash table for right-semi joins in
rescans.  This oversight has been shown to produce incorrect results.
This patch fixes it.

Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com
(cherry picked from commit 5668a85)
This commit carries the Cloudberry/Greenplum-specific changes needed on
top of the two cherry-picked upstream commits (aa86129, 5668a85),
which only touch the upstream PostgreSQL planner/executor files.

- nodes.h: move JOIN_RIGHT_SEMI to the END of the JoinType enum. Upstream
  places it next to JOIN_RIGHT_ANTI, but in the Cloudberry tree that shifts
  the integer values of the GPDB-specific JOIN_DEDUP_SEMI/REVERSE and
  JOIN_UNIQUE_* codes. Value-dependent code then corrupts MPP motion
  planning, producing a degenerate plan ("Gather Motion 0:1" /
  "Redistribute Motion 1:0") that crashes with SIGSEGV in
  setupCdbProcessList() during dispatch. Appending keeps every pre-existing
  enum value stable.

- cdbpath.c (cdbpath_motion_for_join, both the serial and parallel switch):
  handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI. A right-semi join
  emits inner (build-side) rows, so the inner side must not be replicated,
  otherwise matched inner rows could be emitted more than once. Without this
  the new join type would hit the switch default and elog(ERROR,
  "unexpected join type") at plan time.

Note: this feature is only exercised by the PostgreSQL planner
(optimizer=off); GPORCA does not generate JOIN_RIGHT_SEMI.
@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from ecd10e7 to 2fa5d61 Compare June 30, 2026 02:10
Teach GPORCA to produce the PostgreSQL-18 Hash Right Semi / Right Anti Join
plan shapes (build the hash table on the smaller left-hand side of a semi/anti
join and probe with the larger right side), matching the Postgres planner
backport.

- New physical operators CPhysicalRightSemiHashJoin / CPhysicalRightAntiSemiHashJoin
  and xforms CXformLeftSemiJoin2RightSemiHashJoin /
  CXformLeftAntiSemiJoin2RightAntiSemiHashJoin (appended to the CXform enum to
  keep ABI stable), DXL operators/tokens (EdxljtRightSemijoin /
  EdxljtRightAntiSemijoin) and cost functions CostRightSemiHashJoin /
  CostRightAntiSemiHashJoin.
- CTranslatorDXLToPlStmt swaps the build/probe side (fSwapBuildSide) so the LHS
  becomes the executor's inner/Hash child, mapping ORCA's plan onto the existing
  GPDB JOIN_RIGHT_SEMI/ANTI executor.
- GUC optimizer_enable_right_semi_join (default on, registered in
  unsync_guc_name.h) gates the xforms via CConfigParamMapping; serves as a
  kill-switch and the on/off toggle for TPC-H Q21 before/after comparisons.

Bug fixes folded in:
- Avoid a SIGSEGV in ExecInitPartitionSelector: right-semi must not host a
  join-driven Partition Selector (dynamic partition elimination), since the
  build side is swapped; it now inherits CPhysical's base Ppps behaviour and a
  partitioned semijoin falls back to the regular Hash Semi Join.
- Do not build a right semi/anti hash join for null-aware (IS NOT DISTINCT FROM)
  predicates (EXCEPT / NOT IN): those can reference types with a hash operator
  family but no usable hash function (e.g. money), which fails at plan time;
  fall back to the nested-loop path.
- Relax the CostRightSemiHashJoin op-id assert (also reached via
  CostRightAntiSemiHashJoin) so GPOS_DEBUG builds do not abort.

Add the ASF license header to the new GPORCA source files (Apache RAT).
@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from 7f06ddc to 661a48e Compare June 30, 2026 07:18
Add src/test/regress/sql/rightsemijoin.sql (registered in greenplum_schedule)
with planner and GPORCA answer files asserting that
optimizer_enable_right_semi_join flips the plan between Hash Right Semi/Anti
Join and the regular Hash Semi/Anti Join, with identical results.

Regenerate the answer files affected by the new Hash Right Semi/Anti Join plan
shapes across the core, PAX (contrib/pax_storage) and singlenode regress suites
(both planner .out and GPORCA _optimizer.out), including new
partition_append_optimizer.out / partition_join_optimizer.out /
rpt_joins_optimizer.out where GPORCA now diverges from the planner.  Results are
unchanged; only EXPLAIN plan shapes differ.
@kongfanshen-0801 kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from 661a48e to e0e62f4 Compare June 30, 2026 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants