Skip to content

[SS-187] Add support for replication from a physical postgres replica#37020

Open
peterdukelarsen wants to merge 6 commits into
MaterializeInc:mainfrom
peterdukelarsen:pl/physical-replication-validation
Open

[SS-187] Add support for replication from a physical postgres replica#37020
peterdukelarsen wants to merge 6 commits into
MaterializeInc:mainfrom
peterdukelarsen:pl/physical-replication-validation

Conversation

@peterdukelarsen

@peterdukelarsen peterdukelarsen commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Motivation

We would like to be able to set up replication against a physical Postgres replica. This fails hard today because you can't read pg_current_wal_lsn from a standby.

Relevant issue: https://linear.app/materializeinc/issue/SS-187/replication-from-postgres-replica

This is adapted from #36923.

Description

  • Record during purification whether the connection is to a physical replica
  • Use the appropriate method of loading the LSN depending on whether it's a physical replica or not

Tradeoffs to call out

  • Replicating from a primary without hot_standby_feedback=on on the replica risks snapshots timing out (potentially quickly) and increases the risk of the replication slot being terminated.
  • In the test case where we stall, we'll hold a slot open on the replica. With hot_standby_feedback=on this risks bloating the disk of the primary. I think setting max_slot_wal_keep_size to a reasonably high level mitigates the scariest versions of this. This style of tradeoff seems somewhat similar to existing trade-offs around max_slot_wall_keep_size settings.
  • RTR is no longer accurate relative to the primary postgres replica. This should be ok and similar to cascading logical replication which we support today.
  • Materialize sources will require a fresh snapshot in the case of failover -- this is similar to logical replication today where a timeline id change produces an error. This protects us from corruption.

Verification

  • Testdrive tests covering reading from a replica, basic RTR queries, and ensuring we stall if the same replica is recovered from a more recent backup

@peterdukelarsen peterdukelarsen force-pushed the pl/physical-replication-validation branch from 5ffe092 to 39d0ca8 Compare June 15, 2026 17:07
@peterdukelarsen peterdukelarsen force-pushed the pl/physical-replication-validation branch from f3ddf99 to 2c96a04 Compare June 15, 2026 20:47
@peterdukelarsen peterdukelarsen changed the title Draft postgres physical replication validation [SS-187] Add support for replication from a physical postgres replica Jun 15, 2026
@peterdukelarsen peterdukelarsen marked this pull request as ready for review June 15, 2026 23:38
@peterdukelarsen peterdukelarsen requested review from a team as code owners June 15, 2026 23:38
}
}

async fn ensure_physical_replica(

@peterdukelarsen peterdukelarsen Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is modeled after how we check the "timeline_id" and will proactively error out the source if the replica status changes. My biggest concern would be if there's an unexpected transient moment where the primary returns true for pg_is_in_recovery, which would make the failure permanent.

That said, I'm pretty confident that a primary server should never return true for is_physical_replica -- even during crash recovery when it would be true, it doesn't accept connections.

All that said, this is a relatively risky point and I could be convinced to downgrade to a transient error, or to just log an error/warning to derisk the case where we unexpectedly get a true. I could also branch based on expected_is_physical_replica and be more restrictive only when it was recorded as a physical replica initially (the new case).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having done some research, this does not appear possible, so I have no objection leaving it as a definite error.

@martykulma martykulma left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! A couple of questions for inline. It doesn't look like there is test coverage for promoting the standby, is that possible to add?

(Some(is_physical_replica), Some(is_physical_replica_other)) => {
is_physical_replica == is_physical_replica_other
}
(None, Some(_)) => true,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this allow the case of (None, Some(true))? I read that as we went from upstream is not a replica (because it was not possible) to replica.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting point. I think starting out with an error seems reasonable and more defensive, so will do that - good catch.

Comment thread src/postgres-util/src/replication.rs Outdated

Err(PostgresError::Generic(anyhow::anyhow!(
"pg_current_wal_lsn() mysteriously has no value"
"WAL LSN mysteriously has no value"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If is_physical_standby path produces a None, should this call get_is_in_recovery(..) to determine if a failover has occurred and report a more descriptive error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll add that! I actually missed that pg_last_wal_replay_lsn is nullable. I assumed it would error like pg_current_wal_lsn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants