[SS-187] Add support for replication from a physical postgres replica#37020
[SS-187] Add support for replication from a physical postgres replica#37020peterdukelarsen wants to merge 6 commits into
Conversation
5ffe092 to
39d0ca8
Compare
f3ddf99 to
2c96a04
Compare
| } | ||
| } | ||
|
|
||
| async fn ensure_physical_replica( |
There was a problem hiding this comment.
This check is modeled after how we check the "timeline_id" and will proactively error out the source if the replica status changes. My biggest concern would be if there's an unexpected transient moment where the primary returns true for pg_is_in_recovery, which would make the failure permanent.
That said, I'm pretty confident that a primary server should never return true for is_physical_replica -- even during crash recovery when it would be true, it doesn't accept connections.
All that said, this is a relatively risky point and I could be convinced to downgrade to a transient error, or to just log an error/warning to derisk the case where we unexpectedly get a true. I could also branch based on expected_is_physical_replica and be more restrictive only when it was recorded as a physical replica initially (the new case).
There was a problem hiding this comment.
Having done some research, this does not appear possible, so I have no objection leaving it as a definite error.
martykulma
left a comment
There was a problem hiding this comment.
Looking good! A couple of questions for inline. It doesn't look like there is test coverage for promoting the standby, is that possible to add?
| (Some(is_physical_replica), Some(is_physical_replica_other)) => { | ||
| is_physical_replica == is_physical_replica_other | ||
| } | ||
| (None, Some(_)) => true, |
There was a problem hiding this comment.
Should this allow the case of (None, Some(true))? I read that as we went from upstream is not a replica (because it was not possible) to replica.
There was a problem hiding this comment.
Oh, interesting point. I think starting out with an error seems reasonable and more defensive, so will do that - good catch.
|
|
||
| Err(PostgresError::Generic(anyhow::anyhow!( | ||
| "pg_current_wal_lsn() mysteriously has no value" | ||
| "WAL LSN mysteriously has no value" |
There was a problem hiding this comment.
If is_physical_standby path produces a None, should this call get_is_in_recovery(..) to determine if a failover has occurred and report a more descriptive error?
There was a problem hiding this comment.
Good catch, I'll add that! I actually missed that pg_last_wal_replay_lsn is nullable. I assumed it would error like pg_current_wal_lsn.
Motivation
We would like to be able to set up replication against a physical Postgres replica. This fails hard today because you can't read pg_current_wal_lsn from a standby.
Relevant issue: https://linear.app/materializeinc/issue/SS-187/replication-from-postgres-replica
This is adapted from #36923.
Description
Tradeoffs to call out
hot_standby_feedback=onon the replica risks snapshots timing out (potentially quickly) and increases the risk of the replication slot being terminated.hot_standby_feedback=onthis risks bloating the disk of the primary. I think setting max_slot_wal_keep_size to a reasonably high level mitigates the scariest versions of this. This style of tradeoff seems somewhat similar to existing trade-offs around max_slot_wall_keep_size settings.Verification