fix(replication): add proper handling for pool restarts during replication #915

Open

meskill wants to merge 3 commits into main from fix/replication-pool-shutdown

Conversation

@meskill meskill (Contributor) commented Apr 20, 2026

No description provided.

Comment thread: pgdog/src/backend/pool/pool_impl.rs (Outdated)
port = addr.port,
database = %addr.database_name,
user = %addr.user,
"pool offline: connection pool shut down"
@levkk levkk (Collaborator) Apr 20, 2026

You don't want to log that. This "shutdown" isn't actually a shutdown; it just destroys this object, and we replace it with a brand-new one atomically. Logging it could make people think we are shutting down, causing panic :D
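The atomic-replacement pattern described here can be sketched like this (a minimal illustration with hypothetical names, not pgdog's actual types): readers holding a clone of the old pool keep it alive until they drop it, while new callers immediately see the replacement.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for pgdog's connection pool.
struct Pool {
    generation: u64,
}

// Holder that lets callers grab the current pool while a reload
// swaps in a brand-new one underneath them.
struct PoolHandle {
    inner: RwLock<Arc<Pool>>,
}

impl PoolHandle {
    fn new(pool: Pool) -> Self {
        Self {
            inner: RwLock::new(Arc::new(pool)),
        }
    }

    // Readers clone the Arc; their clone stays valid even after a swap.
    fn current(&self) -> Arc<Pool> {
        self.inner.read().unwrap().clone()
    }

    // The "shutdown" is just dropping the old Arc: once the last reader
    // releases its clone, the old object is destroyed. No process exits.
    fn replace(&self, pool: Pool) {
        *self.inner.write().unwrap() = Arc::new(pool);
    }
}
```

Under this reading, a "pool shut down" log line would fire on every reload, which is why a lower severity (or no log at all) makes sense.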

@meskill (Contributor, Author)

The intention was to add more logs to help trace similar issues in the future.

I can drop it or change the severity/message.

@levkk (Collaborator)

Yup makes sense. You can set this to trace level for sure.

@meskill (Contributor, Author)

Changed the logs.

// Validate all tables support replication before committing to
// what can be a multi-hour copy. A table with no primary key or
// unique replica-identity index cannot be replicated correctly.
for tables in self.tables.values() {
@levkk (Collaborator)

This is great, we should have done it a long time ago.

@meskill (Contributor, Author)

Updated this to show a more relevant error and to collect multiple errors.
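The "collect multiple errors" approach could look roughly like this (a sketch with hypothetical names, not the PR's actual code): validate every table and report all failures at once instead of aborting on the first.

```rust
// Each entry: (table name, whether it has a usable replica identity).
// In the real code this information would come from catalog inspection.
fn validate_tables(tables: &[(&str, bool)]) -> Result<(), Vec<String>> {
    let errors: Vec<String> = tables
        .iter()
        .filter(|(_, has_identity)| !has_identity)
        .map(|(name, _)| format!("table \"{name}\" has no replica identity"))
        .collect();

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Reporting all offending tables up front matters here because the copy that follows validation can run for hours; fixing one table per attempt would be painful.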

/// Check that the table supports replication.
///
/// Requires at least one column with a replica identity flag. Tables with
/// REPLICA IDENTITY FULL or NOTHING have no identity columns and fail here
@levkk (Collaborator)

I would double-check that. If this is true, we need a special query to detect REPLICA IDENTITY FULL and use that as the key.
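Detecting the replica identity setting is possible via the `pg_class` catalog: the `relreplident` column stores `'d'` (default, primary key), `'i'` (index), `'f'` (full), or `'n'` (nothing). A sketch of the detection logic (the enum and function names are hypothetical, not pgdog's API; the catalog query itself is standard PostgreSQL):

```rust
// Catalog query to read a table's replica identity (standard PostgreSQL).
const REPLICA_IDENTITY_QUERY: &str =
    "SELECT relreplident FROM pg_class WHERE oid = $1::regclass";

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplicaIdentity {
    Default, // 'd': the primary key, if any
    Index,   // 'i': a chosen unique index
    Full,    // 'f': the entire row acts as the key
    Nothing, // 'n': no identity at all
}

fn parse_replica_identity(relreplident: char) -> Option<ReplicaIdentity> {
    match relreplident {
        'd' => Some(ReplicaIdentity::Default),
        'i' => Some(ReplicaIdentity::Index),
        'f' => Some(ReplicaIdentity::Full),
        'n' => Some(ReplicaIdentity::Nothing),
        _ => None,
    }
}

// With FULL, every column forms the key, so the table could be replicated
// even without identity columns; with NOTHING it never can be.
fn replicable(identity: ReplicaIdentity, has_identity_columns: bool) -> bool {
    match identity {
        ReplicaIdentity::Full => true,
        ReplicaIdentity::Nothing => false,
        _ => has_identity_columns,
    }
}
```

This is why FULL and NOTHING need to be distinguished rather than rejected together, as the follow-up comments below discuss.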

@meskill (Contributor, Author)

It is; the added tests validate that. I'll investigate whether we can actually detect this.

@meskill (Contributor, Author)

So, this will require custom handling in both validation and query generation. Let me do this in a separate PR to keep this one focused.

/// the copy. Instead, only the cluster reference inside the existing
/// publisher is updated so that subsequent pool.get() calls target the
/// live pool rather than a stale, potentially-offline one.
pub(crate) async fn refresh_before_replicate(&mut self) -> Result<(), Error> {
@levkk (Collaborator)

I think we have code to do this already somewhere. If not, we should re-use this function wherever we run these 3 statements.

@meskill (Contributor, Author)

Yes, there was the usual refresh, but it was resetting the publication as well. I refactored this part and added tests, so now there is only a single refresh method.
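One shape such a unified refresh could take (a sketch; the struct, fields, and signature are hypothetical, not pgdog's actual API): a single method that always repoints the publisher at the live cluster, with the publication reset as an opt-in step.

```rust
// Hypothetical publisher state: which cluster generation it points at,
// and whether its publication is still considered set up.
struct Publisher {
    cluster_generation: u64,
    publication_created: bool,
}

impl Publisher {
    fn new(cluster_generation: u64) -> Self {
        Self {
            cluster_generation,
            publication_created: true,
        }
    }

    // Single refresh path: always swap in the live cluster reference so
    // later pool.get() calls never hit a stale, shut-down pool; only
    // reset the publication when explicitly asked to.
    fn refresh(&mut self, live_generation: u64, reset_publication: bool) {
        self.cluster_generation = live_generation;
        if reset_publication {
            self.publication_created = false;
        }
    }
}
```

Called with `reset_publication = false`, this matches the `refresh_before_replicate` behavior quoted above: the cluster reference changes, the publication survives.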

@codecov codecov Bot commented Apr 20, 2026

@levkk levkk linked an issue Apr 21, 2026 that may be closed by this pull request
@meskill meskill force-pushed the fix/replication-pool-shutdown branch from a67a82f to 70adcdc on April 23, 2026 13:44
@meskill meskill marked this pull request as ready for review April 23, 2026 15:23

Successfully merging this pull request may close these issues:

[Resharding] pool is shut down

2 participants