feat(experimental): Add schema change support for BigQuery #499
Merged
Conversation
farazdagi (Contributor) approved these changes on Apr 17, 2026, leaving a comment:
Absolutely amazing work. I've done a first pass and the architecture looks solid.
Most of the comments are just nitpicks, so feel free to ignore when resolving.
Summary
This PR adds experimental schema-change support for BigQuery pipelines.
At a high level, the pipeline can now observe Postgres schema changes, persist versioned table schemas in its own state store, and use that information to decide what schema should exist in the destination at a given point in the replication stream.
How It Works
Schema changes are captured transactionally from Postgres with a DDL event trigger. When an ALTER TABLE happens, the trigger emits a logical replication message in the same transaction, so schema changes stay ordered with the row-level changes that follow.
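As a rough sketch of the mechanism (this is not the PR's actual trigger; the function name, event-trigger name, and the `etl.ddl` message prefix are all illustrative), an event trigger at `ddl_command_end` can emit a transactional logical message with `pg_logical_emit_message`:

```rust
// Hypothetical migration snippet, not the PR's actual trigger definition.
// The event trigger fires after ALTER TABLE completes and emits a
// transactional logical replication message, so the schema change is
// serialized into the WAL at the exact point it happened.
const CREATE_DDL_CAPTURE: &str = r#"
CREATE OR REPLACE FUNCTION etl_emit_ddl_message() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    cmd record;
BEGIN
    FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        -- transactional = true: the message commits (and is ordered)
        -- together with the DDL and any row changes in the transaction
        PERFORM pg_logical_emit_message(true, 'etl.ddl', cmd.object_identity);
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER etl_ddl_capture
    ON ddl_command_end
    WHEN TAG IN ('ALTER TABLE')
    EXECUTE FUNCTION etl_emit_ddl_message();
"#;
```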
The pipeline stores table schemas as versioned records keyed by snapshot_id. The initial schema is stored at 0/0, and later schema versions use the LSN associated with the DDL change. This gives the system deterministic schema indexing: every DDL message identifies the schema version by the LSN carried in the stream. On startup or recovery, the pipeline loads the latest schema version whose snapshot_id is less than or equal to the current flush position.

To avoid relying on the destination catalog as the source of truth, this PR replaces legacy table mappings with destination table metadata. That metadata stores the last applied schema version, the previous version, and a replication mask describing which columns are actively replicated. Destinations receive replicated schema information with events, so they can diff and apply schema changes directly instead of reloading schema ad hoc.
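A minimal sketch of the versioned lookup described above, assuming an in-memory map and illustrative type names (the real store persists these records; none of these names are the PR's actual API):

```rust
// Hypothetical sketch of the versioned schema lookup; names are
// illustrative, not the PR's actual types.
use std::collections::BTreeMap;

/// LSN-backed version key. The initial schema from the first load
/// is stored at 0/0 (encoded here as 0).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SnapshotId(u64);

struct TableSchema {
    columns: Vec<String>, // simplified; real schemas carry full column metadata
}

struct SchemaStore {
    versions: BTreeMap<SnapshotId, TableSchema>,
}

impl SchemaStore {
    /// On startup or recovery: load the latest schema version whose
    /// snapshot_id is <= the current flush position. This is deterministic
    /// because every DDL message carries the LSN of its change.
    fn schema_at(&self, flush_position: SnapshotId) -> Option<&TableSchema> {
        self.versions
            .range(..=flush_position)
            .next_back()
            .map(|(_, schema)| schema)
    }
}
```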
Key Design Points
Consistent Schema Loading
This PR makes schema loading consistent by using the same schema-description path for both the initial load and later schema updates. That means the schema shape the pipeline stores and the schema shape it later uses for diffing come from the same source of truth, which reduces drift between bootstrap and steady-state replication.
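In sketch form (hypothetical names, not the PR's API), the idea is that both paths go through one schema-description entry point:

```rust
// Hypothetical sketch: a single schema-description entry point shared by
// bootstrap and steady-state DDL handling, so both produce the same shape.
struct TableSchema {
    columns: Vec<String>,
}

trait SchemaSource {
    fn describe_table(&self, table: &str) -> TableSchema;
}

fn initial_load(source: &dyn SchemaSource, table: &str) -> TableSchema {
    source.describe_table(table) // same path as...
}

fn on_ddl_change(source: &dyn SchemaSource, table: &str) -> TableSchema {
    source.describe_table(table) // ...later schema updates
}
```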
Replication Masks
A key structure in the design is the replication mask. Stored schemas can contain the full table definition, but not every column is necessarily replicated because publications can apply column-level filtering. The replication mask tells the system which columns in a schema version are actually active for replication.
That separation is important: it lets the stored schema remain a complete table definition while the mask tracks the subset of columns that actually flows through replication, as the sketch below illustrates.
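```rust
// Hypothetical sketch of a replication mask; names are illustrative,
// not the PR's actual types.
struct TableSchema {
    columns: Vec<String>,
}

/// One flag per column in the stored schema version; true means the
/// column is actively replicated. Publications with column lists can
/// filter columns, so the stored schema may be wider than the stream.
struct ReplicationMask(Vec<bool>);

/// Project a full schema version down to its replicated columns.
fn replicated_columns<'a>(
    schema: &'a TableSchema,
    mask: &'a ReplicationMask,
) -> impl Iterator<Item = &'a str> + 'a {
    schema
        .columns
        .iter()
        .zip(mask.0.iter())
        .filter(|(_, active)| **active)
        .map(|(name, _)| name.as_str())
}
```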
Migration Assumption
The storage migrations in this PR are intentionally designed as a one-way upgrade to keep the rollout simple and fast. In practice, that means multiple pipelines pointing at the same source database cannot safely run with mixed old and new state-store code.
If a source database has more than one pipeline using these ETL tables, they need to be upgraded together. Otherwise, an older pipeline can stop working after the migrated schema is used by a newer pipeline, for example on the next schema-store write, state-store write, or schema change.
Crash-Safe Properties
The crash-safety model is based on the assumption that we always know which schema the destination should have by looking at persisted metadata, not by inspecting the destination itself.
The metadata stores which schema version is currently applied, which version was applied before it, and a status that signals whether a schema transition completed. Because schema versions are indexed deterministically by the LSN-backed snapshot_id, the system can recover by reloading the schema version for the current replication position and comparing it with the metadata for the destination.

Today, the status field is intentionally simple. In practice, it is mainly a signal for whether a schema transition finished cleanly or whether the destination should be treated as corrupted. If a destination table is left in applying, the current interpretation is that the schema change may have failed partway through and the destination is no longer trustworthy without cleanup or restart.

This is a pragmatic first step rather than a full repair model. Different destinations can later implement more sophisticated repair or reconciliation flows, but for now the engine mainly uses the status to detect potentially corrupted state and stop pretending the destination is healthy.
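A condensed sketch of that recovery decision, with hypothetical type names (SnapshotId stands in for the LSN-backed key, as in the earlier sketch):

```rust
// Hypothetical sketch of destination table metadata and the recovery
// check; names are illustrative, not the PR's actual types.
type SnapshotId = u64; // LSN-backed version key

#[derive(PartialEq, Debug)]
enum SchemaStatus {
    Applied,  // last schema transition completed cleanly
    Applying, // transition may have failed partway; destination is suspect
}

struct DestinationTableMetadata {
    applied_version: SnapshotId,          // schema version currently applied
    previous_version: Option<SnapshotId>, // version applied before it
    status: SchemaStatus,
}

/// Recovery consults persisted metadata plus the replication position,
/// never the destination catalog itself.
fn check_destination(
    meta: &DestinationTableMetadata,
    expected: SnapshotId,
) -> Result<(), String> {
    if meta.status == SchemaStatus::Applying {
        // A transition was in flight when we stopped: treat the
        // destination as potentially corrupted until cleaned up.
        return Err("schema transition did not complete; destination needs cleanup".into());
    }
    if meta.applied_version != expected {
        // Destination is intact but on a different version: diff the
        // two schema versions and apply the change before resuming.
        return Err(format!(
            "destination at version {}, stream expects {}",
            meta.applied_version, expected
        ));
    }
    Ok(())
}
```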
BigQuery still does not provide fully atomic multi-statement DDL, so schema application can fail partway through. This PR makes that state explicit through metadata such as applied and applying, but the exact recovery and repair semantics should be defined more clearly in follow-up work.

What Changed
- Capture ALTER TABLE changes in the replication stream with a DDL event trigger
- Store table schemas as versioned records keyed by snapshot_id
- Update the etl.table_columns layout, including primary_key_ordinal_position

Validation
cargo test -p etl-api --test main --all-features pipelines:: -- --nocapture

Next Steps
- Define recovery semantics for the applying state and partial schema application, including what cleanup or repair guarantees the engine should provide