Recoverable, failover agnostic migrations
Vitess's managed schema changes offer failover-agnostic migrations with the online strategy (VReplication-based).
Normally, schema migrations are coupled to the MySQL server they started on. A gh-ost or pt-online-schema-change migration, as well as a plain direct migration, can only complete on the same server where it started. Any form of failover, whether planned or unplanned, either breaks the migration or makes it obsolete.
Migrations using the online strategy are agnostic to server promotion. A migration can begin on one primary tablet and complete on another tablet that was promoted to primary during the migration. In large part this is a direct result of the nature of VReplication.
An online migration automatically survives:
- A planned failover (via PlannedReparentShard)
- An emergency reparent (EmergencyReparentShard)
- An unexpected external reparent
This holds as long as no more than 10 minutes pass between the failure/demotion of the previous primary tablet and the promotion of the new primary tablet.
Behavior and limitations
Whether by planned operation or unplanned failure, a failover interrupts an online migration's VReplication stream while it is copying or applying data. VReplication persists the state of the data transfer transactionally with the transfer itself, so any replica has a consistent state of the migration, even if that replica lags behind the primary.
When a replica tablet is promoted to primary, it notices the VReplication stream, which is meant to be active and running, and sets up the connections and processes needed to resume the stream's work. Some retries may take place as the stream re-evaluates its source of data.
The Online DDL Scheduler detects the running stream and identifies it as having been created by a different tablet. It assumes ownership of the stream and follows its progress until completion.
The stream must be no more than 10 minutes stale; otherwise the scheduler marks the migration as failed.
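The scheduler's decision described in the two paragraphs above can be sketched roughly as follows. This is an illustration under stated assumptions, not Vitess's implementation: the Stream class, its owner_tablet and last_updated fields, the review_stream function, and the order of the checks are all hypothetical. Only the 10-minute threshold and the three outcomes come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the text; everything else here is a hypothetical model.
STALE_THRESHOLD = timedelta(minutes=10)


@dataclass
class Stream:
    migration_uuid: str
    owner_tablet: str       # hypothetical: tablet that currently owns the stream
    last_updated: datetime  # hypothetical: last time the stream recorded progress


def review_stream(stream, this_tablet, now):
    """Decide what a newly promoted primary does with a running stream."""
    # A stream more than 10 minutes stale causes the migration to fail.
    if now - stream.last_updated > STALE_THRESHOLD:
        return "fail"
    # A fresh stream created by another tablet is adopted by this one.
    if stream.owner_tablet != this_tablet:
        stream.owner_tablet = this_tablet
        return "adopt"
    # Already owned by this tablet: keep following its progress.
    return "continue"


now = datetime(2024, 1, 1, 12, 0, 0)
fresh = Stream("example-uuid", "tablet-100", now - timedelta(minutes=3))
stale = Stream("example-uuid", "tablet-100", now - timedelta(minutes=11))

first_review = review_stream(fresh, "tablet-101", now)
second_review = review_stream(fresh, "tablet-101", now)
stale_review = review_stream(stale, "tablet-101", now)
```

Here the fresh stream is adopted on the first review and simply continued on the second, while the stream that last recorded progress 11 minutes ago is marked as failed. Because adoption works the same way after every promotion, the sketch is consistent with a migration surviving repeated failovers.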
There is no limit on the number of failovers an online migration can survive.
No user action is required. Immediately after promotion/failover the migration will appear to make no progress. It is likely to show progress again within 1 or 2 minutes after promotion.