Vitess is a solution that allows you to infinitely scale MySQL while providing clients and apps with a single logical view of the fleet of MySQL instances comprising any number of Keyspaces and Shards. Vitess also provides the cluster and data management tools that make it possible to manage a massive cluster and perform complex workflows using VReplication, such as MoveTables and Reshard.
Data is typically one of the most critical assets within an organization. As such, an operator needs to be able to verify the correctness of this data, in particular as the data is moved around or otherwise transformed. For example, operators have wanted a way to verify data consistency after replicating data from one MySQL instance to another, or after dumping a table from one instance and loading it into another. However, even for a single table in these simplest of cases, performing a safe, reliable, lightweight, and performant online diff between two MySQL instances is a surprisingly difficult problem. Due to the challenges involved, there have been few attempted general solutions, with the most notable being Percona's pt-table-checksum.
With Vitess, the need for a data diff tool is even more pronounced because you'll be migrating data from your legacy systems into Vitess, migrating data across keyspaces, and performing a variety of other workflows. This is further complicated by the fact that these workflows may span MySQL versions and data centers, involve differing schemas between the source and target, and run over long time periods in which your data evolves. So it is critical to have a tool that can reliably perform a logical diff between the source and target of these workflows, in a timely manner, and without impacting production traffic.
Vitess has provided a solution, called VDiff, for diffing tables that are part of a VReplication workflow. I will walk through the basic algorithm, or flow, used for diffing each table in order to demonstrate the challenges involved and how we solved them in VDiff:
- vtctld selects tablets in the source and target shards to use for the comparison (one per shard on each side)
- On the target tablets: stop the VReplication workflow for the VDiff operation, to "freeze" the state, and record the current GTID position in the VStream
- On the source tablets:
  - wait for replication to catch up to at least where the target is (remember that the source instance may be a replica and the target a primary)
  - lock the table to get the current GTID_EXECUTED, which gives us a logical point in time that will correspond to the read view in our upcoming transaction (see the SQL sketch after this list)
  - issue START TRANSACTION WITH CONSISTENT SNAPSHOT
  - unlock the table, as we now have a consistent snapshot of the table data and the GTID metadata that are both at the same logical point in time with regards to the table we're diffing
- On the target tablets:
  - start VReplication UNTIL we have reached the GTID_EXECUTED position in the VStream which matches the one we saved when setting up the read view on the source
  - issue START TRANSACTION WITH CONSISTENT SNAPSHOT (remember that the state is "frozen" on the target tablet); now the target context is at the same logical point in time as the source for this table
- On the source and target tablets: issue SELECT <cols> FROM <table> ORDER BY <pkcols> (for deterministic ordering and to avoid a filesort)
- In vtctld: stream the results from those SELECTs, doing a merge sort from the shards, and compare the rows on both sides logically (as the schema may be different on either side), keeping a record of any differences seen
- On the target tablets: restart the VReplication workflow
- On the source and target tablets: close the open transaction with a ROLLBACK
- Finally, the vtctl client prints a report (to STDOUT) of the results
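To make the snapshot setup on the source concrete, here is a minimal SQL sketch of that sequence. The table and column names are hypothetical, and I've split the statements across two sessions to illustrate why the lock matters (in MySQL, beginning a transaction releases any table locks held by that same session); VDiff handles all of this internally.

```sql
-- Session 1: briefly block writes to the table being diffed.
LOCK TABLES t1 READ;

-- Session 1: record the logical point in time for this table.
SELECT @@GLOBAL.gtid_executed;

-- Session 2: open a read view while writes are still blocked, so the
-- snapshot corresponds exactly to the GTID set recorded above.
START TRANSACTION WITH CONSISTENT SNAPSHOT;

-- Session 1: release the lock. Writes resume, but session 2's read
-- view stays fixed at the recorded point in time.
UNLOCK TABLES;

-- Session 2: later, read the rows in deterministic primary key order.
SELECT pk_col, c1, c2 FROM t1 ORDER BY pk_col;
```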
For large tables, holding a transaction open on the source tablets can have a significant impact on normal query traffic, due to InnoDB MVCC needing to keep older versions of rows around if they are updated after the transaction started (reflected in innodb_history_list_length). For this reason, I would recommend always using REPLICA tablets for VDiff operations whenever you can (when the source is an unmanaged tablet, e.g. when moving from RDS into Vitess, you may only have a PRIMARY tablet available). You can control this using the --tablet_types=REPLICA flag for the VDiff command. In v14+ the default was changed to --tablet_types=in_order:RDONLY,REPLICA,PRIMARY.
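For example, assuming a hypothetical customer keyspace with a commerce2customer workflow and a vtctld listening on localhost:15999, a diff restricted to REPLICA tablets could be run as:

```shell
# Run a VDiff using only REPLICA tablets for the comparison, keeping
# the long-lived snapshot transaction off of the primaries.
vtctlclient --server=localhost:15999 VDiff -- \
  --tablet_types=REPLICA customer.commerce2customer
```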
The original version worked very well, but it suffered from some limitations that posed challenges in certain situations, such as when working with very large tables. For example, if you have over 1TiB of data that needs to be compared, the VDiff could take a week to complete. If during this time you had any failure, such as one of the MySQL connections getting closed (e.g. due to wait_timeout or net_write_timeout), then you'd have to start the entire operation over again from scratch.
We gathered feedback from Vitess users over the course of 2+ years as they used VDiff in production, and a set of underlying issues became clear:
- Fragility: any connection loss, process failure, failover, etc. would cause the VDiff to fail and need to be re-run
- Synchronous command: the vtctl client command would block until the VDiff completed, which posed challenges and required a stable machine where, e.g., a tmux session could be used for the client call
- vtctld as the controller: the Vitess cluster management daemon is generally a lightweight process used to coordinate complex operations that span many Vitess components. It's not designed for operations that span days and require the resources needed to compare 100s of GiBs of data
- Network traffic: the tablets on each side of the VDiff streamed their data to the vtctld process, which then compared the data. This generated a lot of network traffic, which could become a bottleneck and impact overall network bandwidth and latency. Keep in mind that it's common for the data involved to reside in 3 or more failure domains / availability zones
- No progress reporting: the VDiff could run for days without any indication of overall progress
- Execution time: the VDiff could take days or weeks to complete for very large tables, in large part because there was very little concurrency, with a single vtctld process doing the bulk of the work
We set out to create a new version of VDiff that addressed all of these issues.
We started by largely rearranging the existing VDiff code so that, rather than being managed and controlled by a single vtctld, it is managed and executed, in parallel, by the primary tablet in each shard on the target side. This offers parallelism while also reducing the amount of network traffic needed to perform the diff. The operation was also made asynchronous, with the VDiff show client command gathering and reporting the results of the VDiff operation from each of the target shards involved (see the example below).
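For instance, using the same hypothetical names as before, and the v2 invocation available in Vitess 15 (selected with the --v2 flag), creating a diff returns immediately and the results are fetched separately:

```shell
# Create the VDiff asynchronously; this returns a UUID right away.
vtctlclient --server=localhost:15999 VDiff -- --v2 \
  customer.commerce2customer create

# Later, gather state and results from each of the target shards.
vtctlclient --server=localhost:15999 VDiff -- --v2 \
  customer.commerce2customer show last
```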
We then made VDiffs resumable, so that if a failure occurs during the diff, the operation can be resumed from where it left off (see the example below). This also makes it possible to do a rolling or incremental VDiff, where you perform the VDiff immediately after a workflow completes and then again just before doing a cutover for added confidence, as there may be weeks between those two stages. From there we added support for auto-restarting a VDiff if any ephemeral/recoverable error occurs. This means that you can have process crashes, failovers, network issues, etc., and the VDiff will automatically recover and continue running.
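Resuming uses the UUID returned when the VDiff was created; the placeholder below stands in for it:

```shell
# Pick up a prior VDiff from where it left off rather than starting
# the comparison over from scratch.
vtctlclient --server=localhost:15999 VDiff -- --v2 \
  customer.commerce2customer resume <vdiff_uuid>
```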
We also added progress reporting, so that you have some idea of how much work the VDiff has done and how much is left, along with an ETA for when it's likely to complete (see the example below). This gives you greater peace of mind while a longer operation runs and better allows you to prepare for the next step once the VDiff completes.
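This progress information is included in the show output; a JSON format is also available, which is handy for scripting (again assuming the hypothetical names above):

```shell
# Report rows compared, progress percentage, and ETA for the most
# recent VDiff in JSON form.
vtctlclient --server=localhost:15999 VDiff -- --v2 --format=json \
  customer.commerce2customer show last
```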
There were a variety of other minor improvements as well. In total, we hope that this new version addresses the
major set of issues that users had and provides a solid base for us to continue making further improvements.
Vitess VReplication offers a set of
powerful features that allow users to manage data workflows when that data is spread across a large fleet of
MySQL instances. VDiff then provides an invaluable
tool for verifying the correctness of these complex operations, giving you confidence and peace of mind
as you execute the data operations required to better meet your evolving business needs and objectives over
time.
Please try out VDiff v2 in
Vitess 15.0 — where it's marked as experimental —
and provide feedback! We hope to mark it as GA/production-ready in the upcoming 16.0 release and your
feedback is invaluable. Special shout out to Arthur Schreiber @ GitHub
for providing a lot of great early feedback that's helping to make the feature better! ♥️
Happy data migrations! 🚀 🚀 🚀