Wednesday, September 22, 2010

A Replication Surprise

While working on a deployment we came across a nasty surprise. In hindsight it was avoidable, but it never crossed our minds it could happen. I'll share the experience so when you face a similar situation, you'll know what to expect.

Scenario

To deploy the changes, we used a pair of servers configured to replicate with each other (master-master replication). There are many articles that describe how to perform an ALTER TABLE with minimum or no downtime using MySQL replication. The simple explanation is:
  1. Set up a passive master of the database you want to modify the schema. 
  2. Run the schema updates on the passive master.
  3. Let replication to catch up once the schema modifications are done.
  4. Promote the passive master as the new active master.
The details to make this work will depend on each individual situation and are too extensive for the purpose of this article. A simple Google search will point you in the right direction.

The Plan

The binlog_format variable was set to MIXED. While production was still running on the active master, we stopped replication from the passive to the active master so we would still get all the DML statements on the passive master while running the alter tables. Once the schema modifications were over, we could switch the active and passive masters in production and let the new passive catch up with the table modifications once the replication thread was running again.

The ALTER TABLE statement we applied was similar to this one:
ALTER TABLE tt ADD COLUMN cx AFTER c1;
There were more columns after cx and c1 was one of the first columns. Going through all the ALTER TABLE statements takes almost 2 hour, so it was important to get the sequence of event right.

Reality Kicks In

It turns out that using AFTER / BEFORE or changing column types broke replication when it was writing to the binlog files in row based format, which meant that we couldn't switch masters as planned until we had replication going again. As a result we had to re-issue an ALTER TABLE to revert the changes and then repeat them without the AFTER / BEFORE.

The column type change was trickier and could've been a disaster, fortunately this happened on a small table (~400 rows which meant the ALTER TABLE took less than 0.3sec). In this case we reverted the modification on the passive master and run the proper ALTER TABLE on the active master. Should this have happened with a bigger table, there was no other alternative than either rollback the deployment or deal with the locked table while the modification happened.

Once this was done we were able to restart the slave threads, let it catch up and and everything was running as planned ... but with a 2hr delay.

Unfortunately, using STATEMENT replication wouldn't work in this case for reasons that would need another blog article to explain.

Happy Ending

After the fact, I went back to the manual and I found this article: Replication with Differing Table Definitions on Master and Slave. I guess we should review the documentation more often, the changes happened after 5.1.22. I shared this article with the development team, so next time we won't have surprises.