Showing posts with label schema updates. Show all posts
Showing posts with label schema updates. Show all posts

Thursday, May 5, 2011

Some More Replication Stuff

Listening to the OurSQL podcast: Repli-cans and Repli-can’ts got me thinking, what are the issues with MySQL replication that Sarah and Sheeri didn’t have the time to include in their episode. Here’s my list:

Replication Capacity Index

This is a concept introduced by Percona in last year’s post: Estimating Replication Capacity which I revisited briefly during my presentation at this year’s MySQL Users Conference. Why is this important? Very simple: If you use your slaves to take backups, they might be outdated and will fall further behind during the backups. If you use them for reporting, your reports may not show the latest data. If you use it for HA, you may not start writing to it until the slave caught up.
Having said that, measuring replication capacity as you set up slaves is a good way to make sure that the slave servers will be able to catch up with the traffic in the master.

More On Mixed Replication

The podcast also discussed how mixed replication works and pointed to the general criteria that the server applies to switch to STATEMENT or ROW based. However there is one parameter that wasn’t mentioned and it might come back and haunt you: Transaction Isolation Level. You can read all about it in the MySQL Documentation: 12.3.6. SET TRANSACTION Syntax and in particular the InnoDB setting innodb_locks_unsafe for binlog.

Keep Binary Logs Handy

Today I found this article from SkySQL on Planet MySQL about Replication Binlog Backup, which is a really clever idea to keep your binary logs safe with the latest information coming out of the master. It offers a method of copying them without the MySQL server overhead. If you purge binary logs automatically to free space using the variable expire_logs_days, you will still have the logs when you need them for a longer time than your disk capacity on the master might allow.

Seconds Behind Master (SBM)

Again, another topic very well explained in the podcast, but here’s another case where this number will have goofy values. Lets say you have a master A that replicates master-master with server B and server C is a regular slave replicating off A. The application writes to A and B serves as a hot stand-by master.
When we have a deployment that requires DDL and/or DML statements, we break replication going from B to A (A to B keeps running to catch any live transactions) and apply the modifications to B. Once we verify that everything is working OK on B, we switch the application to write to B and restore replication going back to A. This offers a good avenue for rolling back in case the deployment breaks the database in any way (ie: rebuild B using the data in A). What we frequently see is, if the DDL/DML statement takes about 30min (1800 sec) on B, once we restore replication as explained, the slave C will show outrageous numbers for SBM (ie: >12hs behind, I really don’t know how does the SBM arithmetic works to explain this). So it’s a good idea to complement slave drifts monitoring with mk-heartbeat, which uses a timestamp to measure replication drifts.

Conclusion

This episode of the OurSQL podcast is a great introduction to replication and its quirks. I also believe that MySQL replication is one of the features that made the product so successful and wide spread. However, you need to understand its limitations if your business depends on it.

These are my $.02 on this topic, hoping to complement the podcast. I wanted to tweet my feedback to @oursqlcast, but it ended up being way more than 140 characters.

Wednesday, September 22, 2010

A Replication Surprise

While working on a deployment we came across a nasty surprise. In hindsight it was avoidable, but it never crossed our minds it could happen. I'll share the experience so when you face a similar situation, you'll know what to expect.

Scenario

To deploy the changes, we used a pair of servers configured to replicate with each other (master-master replication). There are many articles that describe how to perform an ALTER TABLE with minimum or no downtime using MySQL replication. The simple explanation is:
  1. Set up a passive master of the database you want to modify the schema. 
  2. Run the schema updates on the passive master.
  3. Let replication to catch up once the schema modifications are done.
  4. Promote the passive master as the new active master.
The details to make this work will depend on each individual situation and are too extensive for the purpose of this article. A simple Google search will point you in the right direction.

The Plan

The binlog_format variable was set to MIXED. While production was still running on the active master, we stopped replication from the passive to the active master so we would still get all the DML statements on the passive master while running the alter tables. Once the schema modifications were over, we could switch the active and passive masters in production and let the new passive catch up with the table modifications once the replication thread was running again.

The ALTER TABLE statement we applied was similar to this one:
ALTER TABLE tt ADD COLUMN cx AFTER c1;
There were more columns after cx and c1 was one of the first columns. Going through all the ALTER TABLE statements takes almost 2 hour, so it was important to get the sequence of event right.

Reality Kicks In

It turns out that using AFTER / BEFORE or changing column types broke replication when it was writing to the binlog files in row based format, which meant that we couldn't switch masters as planned until we had replication going again. As a result we had to re-issue an ALTER TABLE to revert the changes and then repeat them without the AFTER / BEFORE.

The column type change was trickier and could've been a disaster, fortunately this happened on a small table (~400 rows which meant the ALTER TABLE took less than 0.3sec). In this case we reverted the modification on the passive master and run the proper ALTER TABLE on the active master. Should this have happened with a bigger table, there was no other alternative than either rollback the deployment or deal with the locked table while the modification happened.

Once this was done we were able to restart the slave threads, let it catch up and and everything was running as planned ... but with a 2hr delay.

Unfortunately, using STATEMENT replication wouldn't work in this case for reasons that would need another blog article to explain.

Happy Ending

After the fact, I went back to the manual and I found this article: Replication with Differing Table Definitions on Master and Slave. I guess we should review the documentation more often, the changes happened after 5.1.22. I shared this article with the development team, so next time we won't have surprises.