Thursday, May 5, 2011

Some More Replication Stuff

Listening to the OurSQL podcast episode Repli-cans and Repli-can’ts got me thinking: what are the issues with MySQL replication that Sarah and Sheeri didn’t have time to include in their episode? Here’s my list:

Replication Capacity Index

This is a concept introduced by Percona in last year’s post, Estimating Replication Capacity, which I revisited briefly during my presentation at this year’s MySQL Users Conference. Why is this important? Very simple: if you use your slaves to take backups, the backups might be outdated, and the slaves will fall further behind while the backups run. If you use them for reporting, your reports may not show the latest data. If you use them for HA, you can’t start writing to a slave until it has caught up.
Having said that, measuring replication capacity as you set up slaves is a good way to make sure that the slave servers will be able to keep up with the write traffic coming from the master.
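One rough way to put a number on this is to pause replication on a slave for a known period, restart it, and time how long it takes to catch up. The sketch below assumes that approach and assumes the master's write rate stays roughly constant during the test. The function name and the formula are illustrative, not necessarily the exact definition used in Percona's post.

```python
def replication_capacity(pause_seconds: float, catchup_seconds: float) -> float:
    """Estimate how many times the master's write rate the slave can apply.

    Assumption: while catching up, the slave had to replay the writes made
    during the pause plus the writes made during the catch-up itself.
    """
    if pause_seconds < 0 or catchup_seconds <= 0:
        raise ValueError("pause must be >= 0 and catch-up time must be > 0")
    return (pause_seconds + catchup_seconds) / catchup_seconds


# Example: slave paused for 1 hour, caught up in 30 minutes.
print(replication_capacity(3600, 1800))
```

A result close to 1 means the slave has almost no headroom. A slave that never catches up during the test can't be measured this way at all, which is already an answer.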

More On Mixed Replication

The podcast also discussed how mixed replication works and pointed to the general criteria that the server applies to switch between STATEMENT and ROW based logging. However, there is one parameter that wasn’t mentioned and it might come back to haunt you: the transaction isolation level. You can read all about it in the MySQL documentation, 12.3.6. SET TRANSACTION Syntax, and in particular the InnoDB setting innodb_locks_unsafe_for_binlog.

Keep Binary Logs Handy

Today I found this article from SkySQL on Planet MySQL about Replication Binlog Backup, which is a really clever idea to keep your binary logs safe with the latest information coming out of the master. It offers a method of copying them without the MySQL server overhead. If you purge binary logs automatically to free space using the variable expire_logs_days, you will still have the copied logs when you need them, for longer than the disk capacity on the master might allow.

Seconds Behind Master (SBM)

Again, another topic very well explained in the podcast, but here’s another case where this number will have goofy values. Let’s say you have a master A that replicates master-master with server B, and server C is a regular slave replicating off A. The application writes to A, and B serves as a hot stand-by master.
When we have a deployment that requires DDL and/or DML statements, we break replication going from B to A (A to B keeps running to catch any live transactions) and apply the modifications to B. Once we verify that everything is working OK on B, we switch the application to write to B and restore replication going back to A. This offers a good avenue for rolling back in case the deployment breaks the database in any way (e.g., by rebuilding B from the data in A). What we frequently see is that, if the DDL/DML statement takes about 30 minutes (1800 sec) on B, once we restore replication as explained, slave C will show outrageous numbers for SBM (e.g., more than 12 hours behind; I really don’t know how the SBM arithmetic works to explain this). So it’s a good idea to complement SBM monitoring with mk-heartbeat, which uses a timestamp to measure replication drift.
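The timestamp idea can be sketched in a few lines. This is not mk-heartbeat's actual code, just an illustration of the arithmetic: something on the master keeps updating a timestamp in a replicated table, and the lag on a slave is the difference between the current time and the last timestamp that has arrived there. The function name is made up for the example, and it assumes the clocks on both servers are in sync.

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(replicated_ts: datetime, now: datetime) -> float:
    """Return how far behind the slave is, in seconds.

    replicated_ts: the latest heartbeat timestamp visible on the slave
                   (originally written on the master).
    now:           the current time on the slave.
    Small negative values caused by clock skew are clamped to zero.
    """
    return max(0.0, (now - replicated_ts).total_seconds())


# Example: the last heartbeat seen on the slave is 30 minutes old.
last_seen = datetime(2011, 5, 5, 12, 0, 0, tzinfo=timezone.utc)
current = datetime(2011, 5, 5, 12, 30, 0, tzinfo=timezone.utc)
print(heartbeat_lag_seconds(last_seen, current))
```

Because the value depends only on a row that travelled through replication, it doesn't rely on the slave's own bookkeeping the way SBM does.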

Conclusion

This episode of the OurSQL podcast is a great introduction to replication and its quirks. I also believe that MySQL replication is one of the features that made the product so successful and widespread. However, you need to understand its limitations if your business depends on it.

These are my $.02 on this topic, hoping to complement the podcast. I wanted to tweet my feedback to @oursqlcast, but it ended up being way more than 140 characters.