Wednesday, August 17, 2011

MySQL HA Agent Mini HowTo

Why This Post


While testing Yoshinori Matsunobu's MHA agent I found that although the wiki has very complete documentation, it was missing some details. This article intends to close that gap and bring up some issues to keep in mind when you do your own installation. At the end of the article I added a Conclusions section; if you're not interested in the implementation details but want to read my take on the project, feel free to jump straight to the end from here.

My Test Case


Most of our production environments can be simplified to match the MHA agent's simplest use case: 1 master with 2 or more slaves and at least one more slave in an additional tier:

Master A --> Slave B
         +-> Slave C --> Slave D

As noted in the documentation, in this case the MHA agent will be monitoring A, B & C only. I found that unless you have a dedicated manager node, a slave on the 3rd tier (Slave D above) is suitable for this role. All 4 servers were set up as VMs for my evaluation / tests, which makes it easier to simulate hard failure scenarios in a controlled environment. Once this is in place the fun begins.

1st Step: User Accounts


All the examples in the documentation use root to log in to MySQL and the OS. I prefer to create specific users for each application, so I created a dedicated MySQL user for the MHA agent and used the Linux mysql user (UID/GID = 27/27 in RedHat / CentOS).

MySQL Credentials

Reviewing the code, I was able to determine that the agent needs to run some privileged commands like SET GLOBAL variable, CHANGE MASTER TO ..., FLUSH LOGS ..., SHOW SLAVE STATUS, etc., and creates internal working tables to be used during the master failover. The easiest way to set it up was using:
GRANT ALL PRIVILEGES ON *.* TO 'mha_user'@'ip address'
IDENTIFIED BY 'password';
This should be repeated on all 4 servers using the IP addresses of all the potential manager nodes. Yes, it would be possible to use wildcards, but I consider restricting access to specific source addresses a safer practice.
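
For example, a minimal sketch of how that repetition could be scripted; the manager IP addresses and password below are placeholders and need to be replaced with your own:

#!/bin/bash
# Run on each of the 4 servers; every host that may run the MHA
# manager needs its own grant (IPs and password are placeholders).
MANAGER_IPS="192.168.1.13 192.168.1.14"
MHA_PASS='example_password'

SQL=""
for ip in $MANAGER_IPS; do
  SQL="$SQL GRANT ALL PRIVILEGES ON *.* TO 'mha_user'@'${ip}' IDENTIFIED BY '${MHA_PASS}';"
done
mysql -u root -p -e "$SQL"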

The MySQL replication user needs to be set up to connect from any other server in the cluster, since any of the slaves in the group could be promoted to master and end up with the rest of them connecting to it.
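
A hedged example, assuming made-up addresses 192.168.1.11 through 192.168.1.14 for the 4 servers and a replication user named repl; run it on every server:

# Placeholder addresses and password; repeat the grant for every
# server in the cluster so any of them can replicate from any other.
# (mysql will prompt for the root password on each pass)
for ip in 192.168.1.11 192.168.1.12 192.168.1.13 192.168.1.14; do
  mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'${ip}' IDENTIFIED BY 'repl_password';"
done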

Linux User

As I mentioned before, I use the default RedHat / CentOS definition for the mysql user. Keep in mind that if you installed from the official Oracle packages (ie: RPMs), they may not follow this convention, which could result in mismatched UIDs/GIDs between servers. The UIDs/GIDs for the mysql user and group have to be identical on all 4 servers. If this is not the case, you may use the following bash sequence/script as root to correct the situation:

#!/bin/bash
# stop mysql
/etc/init.d/mysql stop

# Change ownership of all files / directories to the numeric IDs
find / -user mysql -exec chown -v 27 {} \;
find / -group mysql -exec chgrp -v 27 {} \;

# Remove the old user / group. The user goes first, otherwise groupdel
# refuses to delete the user's primary group; once the user is gone,
# groupdel may harmlessly complain that the group no longer exists.
userdel mysql
groupdel mysql

# Add the new user / group with the expected UID / GID
groupadd -g 27 mysql
useradd -c "MySQL User" -g 27 -u 27 -r -d /var/lib/mysql mysql

# restart MySQL
/etc/init.d/mysql start
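
Once the script has run everywhere, a quick sanity check (the hostnames are placeholders; if root ssh access between servers isn't available, just run id mysql locally on each box):

# Every server should report uid=27(mysql) gid=27(mysql)
for h in server_a server_b server_c server_d; do
  echo -n "$h: "; ssh root@${h} 'id mysql'
done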

Once the mysql user is properly set up, you'll have to create password-less shared keys and authorize them on all the servers. The easiest way to do it is to create the key pair on one of them, copy the public key into the authorized_keys file under the /var/lib/mysql/.ssh directory and then copy the whole directory to the other servers.
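
A minimal sketch of that sequence, assuming /var/lib/mysql is the mysql user's home directory and that server_b, server_c and server_d are placeholder hostnames:

# As the mysql user on the first server (ie: sudo su - mysql)
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa          # password-less key pair
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# As root, copy the whole directory to the other servers and fix ownership
for h in server_b server_c server_d; do
  scp -r /var/lib/mysql/.ssh root@${h}:/var/lib/mysql/
  ssh root@${h} 'chown -R mysql:mysql /var/lib/mysql/.ssh'
done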

I use the mysql user to run the scripts since on most distributions it can't be used to log in directly and there is no need to worry about file permissions, which makes it a safe and convenient choice.

2nd Step: Follow The Documentation to Install and Configure


Once all the users have been properly set up, this step is straightforward. Check the Installation and Configuration sections of the wiki for more details.
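
For reference, the software is split into a node package (needed on every MySQL server) and a manager package (needed only where masterha_manager will run). A hedged sketch of an RPM-based install, with made-up version numbers; the wiki also covers the Perl dependencies and installing from source:

# On all 4 servers (A, B, C and D)
rpm -ivh mha4mysql-node-0.52-0.noarch.rpm

# On the manager node only (Server D in my case); it needs the node package as well
rpm -ivh mha4mysql-manager-0.52-0.noarch.rpm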

For the placement of the configuration files I deviated a little bit from documentation, but not much:

  1. Used a defaults file: /etc/masterha_default.cnf with access only for user mysql since it includes the MHA agent password:
    -rw------- 1 mysql mysql 145 Aug 11 16:36 masterha_default.cnf
  2. The application settings were placed under /etc/masterha.d/; this way they're easy to locate and won't clutter the /etc directory (a sketch of the resulting files is shown below).
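
A minimal sketch of what the two files could contain; the hostnames, passwords and working directories are made up, and the parameter names are the ones described in the wiki's Configuration section:

# /etc/masterha_default.cnf -- shared defaults, readable only by mysql
[server default]
user=mha_user
password=example_password
ssh_user=mysql
repl_user=repl
repl_password=example_repl_password
manager_workdir=/var/log/masterha
manager_log=/var/log/masterha/manager.log

# /etc/masterha.d/test.cnf -- one file per replication group
[server1]
hostname=server_a
[server2]
hostname=server_b
[server3]
hostname=server_c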
For simplicity, I didn't include any of the optional scripts and checks (ie: secondary check) in the configuration. You may want to check the documentation and source code of these scripts. Some of them are not even code complete (ie: master_ip_failover). Unless you are implementing some of the more complicated use cases, you won't even need them. If you do, you'll need to write your own following the examples provided with the source code.

Once you have everything in place, run the following checks as the mysql user (ie: sudo su - mysql):
  1. masterha_check_ssh: Using my configuration files the command line looks like:
    masterha_check_ssh --conf=/etc/masterha_default.cnf --conf=/etc/masterha.d/test.cnf
  2. masterha_check_repl: This test will determine whether the agent can identify all the servers in the group and the replication topology. The command line parameters are identical to the previous step.

Both should show an OK status at the end. All utilities have verbose output, so if something goes wrong it's easy to identify the issue and correct it.

3rd Step: Run the Manager Script


If everything is OK, on the MHA node (Server D in my tests) run the following command as user mysql (ie: sudo su - mysql):

masterha_manager --conf=/etc/masterha_default.cnf --conf=/etc/masterha.d/test.cnf

You have to keep in mind that should the master fail, the agent will fail over to one of the slaves and stop running; this way it avoids split-brain situations. You will either have to build the intelligence into the application to connect to the right master after a failover, or use a virtual IP. In both cases you might need to use customized IP failover scripts. The documentation provides more details.

Read the section about running the script in the background to choose the method that best fits your practice.
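
One common way to do that, using the same configuration files as above and a made-up log location:

nohup masterha_manager --conf=/etc/masterha_default.cnf --conf=/etc/masterha.d/test.cnf \
    < /dev/null >> /var/log/masterha/manager.log 2>&1 &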

You will have to configure the notification script to get notified of the master failure. The failed server will have to be removed from the configuration file before re-launching the manager script, otherwise it will fail to start.

You can then restart the failed server, set it up as a slave connected to the new master and reincorporate it into the replication group using masterha_conf_host.
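
A hedged sketch of that last step; the hostnames and replication coordinates are placeholders (in practice you'd take them from the new master or rebuild the box from a fresh backup), and the exact options of masterha_conf_host are described on the wiki:

# On the recovered server: point it at the new master and start replication
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='server_c', MASTER_USER='repl',
  MASTER_PASSWORD='repl_password', MASTER_LOG_FILE='mysql-bin.000010', MASTER_LOG_POS=4;
  START SLAVE;"

# On the manager node: add the host back into the application config
masterha_conf_host --command=add --conf=/etc/masterha.d/test.cnf --hostname=server_a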

Conclusion


This tool solves a very specific (and painful) problem: making sure all the slaves are in sync, promoting one of them and changing the configuration of all the remaining slaves to replicate off the new master, and it does all of this fairly quickly. The tool is simple and reliable and requires very little overhead. It's easy to see it is production ready.

The log files are pretty verbose, which makes it really easy to follow in great detail all the actions the agent took when failing over to a slave.

I recommend that any potential users start with a simple configuration and add the optional elements gradually until it fits their infrastructure needs.

Although the documentation is complete and detailed, it takes some time to navigate and to put all the pieces of the puzzle together.

I would like the agent to support master-master configurations. This way it would minimize the work to re-incorporate the failed server into the pool. Yoshinori, if you're reading this, know that I'll volunteer to test master-master if you decide to implement it.

2 comments:

  1. Hi Gerry,

    Thanks for trying MHA and writing reports!
    Not only you, but also some other users have requested multi-master support in MHA so I've decided to implement it.
    If you have any specific request for multi-master, would you please comment at the below issue ticket?
    https://github.com/yoshinorim/MySQL-MasterHA-Manager/issues/5
