Tuesday, February 1, 2011

A Small Fix For mysql-agent

If you're already using an SNMP monitoring tool like OpenNMS, mysql-agent is a great way to add a number of graphs using Net-SNMP. However, mysql-agent has a small bug that drove me crazy. I will try to describe how I discovered it (and hence fixed it), since the process involved learning about SNMP, how to diagnose it and eventually, once all the pieces came together, how simple it is to write your own agents.

Although the exact versions are not that important, for the sake of completeness: we were using CentOS 5.5, MySQL 5.5.8 Community RPMs, Net-SNMP version 5.3.22 and OpenNMS Web Console 1.8.7.

The Problem

I followed the directions on the mysql-agent blog only to find that I was facing the only open issue listed on mysql-agent's GitHub repository (spoiler alert: the solution is at the bottom). The setup has several components, which makes it difficult to diagnose:
  • mysql-agent
  • snmpd + agentx
  • OpenNMS server
Running snmpwalk on the MySQL host, as suggested in the mysql-agent article, worked fine (as far as we could tell). However, OpenNMS wasn't getting the data and the graphs weren't showing up.

It turns out that, once you have completed the OpenNMS configuration as described in the article, it's a good idea to run snmpwalk remotely as well, from the server running OpenNMS. You need to specify your MySQL hostname instead of localhost:
snmpwalk -m MYSQL-SERVER-MIB -v 2c -c public mysql-host enterprises.20267

In our case, it failed. Unfortunately, the logs didn't offer much information; all we could tell was that whatever was failing was inside agentx.

The Alternative

Since the NetSNMP Perl class hides a lot of the details of the Net-SNMP API, we decided to write the agent with an alternative method: pass_persist. The beauty of this method is that you only need to write a filter script: SNMP requests come in through standard input (stdin) and the responses are printed to standard output (stdout). As a result, the agent can be tested straight from the command line before hooking it into snmpd. A nice article about pass_persist can be found here. The pass_persist protocol is fully documented in the snmpd.conf man page.
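
To give an idea of how little such a filter needs, here is a minimal sketch in plain shell. It is not the mysql-agent code: the two OIDs and values are simply the ones from the sample session further down, the function names and the --serve switch are made up for this example, and getnext is deliberately simplified (it only recognises the subtree root and exact matches).

```shell
#!/bin/sh
# Illustrative pass_persist responder; NOT the mysql-agent code.
# OIDs and values mirror the sample session in this post.
BASE=".1.3.6.1.4.1.20267.200.1"
OID1="$BASE.1.0"; VAL1="21"
OID2="$BASE.2.0"; VAL2="16"

# Print the three-line answer snmpd expects: OID, type, value.
answer() {
    printf '%s\nstring\n%s\n' "$1" "$2"
}

# get: answer only on an exact OID match, otherwise NONE.
handle_get() {
    case "$1" in
        "$OID1") answer "$OID1" "$VAL1" ;;
        "$OID2") answer "$OID2" "$VAL2" ;;
        *)       echo "NONE" ;;
    esac
}

# getnext: return the entry that follows the requested OID.
# Simplified: only the subtree root and exact matches are recognised.
handle_getnext() {
    case "$1" in
        "$BASE") answer "$OID1" "$VAL1" ;;
        "$OID1") answer "$OID2" "$VAL2" ;;
        *)       echo "NONE" ;;
    esac
}

# Main loop: one command per line, with the OID on the following line.
# Commands are case sensitive (PING upper case, get/getnext lower case).
serve() {
    while IFS= read -r cmd; do
        case "$cmd" in
            PING)    echo "PONG" ;;
            get)     IFS= read -r oid; handle_get "$oid" ;;
            getnext) IFS= read -r oid; handle_getnext "$oid" ;;
            set)     IFS= read -r oid; IFS= read -r ignored; echo "not-writable" ;;
            *)       echo "NONE" ;;
        esac
    done
}

# Enter the loop only when asked to (hypothetical switch), so the
# functions can also be tried on their own.
if [ "${1:-}" = "--serve" ]; then serve; fi
```

Run with --serve, you can type PING, get or getnext at the prompt and watch the answers, which is exactly the kind of command-line testing described above. A real agent would need proper OID ordering for getnext and real data behind the table.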

To follow this route we had to tweak the script a little. The tweaks included:
  • No daemonizing: since the script uses stdin/stdout, it needs to run interactively.
  • All values are returned as strings. It was the only workaround we found for 64-bit values, which otherwise weren't interpreted correctly.
  • While you run it interactively, stderr needs to be redirected to a file so that it doesn't break the script's returned values (add 2>/tmp/agent.log to the end of the command line).
  • The SNMP::Persist Perl module handles the SNMP protocol.
Once the changes were implemented (I promise to publish the alternative mysql-agent script after some clean up) these are the steps I followed to test it (for now I'll leave the -v option out, along with the stderr redirection).
  1. Invoke the agent as you would've done originally, keeping in mind that now it'll run interactively. On your MySQL server:
    mysql-agent-pp -c /path/to/.my.cnf -h localhost -i -r 30
  2. Test whether the agent is working properly (you type PING, the script answers PONG):
    PING
    PONG
  3. Does it actually provide the proper values? (You type the command and the OID; the script answers with the OID, the type and the value.)
    get
    .1.3.6.1.4.1.20267.200.1.1.0
    .1.3.6.1.4.1.20267.200.1.1.0
    Counter32
    21
    getnext
    .1.3.6.1.4.1.20267.200.1.1.0
    .1.3.6.1.4.1.20267.200.1.2.0
    Counter32
    16
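
The same checks can be scripted instead of typed. The sketch below assumes the agent exits once its stdin is closed, which I have not verified for every agent; check_ping and query_oid are helper names made up for this example.

```shell
#!/bin/sh
# Smoke test for a pass_persist agent (sketch; helper names are made up).

# Send PING to the agent command given as arguments and expect PONG back.
# stderr is discarded so it cannot get mixed into the reply.
check_ping() {
    reply=$(printf 'PING\n' | "$@" 2>/dev/null | head -n 1)
    [ "$reply" = "PONG" ]
}

# Send "get" or "getnext" for one OID and print the agent's raw reply.
# Usage: query_oid get|getnext OID agent-command [agent-args...]
query_oid() {
    verb=$1; oid=$2; shift 2
    printf '%s\n%s\n' "$verb" "$oid" | "$@" 2>/dev/null
}
```

For example, check_ping mysql-agent-pp -c /path/to/.my.cnf -h localhost -i -r 30 && echo OK would repeat step 2, and query_oid get .1.3.6.1.4.1.20267.200.1.1.0 followed by the same agent command would repeat the first half of step 3.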
Note that case matters: PING must be in upper case, while get and getnext must be in lower case. Once you know it works, you'll need to add the pass_persist line to the snmpd.conf file and restart snmpd:
# Line to use the pass_persist method
pass_persist .1.3.6.1.4.1.20267.200.1 /usr/bin/perl /path/to/mysql-agent -c /path/to/.my.cnf -h localhost -i -r 30
Now execute snmpwalk remotely and if everything looks OK, you're good to go.

On our first runs, snmpwalk failed after the 31st value. I retried those specific values, and a few of the ones after them, with get and getnext, and it became obvious that for some of them the responses weren't the expected ones.

The Bug and The Fix

So now, having identified the failing values, it was time to dig into the source code.

First came the data gathering portion, which fortunately is well documented inside the source code. I found ibuf_inserts and ibuf_merged as the 31st and 32nd values (note that with get you can check other values further down the list, which I did to confirm that the issue was specific to some variables and not a generic problem). A little grepping revealed that these values were populated from the SHOW INNODB STATUS output, which in 5.5 didn't include the line expected by the program logic; hence, the corresponding values stayed undefined. A patch to line 794 of the original script fixed this particular issue by setting the value to 0 for undefined values.

794c794
<             $global_status{$key}{'value'} = $status->{$key};
---
>             $global_status{$key}{'value'} = (defined($status->{$key}) && $status->{$key} ne '' ? $status->{$key} : 0);
This fix can be used for both the original script and the new pass_persist one. I already reported it upstream on GitHub.
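
The same guard carries over to an agent written in any other language. As a rough shell sketch (the "name value" input format and the status_value helper are assumptions made for illustration, not part of mysql-agent):

```shell
#!/bin/sh
# Look up key $1 in "name value" lines read from stdin and print its value.
# Prints 0 when the key is missing or its value is empty, so the agent
# never hands an undefined value back to snmpd.
status_value() {
    key=$1
    found=$(awk -v k="$key" '$1 == k { print $2; exit }')
    printf '%s\n' "${found:-0}"
}
```

Feeding it a status dump that lacks ibuf_merged, for instance, yields 0 for that key instead of an empty line.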

The original script still failed, though. OpenNMS issues getbulk requests (explained in the Net-SNMP documentation), which agentx failed to convert into getnext requests in our setup. This can be reproduced using snmpbulkwalk instead of snmpwalk (note: it took some tcpdump + Wireshark tricks to catch the getbulk requests). The current beta of the pass_persist version of mysql-agent has been in place for a while without issues.
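
A quick way to see the difference is to run both walks against the same host and compare how many lines come back. This is only a sketch: compare_walks is a made-up helper, and the community string, MIB name and subtree are the ones from the snmpwalk command earlier in this post.

```shell
#!/bin/sh
# Compare how many lines a getnext-based walk and a getbulk-based walk return.
# Returns success only when both walks return the same number of lines.
compare_walks() {
    host=$1
    next_count=$(snmpwalk -m MYSQL-SERVER-MIB -v 2c -c public "$host" enterprises.20267 2>/dev/null | wc -l | tr -d ' ')
    bulk_count=$(snmpbulkwalk -m MYSQL-SERVER-MIB -v 2c -c public "$host" enterprises.20267 2>/dev/null | wc -l | tr -d ' ')
    echo "getnext: $next_count, getbulk: $bulk_count"
    [ "$next_count" -eq "$bulk_count" ]
}
```

Running compare_walks mysql-host from the OpenNMS server and getting two different counts would point at the getbulk handling rather than at the data itself.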

Conclusion

I'm not describing the whole process, since it was long and complicated, but I learned a few things along the way that I'd like to point out.

Look Around Before Looking for New Toys

If you're using OSS, you may already have in house most of what you need. This project started when we decided to use OpenNMS (already in place to monitor our infrastructure) and wanted to add to it the MySQL data we wanted to monitor closely. A simple Google search pointed us to mysql-agent right away.

Embrace OSS

All the tools that we used in this case are Open Source, which made it extremely easy to inspect the source code when pertinent, try alternatives, benefit from the collective knowledge, make corrections and contribute them back to the community. A full evaluation of commercial software, plus the interaction with tech support to get to the point where we needed a patch, would've been as involved as this one, and the outcome wouldn't have been guaranteed either. I'm not against commercial software, but you need to evaluate whether it will add any real value compared to the open source alternatives.

SNMP is Your Friend

Learning about the SNMP protocol, and in particular the pass_persist method, was very useful. It took the mystery out of it, and writing agents in any language (even bash) is far from difficult. I'm looking forward to going deeper into MySQL monitoring using this technology.

I'm hoping this long post encourages you to explore the use of SNMP monitoring for MySQL on your own.

Credit: I need to give credit to Marc Martinez who did most of the thinking and kept pointing me in the right direction every time I got lost.

NOTE: I'm not entirely satisfied with the current pass_persist version of mysql-agent I have in place, although it gets the job done. Once I have the reviewed version, I plan ... actually promise to publish it either as a branch of the existing one or separately.

5 comments:

  1. Thanks for your blog post and fix.

    Mysql-snmp supports bulk get without issue, but unfortunately you are running a really old Net Snmp version which contains a bug in the agentx protocol that chokes on Counter64.
    That's the reason the README says this:
    "This package requires Net-SNMP version 5.4.2 or better."

    Failing to use this version of Net-SNMP (or a newer one) prevents mysql-snmp from running properly. (Or maybe one of my patches that fix those issues can be applied to 5.3.)

    But I'm glad you found another way :)

    masterzen

  2. Pouet, I'm glad you liked it :)

    Getting the new Net-SNMP running on our production servers opens a whole new can of worms, so if we wanted to move forward with the original goal, we needed an alternative solution.

    I also found out that there is a fix for non-initialized variables in a newer branch in the source tree.

  3. Hello, I would very much appreciate it if you could post your code somewhere. I could definitely reuse your solution, as I cannot get snmpwalk to work either.

  4. @Hatim, thank you for your feedback. Our current modifications outgrew a simple patch, so I'm looking at the code to get it published. It will probably be part of a new article during the next few weeks. Stay tuned.

  5. Hello,

    Any idea on the availability of your changes? It would really help me out here, as it is not working out of the box for me on my Mac server.

    I would like to get rid of MySQL agents as soon as possible.
