Monday, September 23, 2013

Monitoring PATROL Agent 9.x Status: Do I need AS_AVAILABILITY KM?

If you use PATROL to monitor your IT infrastructure, you will want to monitor the status of all your PATROL agents to make sure they are up and running. Although BMC recommends using the PATROL AS_AVAILABILITY KM to monitor PATROL agent status, there is actually a much simpler way: BPPM cell heartbeat events.

So the answer here is no. You don't need AS_AVAILABILITY KM to monitor PATROL agent 9.x status. As you are about to see, BPPM cell heartbeat events are automatic, inherit the cell's high availability, and need no extra PATROL agent configuration beyond pointing the agent at the cell. AS_AVAILABILITY KM was developed before BMC acquired the BPPM cell. It is still a great option if you have a 'PATROL only' environment without BPPM/BEM.

To use AS_AVAILABILITY KM, you need to configure the KM by selecting one PATROL agent as the 'pinger' and adding the other PATROL agents as 'pingees'. Every time you deploy a new PATROL agent or decommission an existing one, you need to change the AS_AVAILABILITY KM configuration. With BPPM cell heartbeat events, on the other hand, you don't need to go through extra steps to register each PATROL agent with the BPPM cell. As long as you set the pconfig variable "/EventSetup/Configuration/EventCells" in your PATROL agent 9.x to send PATROL events to a BPPM cell, that BPPM cell will automatically monitor the status of the PATROL agent.

If the 'pinger' in your AS_AVAILABILITY KM goes down, you won't be able to monitor the status of the other PATROL agents. To make AS_AVAILABILITY KM more robust, you would have to set up a second 'pinger' plus complicated logic to coordinate the two 'pingers' and avoid duplicate alerts. On the other hand, as long as your BPPM cell is set up for high availability, you don't need any extra steps to make the heartbeat events highly available. Your PATROL agent status will always be monitored by whichever cell in the H/A pair is active.

To get the most out of BPPM cell heartbeat events, I recommend rewording the event message, because the out-of-box message doesn't contain enough information: it does not say which agent stopped responding. When a PATROL agent goes down, you receive an event with out-of-box slots like this:
MC_CELL_HEARTBEAT_FAILURE;
  cell='PatrolAgent@server1@172.118.2.12:3181';
  msg='Monitored Cell is no longer responding';
  ...
END
You may want to reword the msg to 'PatrolAgent@server1@172.118.2.12:3181 is no longer responding', so that the affected agent is visible in the message itself. For the counterpart MC_CELL_HEARTBEAT_ON event, you may want to reword its message in a similar way.
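
The rewording itself is just a matter of folding the cell slot into the msg slot. Here is a minimal Python sketch of that transformation, only to make the logic concrete. In a real cell you would do this in a rule in the cell's knowledge base, not in Python. The function name, the dict representation of the event, and the wording used for MC_CELL_HEARTBEAT_ON are my own assumptions for illustration, not part of any BMC API.

```python
# Illustrative only: models an event as a plain dict of slots.
# reword_heartbeat_msg is a hypothetical helper, not a BMC API.

# Message suffix per event class. The FAILURE wording follows the post;
# the ON wording is an assumed example.
SUFFIX_BY_CLASS = {
    "MC_CELL_HEARTBEAT_FAILURE": "is no longer responding",
    "MC_CELL_HEARTBEAT_ON": "is responding again",
}


def reword_heartbeat_msg(event):
    """Return a copy of the event whose msg names the affected cell."""
    reworded = dict(event)
    suffix = SUFFIX_BY_CLASS.get(event.get("class"))
    cell = event.get("cell")
    # Leave other event classes, or events without a cell slot, untouched.
    if suffix and cell:
        reworded["msg"] = f"{cell} {suffix}"
    return reworded


failure = {
    "class": "MC_CELL_HEARTBEAT_FAILURE",
    "cell": "PatrolAgent@server1@172.118.2.12:3181",
    "msg": "Monitored Cell is no longer responding",
}
print(reword_heartbeat_msg(failure)["msg"])
```

The original event is left unmodified and any event that is not a heartbeat event passes through unchanged, which is the behavior you would want from the equivalent cell rule as well.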

