In the last post, we discussed how to detect when the PATROL agent is unreachable, whether because the network is down, the server is down, or the PATROL agent itself is down. One of my readers asked me whether the same method can be used to detect a hung PATROL agent.
When the PATROL agent hangs, the agent process is still up and still has a heartbeat. Therefore the method we discussed in the last post cannot tell that the agent is hung. A hung PATROL agent simply stops collecting data.
The best way to detect a hung PATROL agent is to write your own script that checks when the last data collection happened. Pick a common parameter that is collected frequently, e.g., /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent on Windows. By default, CPUprcrProcessorTimePercent is collected every minute. When you query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time, you receive the time stamp of the last data collection as Epoch time. Compare this value with the current time stamp and you will know how long it has been since CPU data was last collected. If the gap is much longer than the collection interval (e.g., longer than 10 minutes), you can reasonably conclude that the PATROL agent is hung.
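The comparison itself can be sketched in a few lines of Python. This is only an illustration: it assumes you have already obtained the Epoch value of the /time query by some means, and the function names and the 10-minute threshold constant are my own choices, not anything defined by PATROL.

```python
import time

# Threshold taken from the example in the text: 10 minutes.
STALE_AFTER_SECONDS = 10 * 60


def seconds_since_collection(last_collected_epoch, now=None):
    """Return how many seconds have passed since the last data collection.

    last_collected_epoch is the Epoch time stamp returned by the /time query.
    now can be passed in for testing; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    return now - last_collected_epoch


def agent_looks_hung(last_collected_epoch, now=None,
                     threshold=STALE_AFTER_SECONDS):
    """Return True when the last collection is older than the threshold."""
    return seconds_since_collection(last_collected_epoch, now) > threshold
```

With a one-minute collection interval, a 10-minute threshold tolerates a few missed or delayed collections before the agent is flagged, which helps avoid false alarms on a busy server.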
You can use PatrolCli to query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time. PatrolCli is an executable installed automatically with the PATROL agent in the $PATROL_HOME/bin directory. It lets you connect to any PATROL agent in your network, provided you have the right password and permissions, and run a subset of PSL commands from the command line. Type 'PatrolCli' at the command line and use the 'help' command to see all the commands it supports.
Because you are trying to detect whether the PATROL agent itself is hung, it is best to use a mechanism outside of the PATROL framework to schedule your script and send out the alerts. For example, you can use cron on UNIX or a scheduled task on Windows to run your script periodically. When you detect that the PATROL agent is hung, you can send alert emails or use the msend command to report your alerts directly to the BPPM/BEM cell.
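For the email option, a rough Python sketch might look like the following. Everything specific here is an assumption made for illustration: the function names, the wording of the message, and the SMTP host are placeholders, and stale_seconds stands for whatever gap your own check measured between the last collection time stamp and the current time.

```python
import smtplib
from email.message import EmailMessage


def build_hang_alert(agent_host, stale_seconds, sender, recipient):
    """Build an alert email for an agent that has stopped collecting data."""
    minutes = int(stale_seconds // 60)
    msg = EmailMessage()
    msg["Subject"] = f"PATROL agent on {agent_host} may be hung"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"No CPU data has been collected on {agent_host} "
        f"for {minutes} minutes."
    )
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Send the alert through an SMTP relay (host name is a placeholder)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Keeping the message building separate from the sending makes it easy to swap the delivery step for an msend call if you prefer to report to the cell instead of sending email.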
I will be out of the country and won't be able to post for the next two weeks. This blog will resume on July 21, 2014. Have a great summer!
BPPM (BMC ProactiveNet Performance Management) or TrueSight Operations Management (the rebranded name) suite is the latest solution from BMC Software for enterprise system management. It combines the data analytic engine from ProactiveNet, the event processing engine from BMC Event Manager (BEM), and the server/application monitor from PATROL into one product. This blog is intended to share information and experience on TrueSight/BPPM implementation, customization, and integration.