Monday, September 29, 2014

PatrolCli - Part 6: Re-sync parameter status with PATROL agents upon exiting blackout in BPPM cell

Event blackout rules and event blackout policies in BPPM cell are something we all rely on to suppress alerts during regularly scheduled maintenance windows.  Upon exiting the blackout period, if a PATROL parameter alert (e.g. a process down alert) is still present, what should you do?

If you choose to ignore it and the process is still down, no one will be notified.  PATROL agent generates an alarm event only once when a process goes down.  If the process went down during the blackout period and BPPM cell sent no notification, PATROL agent will not generate another alarm event while the process remains down after the blackout period has ended.

If you choose to send a notification for every suppressed alert in BPPM cell upon exiting blackout, you may send out lots of false alarms.  During the maintenance window, many PATROL agents may be restarted as the result of a server reboot or a PATROL configuration change.  A process that was previously down may be brought back up by the PATROL agent or server restart.  However, a newly started PATROL agent will not generate an OK event since it sees no state change on the PATROL parameter.

Either way, we have a problem.  The best solution is for BPPM cell to re-check PATROL parameter status for each outstanding alert upon exiting blackout.  Among all the PATROL users I have talked to, this is one of the most-wanted features for event blackout.  Although this feature doesn't come out of the box, you can write your own code using PatrolCli.

For example, you can use the following PatrolCli command to check 'mcell' process status:

PCli% execpsl get("/NT_PROCESS/mcell/PROCStatus/status");
OK
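
The decision a re-sync has to make for each outstanding alert can be sketched as follows.  This is only an illustration in Python, not the MRL rule itself: the function name, the action labels, and the handling of WARN are assumptions, and the status argument stands for whatever text the execpsl get(...) call above returned.

```python
def resync_action(status):
    """Map a parameter status re-read after blackout to a follow-up action.

    Hypothetical helper: the action names are illustrative labels only.
    """
    status = status.strip().upper()
    if status == "OK":
        return "close"        # problem cleared during the window; no notification
    if status in ("ALARM", "WARN"):
        return "notify"       # problem still present; send the suppressed alert
    return "investigate"      # empty or unexpected reply, e.g. agent unreachable


for reply in ("OK", "ALARM", ""):
    print(repr(reply), "->", resync_action(reply))
```

Treating an empty or unknown reply as its own case avoids both failure modes described above: the alert is neither silently dropped nor sent as a possible false alarm.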

It does require some advanced MRL programming skill to tie everything together.  If you need more help, please feel free to contact us for consulting services.  We have developed a proprietary extension for BPPM cell that addresses many out-of-the-box limitations, including event blackout.

Monday, September 22, 2014

PatrolCli - Part 5: Remote control PATROL agent recovery actions from BPPM cell

PATROL agent recovery action is a script triggered by a parameter state change.  It is often used to restart a process when it is down or dump diagnostic messages when a performance indicator shows an abnormal value.

PATROL agent recovery action is a powerful feature.  The drawback is that the triggering parameter value can only be set by the local PATROL agent.  In today's complex enterprise IT environments, determining whether a recovery action is necessary sometimes requires data or events from multiple servers or multiple monitoring tools.  Sometimes it requires BPPM Analytics to determine that a performance indicator is outside its normal range.

When multiple servers, multiple monitoring software, or BPPM Analytics are involved, BPPM cell is the only component with the capability to know that a PATROL agent recovery action is needed.

So how can BPPM cell communicate back to PATROL agent and trigger PATROL agent recovery action?

Since PATROL agent is normally installed on BPPM cell server, you can use PatrolCli to change the state of PATROL parameter remotely from BPPM cell server.  The state change of PATROL parameter will trigger the attached recovery action immediately.

For example, if you have a recovery action attached to parameter /NT_OS/NT_OS/_CollectionStatus, you can change its state to ALARM by including the following PatrolCli command in an OS script called trigger_recovery.cmd located on BPPM cell server:

execpsl "set(\"/NT_OS/NT_OS/_CollectionStatus/status\", ALARM);"

Then simply invoke trigger_recovery.cmd in the execute() function of an MRL rule when BPPM cell determines that it is time to trigger a PATROL recovery action.  This determination can be made by correlating multiple events sent from multiple servers by multiple monitoring tools.  It can also be made upon receiving an intelligent event generated by BPPM Analytics.
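
As a rough sketch, the command file for such a script could be generated per target agent instead of being hard-coded.  The Python helper below is hypothetical (its name and arguments are not part of PatrolCli); it only assembles text in the same user / connect / execpsl / exit layout used by the batch-mode example in Part 3.  Whether the PSL statement needs to be quoted as in the command above is something to verify against your own PatrolCli version.

```python
def render_trigger_input(user, password, host, port, parameter, state="ALARM"):
    """Build the text of a PatrolCli batch input file (hypothetical helper).

    The line layout mirrors the Part 3 input.txt example; adjust the quoting
    of the execpsl line if your PatrolCli version expects it differently.
    """
    lines = [
        f"user {user} {password}",
        f"connect {host} {port}",
        f'execpsl set("{parameter}/status", {state});',
        "exit",
    ]
    return "\n".join(lines) + "\n"


text = render_trigger_input("patrol", "password", "server1", 3181,
                            "/NT_OS/NT_OS/_CollectionStatus")
print(text)
```

The resulting text can be written to a file and passed to PatrolCli with the -f option.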

Monday, September 15, 2014

PatrolCli - Part 4: Event reporting to help troubleshoot missing events

This happens to all of us: You are expecting a critical event sent from a PATROL agent, but for some reason you don't see it in BPPM/BEM GUI.  Now you need to troubleshoot. 

First, you need to determine whether the PATROL agent never sent the event out or whether a cell rule/policy dropped the event.  To determine if the PATROL agent has ever sent the event out, you can use PATROL console and look through the events using its event manager GUI.  However, this method is not very reliable, especially when your PATROL agent generates a lot of events.  If you cannot find the event in PATROL console, it doesn't necessarily mean that PATROL agent didn't send it out.

A quicker and more reliable way to determine if the PATROL agent has ever sent the event out is to run an event report using PatrolCli.  Not only does it save you the effort of bringing up and logging into PATROL console, it also saves all events meeting your criteria into a text file so that you can search through it over and over again for multiple events.

Here is an example of generating an event report.  Assume that you are expecting the PATROL agent running on server1 to send an ALARM/Critical event from parameter /NT_LOGICAL_DISKS/C:/LDldDiskTimePercent, but for some reason you don't see it in BPPM/BEM GUI.  You would like to find out if PATROL agent has ever sent that event to BPPM cell.

To do so, you can simply start PatrolCli from the server you are currently working on as long as it has a PATROL agent running and has permission to connect to PATROL agent running on server1.  Then you just set up an event filter and run an event dump command to save your event report to a file on your current server.

Myserver> PatrolCli
PCli% open server1 3181
Username: patrol
Password:
PCli% event setfilter 091520002014 "" "" A "" "" "" "" ""
OK
PCli% event dump C:\tmp\events.txt W
OK
 
In this example, the setfilter command sets the event filter to all ALARM events from Sep 15 20:00:00 2014 up to the present.  The event dump command specifies the location of the event report output file.  The mode 'W' stands for write; use 'A' to append instead.
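
Judging from the example above, the start time appears to be written as month, day, hour, minute, and four-digit year (MMDDhhmmYYYY).  Under that assumption, a small Python helper (the name is hypothetical) can produce the value for any start time:

```python
from datetime import datetime


def filter_start(when):
    """Format a datetime as MMDDhhmmYYYY, the layout inferred from the example."""
    return when.strftime("%m%d%H%M%Y")


print(filter_start(datetime(2014, 9, 15, 20, 0)))  # 091520002014
```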

Once the above commands have finished, you can open the event report file in a text editor such as WordPad and search for the events generated from parameter /NT_LOGICAL_DISKS/C:/LDldDiskTimePercent.  For example, you may see your event in the event report like this:

Id          : 2118531
Status      : OPEN
Type        : ALARM
Severity    : 4
Time        : Mon Sep 15 22:46:40 2014
Node        : server1
Origin      : NT_LOGICAL_DISKS.C:.LDldDiskTimePercent
Catalog     : 0
Class       : 11
Description : Alarm #2 of global parameter 'LDldDiskTimePercent' triggered on 'NT_LOGICAL_DISKS.C:'.  69 <= 100.00 <= 100
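
For large reports, searching by hand gets tedious.  The following Python sketch parses records in the "Key : Value" layout shown above and keeps only those from a given origin.  It assumes every record starts with an Id line and that each field fits on a single line; the function names are hypothetical.

```python
def parse_event_report(text):
    """Split an event report into a list of dicts, one per event record."""
    events, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")   # split on the first colon only
        if not sep:
            continue                            # blank or free-form line
        key = key.strip()
        if key == "Id":                         # assumed start of a new record
            current = {}
            events.append(current)
        if current is not None:
            current[key] = value.strip()
    return events


def events_from(events, origin):
    """Return the events whose Origin field matches exactly."""
    return [e for e in events if e.get("Origin") == origin]


sample = """\
Id          : 2118531
Status      : OPEN
Type        : ALARM
Time        : Mon Sep 15 22:46:40 2014
Node        : server1
Origin      : NT_LOGICAL_DISKS.C:.LDldDiskTimePercent
"""
matches = events_from(parse_event_report(sample),
                      "NT_LOGICAL_DISKS.C:.LDldDiskTimePercent")
for event in matches:
    print(event["Id"], event["Time"])
```

Splitting on the first colon only matters here, because the Time and Origin values contain colons themselves.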

Monday, September 8, 2014

PatrolCli - Part 3: Run PatrolCli commands from an OS script

In order to execute PatrolCli commands automatically on a schedule, you will need to put them in an OS script, e.g., a shell script on UNIX or a batch script on Windows.  Once the PatrolCli commands are in an OS script, you can schedule the script to run periodically using any vendor's scheduling utility or a native OS scheduling utility (such as cron on UNIX or Task Scheduler on Windows).

PatrolCli has a '-f' option to execute commands in batch mode.  You can use the following format to supply all PatrolCli commands in an input text file instead of typing them interactively at 'PCli%' prompt:

PatrolCli -f input.txt

Often we want to capture the output of the above command and send it out as an email, or use it as the input to the 'msend' API to generate an event in BPPM cell.  In the following example running on Windows, we want to detect if VSM_VC KM has stopped collecting data for at least 15 minutes on host 'server1'.  If so, an email will be sent from PatrolCli@yourcompany.com to bppm_admin@yourcompany.com.

Step 1: Create a text file called input.txt as follows:

user patrol password
connect server1 3181
execpsl int(time()-get("/VSM_VC/crpvms1vcenter2.yourcompany.com/Comm_Status/time"));
exit

Step 2: Create a batch script called VSM_VC_Check.cmd as follows:

@ECHO OFF
PatrolCli -f input.txt > output.txt
set /p result= <output.txt
del output.txt

if %result% GEQ 900 (
echo No VSM_VC data collected for %result% seconds > tmp.txt
smail.exe tmp.txt -t bppm_admin@yourcompany.com -s "VSM_VC data collection stopped" -server <SMTP_Server> -f PatrolCli@yourcompany.com
del tmp.txt
)

Step 3: Schedule script VSM_VC_Check.cmd with a Windows scheduling tool such as Task Scheduler to run periodically (e.g. every 15 minutes).
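
If you prefer a scripting language over a batch file, the same check can be sketched in Python.  This is an illustration under assumptions, not a drop-in replacement: the function names are hypothetical, the only PatrolCli option used is the -f option described above, and the code assumes the age in seconds appears on a line of its own in the captured output.

```python
import re
import subprocess


def first_integer(output):
    """Return the first line that is a bare integer, or None if there is none."""
    for line in output.splitlines():
        token = line.strip()
        if re.fullmatch(r"-?\d+", token):
            return int(token)
    return None


def collection_stalled(output, threshold=900):
    """True if the reported age is at least `threshold` seconds.

    A missing number (for example a connection error) also counts as stalled.
    """
    age = first_integer(output)
    return age is None or age >= threshold


def run_patrolcli(input_file="input.txt"):
    """Run 'PatrolCli -f <input_file>' and return its standard output."""
    result = subprocess.run(["PatrolCli", "-f", input_file],
                            capture_output=True, text=True)
    return result.stdout
```

A scheduled job would call collection_stalled(run_patrolcli()) and send the email or msend event when it returns True.  Unlike the batch version, which reads only the first line of output.txt, this sketch skips any non-numeric lines and treats a missing value as a problem rather than as an empty comparison.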

Monday, September 1, 2014

PatrolCli - Part 2: Monitor PATROL agent health outside of PATROL/BPPM framework

Sometimes you need to monitor PATROL agent health outside of the PATROL/BPPM framework, either manually or automatically using a different tool.  For example, during an upgrade of BPPM server or PATROL console, you may not be able to access PATROL agents using BPPM console or PATROL console.

Using PatrolCli, you can check the health of a PATROL agent running on a remote server without BPPM console or PATROL console.  Based on the output you receive, you will be able to tell if the PATROL agent on the remote server is experiencing a problem.

If PATROL agent is down, you will receive an error message similar to the following:

Myserver> PatrolCli
PCli% open RemoteServer 3181
Username: patrol
Password:
Can't connect to RemoteServer (TCP/3181) as patrol : connecting to agent RemoteServer ...
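
When the session output is captured by a script, this failure can be detected by looking for the message shown above.  The Python sketch below assumes the message starts with "Can't connect to" exactly as in this example; the wording may differ between PatrolCli versions, and the function name is hypothetical.

```python
def agent_unreachable(output):
    """True if any line of the captured output starts with the connect error."""
    return any(line.strip().startswith("Can't connect to")
               for line in output.splitlines())


sample = ("Can't connect to RemoteServer (TCP/3181) as patrol : "
          "connecting to agent RemoteServer ...")
print(agent_unreachable(sample))
```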

If PATROL agent is running but has stopped collecting data, you can use PatrolCli to check the latest timestamp of a common parameter, e.g., CPUprcrProcessorTimePercent on Windows or CPUCpuUtil on UNIX.

Here is an example to retrieve the latest timestamp of CPUprcrProcessorTimePercent. The timestamp is displayed as epoch time.

PCli% execpsl get("/NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time");
1409639472

Compare the latest timestamp for CPU data collection with the current time:

PCli% execpsl time();
1409639562

If the difference between the current time and the latest data collection timestamp is too large (for example, more than 10 minutes), you can reasonably conclude that PATROL agent has stopped collecting data.  In our example, 1409639562 - 1409639472 = 90 seconds, so the data collection looks good.
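
The same arithmetic, plus a conversion of the epoch value into a readable UTC time, can be sketched in Python.  The 600-second limit is the 10-minute rule of thumb from the paragraph above, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def collection_age(now_epoch, last_epoch):
    """Seconds elapsed between the last data collection and 'now'."""
    return now_epoch - last_epoch


last, now = 1409639472, 1409639562      # the two values returned above
age = collection_age(now, last)
print(age, "seconds since last collection")
print("stale" if age > 600 else "fresh")
print(datetime.fromtimestamp(last, tz=timezone.utc).isoformat())
```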

In the next post, we will discuss how to run PatrolCli from a script so you can use a scheduling tool such as UNIX cron or Windows Task Scheduler to check PATROL agent health automatically on a regular basis.