Monday, June 30, 2014

PATROL Agent 9.5 - Part 2: When PATROL agent hangs

In the last post, we discussed how to detect when a PATROL agent is unreachable, whether because the network is down, the server is down, or the PATROL agent itself is down.  One of my readers asked whether the same method can be used to detect when a PATROL agent hangs.

When a PATROL agent hangs, the agent process is still up and still responds with a heartbeat; it simply stops collecting data.  Therefore the method we discussed in the last post cannot tell whether the agent has hung.

The best way to detect a hung PATROL agent is to write your own script that checks when the last data collection happened.  Pick a common parameter with frequent data collection, e.g., /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent on Windows.  By default, CPUprcrProcessorTimePercent is collected every minute.  When you query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time, you receive the timestamp of the last data collection as Epoch time.  Compare this with the current timestamp, and you will know how long it has been since CPU data was last collected.  If it has been a while (e.g., longer than 10 minutes), you can reasonably conclude that the PATROL agent is hung.

You can use PatrolCli to query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time.  PatrolCli is an executable program installed automatically with the PATROL agent under the $PATROL_HOME/bin directory.  It allows you to connect to any PATROL agent in your network, given the right password and permissions, and to run a subset of PSL commands from the command line.  Type 'PatrolCli' at the command line and use the 'help' command to see all the commands it supports.
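To sketch the idea, here is a minimal Python script that decides whether the agent looks hung from the last collection timestamp. The PatrolCli session inside query_last_collection_time is an assumption: the exact command sequence and output format vary by version, so verify them with PatrolCli's 'help' command; the host, port, and credentials are placeholders.

```python
import subprocess
import time

STALE_AFTER_SECONDS = 600  # 10 minutes; generous for a 1-minute collection cycle

def is_stale(last_collection_epoch, now_epoch, threshold=STALE_AFTER_SECONDS):
    """True if the last data collection is older than the threshold."""
    return (now_epoch - last_collection_epoch) > threshold

def query_last_collection_time(host, port, user, password):
    """Query the .../time subnode through PatrolCli; return Epoch seconds.

    Assumption: PatrolCli reads commands from stdin and prints the queried
    value; adjust the session to match your version ('help' lists commands).
    """
    session = (
        f"connect {host} {port} {user} {password}\n"
        "query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time\n"
        "quit\n"
    )
    out = subprocess.run(["PatrolCli"], input=session,
                         capture_output=True, text=True).stdout
    # Assumes the timestamp is the last non-empty line of output.
    return int(out.strip().splitlines()[-1])

def agent_looks_hung(host, port, user, password):
    last = query_last_collection_time(host, port, user, password)
    return is_stale(last, int(time.time()))
```

If agent_looks_hung() returns True, the script can then fire off an email or an msend call.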

Because you are trying to detect whether the PATROL agent itself hangs, it is best to use a mechanism outside of the PATROL framework to schedule your script and send out the alerts.  For example, you can use cron on UNIX or a scheduled task on Windows to run your script periodically.  When you detect that a PATROL agent is hung, you can send alert emails or use the msend command to report your alerts directly to a BPPM/BEM cell.
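As a sketch, the scheduling half might look like the crontab entry below. The script path is hypothetical, and the msend options shown are the commonly used ones (cell name, event class, severity, message, slot values); verify them against your BMC documentation.

```
# UNIX crontab: run the (hypothetical) hang-check script every 10 minutes
*/10 * * * * /opt/patrol/scripts/check_agent_hang.sh >> /var/log/patrol_hang.log 2>&1

# Inside the script, a hang could be reported to a BPPM/BEM cell roughly like:
#   msend -n mycell -a PATROL_AGENT_HANG -r CRITICAL \
#         -m "PATROL agent on myhost appears hung" -b "mc_host=myhost"
```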

I will be out of the country and won't be able to post any blogs for the next two weeks.  This blog will resume on July 21, 2014.  Have a great summer!

Monday, June 23, 2014

PATROL Agent 9.5 - Part 1: When PATROL agent is unreachable

I got a question recently regarding the best way to monitor PATROL 9.5 agent health.  In one of my older posts, "Monitoring PATROL Agent 9.x Status: Do I need AS_AVAILABILITY KM?", I discussed using the MC_CELL_HEARTBEAT_FAILURE event instead of the AS_AVAILABILITY KM to detect the condition when a PATROL agent is not responding. This still applies to PATROL agent version 9.5.

Similar to the AS_AVAILABILITY KM, there are three possible causes when an MC_CELL_HEARTBEAT_FAILURE event is received: the server is down, the PATROL agent is down, or the network is down.

MC_CELL_HEARTBEAT_FAILURE events are triggered and reported automatically.  You don't need to configure anything as long as you have the following pconfig variables set in each PATROL agent:

"/EventSetup/Configuration/Format" = { REPLACE = "BiiP3" },
"/EventSetup/Configuration/Key" = { REPLACE = "mc" }
"/AgentSetup/integration/integrationServices"={REPLACE="tcp:ISN1:3183,tcp:ISN2:3183"}

and either

"/EventSetup/Configuration/ForwardEventsToIS " = { REPLACE = " Yes " }
or

"/EventSetup/Configuration/EventCells" = { REPLACE = "ISN1/1828,ISN2/1828" }

When the PATROL agent is unreachable, you will receive an MC_CELL_HEARTBEAT_FAILURE event.  When the PATROL agent is reachable again, you will receive an MC_CELL_HEARTBEAT_ON event.  You do need to reword the msg event slot to make the message more meaningful.

We will discuss how to detect when PATROL agent hangs in the next post.


Monday, June 16, 2014

BPPM 9.5 Quick Start - Part 11: High availability consideration

As the final post in 'BPPM 9.5 Quick Start' series, let's look at high availability.  Without a doubt, BPPM 9.5 has made some great progress in high availability by introducing integration service clusters. A BPPM 9.5 integration service node (ISN) can fail over to the other ISN in the cluster seamlessly with no data loss. PATROL agents connected to the ISN cluster can buffer and resend data for up to 30 minutes.

However, no progress has been made in the high availability of the BPPM server.  Your only option is still a disk-based cluster provided by the operating system, such as Microsoft Windows Cluster.  There are a couple of drawbacks to a disk-based BPPM cluster: 1) up to 10 minutes of downtime after the primary server goes down and before the secondary server comes up; 2) high cost - the cost for the two servers in the cluster plus the software is usually about the cost of three servers.

If your business cannot justify implementing a disk-based BPPM server cluster, especially if you are also required to implement a DR BPPM server in another data center, you may wonder whether there is something else you can do to improve the high availability of your overall solution without incurring the cost and complexity of a disk-based server cluster.

In a business, the most critical incidents that violate service level agreements are availability alerts.  Without high availability of the BPPM server, you will need to use a pair of high-availability remote BPPM cells, instead of the embedded cells on the BPPM server, to send notifications and initiate incident ticket creation.  Actually, as long as those availability alerts do not come out of service models, a pair of H/A BPPM remote cells usually works better than the embedded cells in a disk-based BPPM server cluster, because it is a native application-level cluster.

In order to initiate incident ticket creation on BPPM remote cells, you will need to install IBRSD on these cells.  This step is in addition to installing the CMDB integration module on the BPPM server, which automatically includes IBRSD.  Now you can let the BPPM server initiate incident ticket creation related to service models, and let the BPPM remote cells initiate incident ticket creation not related to service models.

To address the lack of GUI access when the BPPM server goes down, you can install the BEM version 7.x login server (also called the admin server) and BMC Impact Explorer (IX) as an emergency GUI.  Register all your BPPM cells with the BEM login server.  When the BPPM server goes down, you can still see all of your BPPM remote cells from IX.  All BPPM 9.x cells display well in the BEM 7.x IX.
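Registering cells with the login server is done through its mcell.dir file. A sketch, with hypothetical cell names and hosts and the default 'mc' encryption key (verify the entries against your own cell configuration):

```
# mcell.dir on the BEM login server host (names and hosts are examples)
cell  pncell_bppm1      mc  bppmhost1:1828
cell  remotecell_east   mc  eastcellhost:1828
cell  remotecell_west   mc  westcellhost:1828
```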

In BPPM 9.5, the integration service has been made totally stateless, so PATROL data travels through it to the BPPM server without stopping.  If the BPPM server goes down, PATROL data will be buffered at the PATROL agent for up to 30 minutes.  This means no data loss as long as the BPPM server is up again within 30 minutes.  Without data, you will lose intelligent events such as anomaly detection and trend prediction.  If these are not critical to your business for a short period of time, you can save some cost right now by postponing the BPPM server H/A implementation until BMC comes up with an application-level BPPM server H/A solution.

Monday, June 9, 2014

BPPM 9.5 Quick Start - Part 10: PATROL agent installation and configuration

There are several ways to deploy a large number of PATROL agents, including creating a silent installation package using CMA, using RSCD agents in BMC Server Automation (BSA), and using the old PATROL Distribution Server.  But if you just need to install a few PATROL agents in a development environment to understand how BPPM 9.5 works, you can use the PATROL installer to install each PATROL agent interactively.

To get PATROL data and events into the BPPM server, the first step in PATROL configuration is to specify the data and event destinations.  Since BPPM 9.5 allows you to send both PATROL data and events to the same port on an integration service node (ISN), your typical pconfig settings would be:

"/EventSetup/Configuration/Format" = { REPLACE = "BiiP3" },
"/EventSetup/Configuration/Key" = { REPLACE = "mc" }
"/EventSetup/Configuration/ForwardEventsToIS " = { REPLACE = " Yes " }
"/AgentSetup/integration/integrationServices"={REPLACE="tcp:ISN1:3183,tcp:ISN2:3183"} 

where ISN1 and ISN2 are the host names of the first and second ISNs in a high-availability cluster.

If you don't want to send PATROL events to the same port as PATROL data, you can still use what you had in BPPM 9.0 by replacing
"/EventSetup/Configuration/ForwardEventsToIS " = { REPLACE = " Yes " }
with
"/EventSetup/Configuration/EventCells" = { REPLACE = "ISN1/1828,ISN2/1828" }

Once the above configuration is in the PATROL agent, you will see all PATROL agents linked with their ISNs in the CMA GUI display.

Next, you will need to add tags to the PATROL agent configuration. Here is the format for a tag in pconfig:

"/AgentSetup/Identification/Tags/TAG/<tag_name>" = {REPLACE = "Tag description"}

For example, you may want to define a tag for Windows OS base monitoring.  You will have the following variable in pconfig:

"/AgentSetup/Identification/Tags/TAG/WinBase" = {REPLACE = "Windows OS Base Monitoring"}

where 'WinBase' is the tag name you created, and it has to match a tag specified in a CMA policy in order to receive the configurations contained in that CMA policy.  "Windows OS Base Monitoring" is a description you give to this tag, and it does not have to match anything.

When a tag in the PATROL agent pconfig matches a tag in a CMA policy, CMA will push all the configurations contained in that policy to the PATROL agent immediately.  Whenever any configuration is updated in a policy, all PATROL agents with matching tags receive the same update immediately.  No PATROL agent restart is needed.
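If you prefer to set a tag from the command line rather than through CMA, a sketch using the pconfig utility might look like the following. The host name and file name are hypothetical, and 3181 is the default agent port; verify the exact pconfig options for your version.

```
# wintags.cfg - a pconfig change file starts with the PATROL_CONFIG header
PATROL_CONFIG
"/AgentSetup/Identification/Tags/TAG/WinBase" = { REPLACE = "Windows OS Base Monitoring" }

# Apply it to a running agent:
#   pconfig +Reload -host myagenthost -p 3181 wintags.cfg
```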

Monday, June 2, 2014

BPPM 9.5 Quick Start - Part 9: CMA monitoring policy creation

Central Monitoring Administration (CMA) has been redesigned for BPPM 9.5.  You will need to create new monitoring policies in BPPM 9.5.  CMA sends configuration information to PATROL 9.5 agents through the BPPM child server and the BPPM integration service.

Each monitoring policy in CMA contains a name and a set of tags to match PATROL agents.  A PATROL agent can contain one or more tags in its pconfig variables. When a tag contained in a PATROL agent matches a tag specified in a monitoring policy's criteria, all configuration contained in the policy is pushed to the PATROL agent as pconfig variable values.  This happens whenever the PATROL agent connects to CMA (such as when the agent restarts) and whenever the monitoring policy changes in CMA.

To start creating CMA policies, log in to CMA and open the 'Policies' drawer under the 'Navigation' bar on the left.  Open the 'Monitoring Policy Configuration' window.  Follow the order in the menu and fill in the two mandatory configuration sections first: General and Agent Selection.  You will be asked to input a policy name, a precedence number, and a set of tags to match PATROL agents.

Then you can fill in the seven optional configuration sections: Monitoring Configuration, Filtering Configuration, Agent Threshold Configuration, Server Threshold Configuration, Agent Configuration, Server Configuration, and Configuration Variables.  Monitoring configuration is categorized by monitoring solution (package name), version number, monitoring profile (KML name), and monitor type (KM name).  If you have created your own custom KMs, you will need to import them into the Central Monitoring Repository before you can create monitoring configuration for them.

Most of the policy creation process is self-explanatory if you follow the GUI, but it is a fairly time-consuming process.  Because one policy can match multiple tags and one PATROL agent can contain multiple tags, it is recommended that you carefully plan a systematic way to name all your tags, based on operating systems, applications, versions, environments, and locations, before starting a massive policy creation effort.

Similar to the process of creating rule sets in PATROL Configuration Manager (PCM), decide on a few common denominators first, such as a base UNIX OS policy and a base Oracle policy.  Then build your policies in a hierarchy: for example, a base Windows OS policy, followed by a production Windows OS policy, followed by a base SQL Server policy, followed by a production SQL Server policy.  A lower precedence number indicates a higher precedence.
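As a purely illustrative sketch of such a hierarchy (policy names, tags, and precedence numbers are made up; remember that a lower number wins on conflicts, so the more specific production policies get lower numbers than the base policies they refine):

```
Policy                  Tag        Precedence
base-windows-os         WinBase    900
prod-windows-os         WinProd    800
base-sqlserver          SqlBase    700
prod-sqlserver          SqlProd    600
```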