Monday, December 29, 2014

Total Cost of Ownership of BPPM - Part 12: Best practice - Fail to plan means plan to fail

I recently took on a client project implementing BMC TrueSight/BPPM 9.6.  The client insisted on doing the installation themselves and brought me in after the installation was completed.  When I finally joined the project, after they had spent weeks finishing the installation, I found that many things were installed incorrectly: the BPPM server was under capacity, BPPM and Entuity were sharing the same server, and the two BPPM Integration Service servers and two BPPM Cell servers were each configured as an application cluster inside a Windows OS cluster.  On top of that, there was no written architecture plan.

Needless to say, the installation needed to be redone.  Had they produced an architecture drawing or written plan before starting the installation, and let me (or someone with BPPM architecture experience) review and confirm it, they could have saved themselves a lot of installation time.

So what needs to be planned before you start installation?

1) Server capacity: The BPPM server is a heavy-duty server.  It includes a web server, an app server, an analytics engine, a BPPM cell, and an optional Sybase database.  All of these components are memory hungry.  Unless you are installing a proof-of-concept environment, go for 32GB RAM.  If you are implementing both BPPM and Entuity, install Entuity on a separate server.  With a MySQL database on the same server, Entuity needs 32GB RAM too.  Virtual machines work OK, though physical machines give you more power.  A BPPM Integration Service server is much less demanding and works fine with 8GB RAM.  Each Integration Service (pair) can connect to up to 900 PATROL agents.  A BPPM cell server needs 4GB RAM for one cell or 8GB RAM for 2-4 cells.

2) High availability: There are three options for high availability - application cluster, VMware cluster, and OS cluster (such as Microsoft Windows Cluster).  An application cluster fails over at the individual process level when a process fails.  An OS cluster fails over at the host level when a process fails.  A VMware cluster fails over at the host level when a host fails.  For BPPM Integration Service servers and BPPM cell servers, configure application clusters if you have enough dedicated hardware.  Do not mix an application cluster with an OS cluster.  For BPPM servers, which do not support application clustering, use an OS cluster if you have enough dedicated hardware.  If you don't have enough dedicated hardware and run everything on virtual machines, you can configure a VMware cluster across all servers.

3) Integration: The BPPM server can be integrated with many other BMC products such as BMC Server Automation, Atrium CMDB, and Single Sign-On.  If you plan to configure any of these integrations during BPPM server installation, you need to have the detailed integration information ready.  However, you also have the option to configure your integrations later.

4) Naming convention: In addition to the installation directory path, you need a systematic way to determine the name of each BPPM cell, BPPM Integration Service, and BPPM Integration Service cluster.  Naming convention planning was discussed in more detail in this earlier post: BPPM 9.5 Quick Start - Part 3: Name convention

Monday, December 22, 2014

Total Cost of Ownership of BPPM - Part 11: Best practice - Document implementation with checkpoints

You were excited to start implementing a new version of BPPM.  You spent days or weeks reading BMC documents, working with your system administrators to get the servers ready, working with your DBAs to get your database ready, and navigating complicated installation processes with various options regarding high availability and integration.  You may even have had to open cases with BMC Support to finally finish installing the different BPPM components on different systems.

Then, when you started all the BPPM components and logged into the BPPM GUI, you realized it didn't work - the components didn't communicate with each other, data and events didn't show up in the GUI, and errors appeared in the log files.  Which part went wrong?  You ended up opening another case with BMC Support.

Sound familiar?  What is worse is that, after you spend another week or so working with BMC Support to finally figure out what went wrong and correct it, a similar headache may happen again when you start a BPPM implementation in another environment.

This situation can be improved or even avoided by adding checkpoints to your implementation process.  Checkpoints are verification steps performed after completing the installation or configuration of a single BPPM component (or part of one).  A checkpoint gives you a "so far so good" confirmation before you move on to the next step of the implementation.

For example, before you install the BPPM server, one checkpoint should be running a few SQL statements to verify that the correct Oracle user accounts and tablespaces have been created.  After installing an H/A pair of BPPM Integration Services and one PATROL agent, another checkpoint should be verifying that the PATROL agent fails over to the secondary BPPM Integration Service when the primary one goes down.
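Verification steps like these are easy to script.  The helper below is a minimal sketch; the sqlplus command shown in the comment is only an illustration of how a database checkpoint might look, and the account, password, and tablespace names in it are placeholders, not values from any real environment:

```shell
#!/bin/sh
# Run a verification command and stop the implementation immediately
# if it fails, so a problem surfaces at the step that caused it.
checkpoint() {
    desc="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "CHECKPOINT PASSED: $desc"
    else
        echo "CHECKPOINT FAILED: $desc" >&2
        exit 1
    fi
}

# A database checkpoint might look like this (hypothetical credentials
# and tablespace name):
# checkpoint "BPPM tablespace exists" sh -c \
#   "echo 'SELECT tablespace_name FROM dba_tablespaces;' \
#    | sqlplus -s bppm_owner/password@BPPMDB | grep -q BPPM_DATA"

# A trivial, runnable example:
checkpoint "temp directory is writable" test -w /tmp
```

Collect helpers like this into one script per component and you get a repeatable verification suite you can re-run in every environment.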

Checkpoints help you notice problems in your installation or configuration so you can correct them as they occur.  If a checkpoint fails, it is a whole lot easier to find and fix the problem right there than to discover at the end of the implementation that things don't work.

In addition, it is highly recommended that you write your own version of the installation and configuration documents, including all checkpoints.  The installation and configuration documents from BMC are generic; it takes much longer to read them and apply them to your own environment.

When implementing a new version of BPPM for the first time, write down every step you performed and every piece of information you entered to create your own installation and configuration documents.  I personally like to include screenshots in the documents because one picture is worth a thousand words.  And of course, always include detailed steps for the checkpoints.

This may seem time-consuming at first, especially when you are under pressure to have the installation "done" before a deadline.  But having your own installation and configuration documents will drastically speed up future deployment and troubleshooting.  My experience working on BMC products for the last 20 years has proven to me that it is well worth the effort.

Monday, December 15, 2014

Total Cost of Ownership of BPPM - Part 10: Best practice - GUI vs Batch

BPPM provides many graphical user interfaces to help you configure its components.  For example, you can use the PATROL Central Console or the PATROL Classic Console to configure a PATROL KM.  The GUI validates much of what you type, including directory names, Oracle instance names, etc.

A GUI is great for configuring your first server, your second server, perhaps even your fifth.  But what if you have 200 Oracle servers to configure?  Don't you wish you had a faster, less tedious way to accomplish the work?

When it comes to mass deployment, batch configuration is a whole lot faster and easier, and it keeps you from going nuts with repetitive work.  Here is how you can move from GUI to batch for PATROL KM configuration.

For PATROL KM configuration, on the first server, run 'pconfig +get > C:\temp\pconfig1.txt' to retrieve all pconfig values into a text file.  Then configure the KM on that server using the GUI.  After that, run 'pconfig +get > C:\temp\pconfig2.txt' to retrieve all pconfig values into a second text file.  Diff the two files and you will have all the pconfig variables that belong to this KM.  Examine each pconfig variable to find out what needs to change for a different server.  If you are not sure, repeat the same process on your second server to confirm which values change.  Make the changes either manually or with a script.  You will end up with different pconfig sets to deploy to different servers using either the 'pconfig' command or PATROL Configuration Manager (PCM).
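To make the diff step concrete, here is a sketch using standard Unix tools.  The snapshot content below is illustrative stand-in data, not output from a real agent; on a PATROL host the two files come from the 'pconfig +get' commands described above:

```shell
#!/bin/sh
# Stand-in "before" and "after" pconfig snapshots; on a real server these
# come from 'pconfig +get' before and after the GUI configuration.
# The variable names and values here are made up for illustration.
cat > pconfig1.txt <<'EOF'
"/AgentSetup/defaultAccount" = { REPLACE = "patrol" }
EOF
cat > pconfig2.txt <<'EOF'
"/AgentSetup/defaultAccount" = { REPLACE = "patrol" }
"/XYZ_CACHE/dbName" = { REPLACE = "CACHEDB1" }
EOF

# comm needs sorted input; -13 keeps the lines found only in the second
# file, i.e. the variables added by the KM configuration.
sort pconfig1.txt > before.sorted
sort pconfig2.txt > after.sorted
comm -13 before.sorted after.sorted > km_variables.txt

cat km_variables.txt   # the pconfig set that belongs to this KM
```

The resulting km_variables.txt is the starting point for your per-server pconfig sets.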

Where else can you use batch deployment?  The BPPM cell is a good candidate, since you may have 10-100 cells in total across your dev, QA, and production environments.  Many of your BPPM cells share the same policies and data tables that were configured through the GUI.  In this case, you can simply export the policies and data tables from an existing cell, make any necessary modifications, and import them into the new cell.  Using batch deployment for BPPM cell configuration not only cuts down your deployment time, it also eliminates discrepancies due to human error.

In general, any GUI configuration in your environment is worth converting to batch deployment if the work will be repeated more than 3 times.




Monday, December 8, 2014

Total Cost of Ownership of BPPM - Part 9: Best practice - Store only one copy of the same file

In the previous post, we discussed the file repository where you store all your customization files.  It simplifies deployment because you always know where the most up-to-date copy of each file is located.  However, if you keep more than one copy of the same file in your file repository, the copies can drift out of sync, and then you have to figure out which one is the most up-to-date when you deploy.

How can you end up with multiple copies of the same file in your file repository?  Different server roles and different environments can require different file content under the same file name.  For example, the .load file is different for your PATROL cell and your correlation cell.  The IBRSD.conf file is different for your development environment and your production environment.

Does this mean you have to keep separate file repositories for different cells and different environments?  No.  Since most of the files are the same in your development and production environments, separate file repositories would drift out of sync quickly.

Because only a small percentage of customization files differ between environments, I simply use a renaming method to store and deploy them.  For example, since IBRSD.conf differs between the development and production environments while most other IBRSD files are the same in both, I store both copies of IBRSD.conf in my file repository, as IBRSD.conf.dev and IBRSD.conf.prd, in the same directory.  When I deploy IBRSD to a production server, both IBRSD.conf.dev and IBRSD.conf.prd are copied to the destination server.  Then I run a post-deployment script to rename any file with a .prd extension to its original file name without the .prd extension.
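A post-deployment rename script for this method can be as small as the sketch below.  One hedge: this post only mentions renaming the current environment's files, so removing the other environment's copies is my own addition, and the example path in the comment is illustrative:

```shell
#!/bin/sh
# promote_env_files <deploy_dir> <env>
# Renames files ending in .<env> to their original names (for example
# IBRSD.conf.prd becomes IBRSD.conf) and removes the other environment's
# copies.  Runs in a subshell so the caller's working directory is kept.
promote_env_files() (
    cd "$1" || exit 1
    target_env="$2"
    for f in *.dev *.prd; do
        [ -e "$f" ] || continue          # skip unmatched glob patterns
        suffix="${f##*.}"
        if [ "$suffix" = "$target_env" ]; then
            mv "$f" "${f%.$suffix}"      # strip the environment suffix
        else
            rm -f "$f"                   # drop the other environment's copy
        fi
    done
)

# Example, after copying the repository to a production server
# (path illustrative):
# promote_env_files /opt/bmc/impact/server/etc prd
```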

By using this renaming method, I can handle a small number of file variations while avoiding multiple copies of the same file in my file repository.  This method also keeps the file repository small.

Having a small file repository means you can back it up on a thumb drive, as the files are mostly ASCII.  In a catastrophic situation, you can quickly rebuild your entire system at another location with exactly the same customization you had at the original location.



Monday, December 1, 2014

Total Cost of Ownership of BPPM - Part 8: Best practice - File repository

During a BPPM implementation, you will need to create some files of your own or modify out-of-box BPPM files.  These customization files include configuration files, cmd files, shell scripts, Perl scripts, KM files, PSL files, BAROC files, MRL files, etc.

One common practice I have seen at many organizations is to simply create new files and leave them on the server.  When the same files are needed on a second server, they are copied from the first server.  When the number of files becomes large, individual file copies are replaced by directory copies.

This practice works to some extent, but it has several issues.  For example:

1) If you modify some of the files while working on the third server, you will need to remember to repeat the same modification on the first and second servers.  Otherwise, the files on those servers will be out of sync.  A few months later, you won't remember which server holds the most up-to-date copy of your customization files.

2) When you start copying entire directories instead of individual files, you can carry all the junk files generated during testing over to the next server.  A few months later, you will have a hard time remembering what those junk files are for.

3) Upgrades will be a nightmare.  You will need to figure out which files to copy into the next version, which files to merge, and how many different variations of configuration/data files to extract out of the current version.

By using a single file repository, you can minimize or even eliminate the above issues.

Pick a shared drive as your file repository, at a location from which you can easily copy files to and from other servers.  Keep the directory structure as close as possible to that of the destination servers (such as server\etc\cellname\kb\rules\).  Save only the files you created or modified in the file repository.  Never copy the original BPPM out-of-box files there.

As soon as you finish testing your customization files on a server, save a copy of the most up-to-date files into your file repository.  When you need to deploy files to a new server, always copy them from the file repository, never from another server.  If you are not sure whether a server has the most up-to-date version of your customization, simply deploy all your files again from the file repository.  When it is time to upgrade, the files in your repository are the only files you need to copy or merge into the next version.
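A minimal deployment sketch, assuming the repository mirrors the destination layout as described above.  All paths and file names here are illustrative; for a remote server you would wrap the copy in scp or rsync:

```shell
#!/bin/sh
# deploy_from_repo <repo_dir> <dest_dir>
# Copies the repository contents into the destination, preserving the
# directory structure, so the repository stays the single source of truth.
deploy_from_repo() {
    repo="$1"; dest="$2"
    [ -d "$repo" ] || { echo "no such repository: $repo" >&2; return 1; }
    mkdir -p "$dest"
    cp -R "$repo"/. "$dest"/
}

# Example (illustrative paths):
# deploy_from_repo /mnt/bppm_repo/server /opt/bmc/impact/server
```

Re-running the same function is also how you "deploy all your files again" when you suspect a server has drifted.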

Tuesday, November 25, 2014

Total Cost of Ownership of BPPM - Part 7: Best practice - Naming convention

There are many things you can do during a BPPM implementation to lower the total cost of ownership of BPPM.  Some of them are big, such as architecture design.  Some of them seem small, such as the naming convention.  However small they seem, they help minimize 'Winchester House' syndrome and lower the total cost of ownership of BPPM.

I would like to start the best practice discussion with naming conventions.  BPPM doesn't impose any requirements on how you name your BPPM components, including files, cells, integration services, clusters, CMA tags, etc.  If no naming convention is in place, each person will end up naming BPPM components any way he or she likes.  Sooner or later, your BPPM implementation becomes a Winchester House.

Putting a naming convention in place at the beginning of a BPPM implementation is a small effort, but it pays back many times over after the implementation is completed.  The naming convention needs to be enforced at all times, both during and after the implementation.  Renaming things afterwards is a painful and error-prone process.

Pick a 2- or 3-letter prefix for your organization.  It will help distinguish your custom files from BPPM out-of-box files.  For example, I used the prefix C1 for CapitalOne and the prefix CTI for CitiMortgage.  Use this prefix for your custom PATROL KM files, PSL files, cell knowledge base MRL files, BAROC files, shell scripts, Perl scripts, batch files, JavaScript files, etc.

For a custom PATROL KM, also pick a short name for the KM.  I prefer all capitals for KM short names.  Include this short name in all file names related to the KM.  This makes the KM easy to package and deploy without missing any files.  If the KM requires a specific pconfig file, I use the same naming convention for it.  For example, here are the file names I used for a custom CACHE KM: XYZ_CACHE_main.km, XYZ_CACHE_db.km, XYZ_CACHE_db_collector.psl, and XYZ_CACHE.cfg.  XYZ here is the organization prefix.

For cell names, never reuse the same name anywhere in the organization.  Since cell names are prompted for during BPPM installation, you need to discuss and decide on them before you start the installation.  The information that needs to be encoded in a cell name includes: 1) Environment: dev, QA, or production. 2) Event source: PATROL, SCOM, external, correlation, server, etc. 3) Type: H/A or standalone.  For example, I named the cell located on the first clustered PATROL Integration Service in the dev environment Dev_PATROL1_HA.
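A convention like this can even be enforced mechanically.  The small sketch below builds cell names from the three pieces of information; the accepted environment values and the SA abbreviation for standalone are my assumptions based on the examples in this post:

```shell
#!/bin/sh
# make_cell_name <Env> <Source> <Type>
# Builds a cell name such as Dev_PATROL1_HA and rejects values outside
# the convention.  The allowed Env and Type values are assumptions.
make_cell_name() {
    env="$1"; src="$2"; cell_type="$3"
    case "$env" in
        Dev|QA|Prod) ;;
        *) echo "invalid environment: $env" >&2; return 1 ;;
    esac
    case "$cell_type" in
        HA|SA) ;;    # SA = standalone (assumed abbreviation)
        *) echo "invalid type: $cell_type" >&2; return 1 ;;
    esac
    echo "${env}_${src}_${cell_type}"
}

make_cell_name Dev PATROL1 HA    # prints Dev_PATROL1_HA
```

Running every proposed cell name through a gatekeeper like this before installation keeps ad-hoc names from creeping in.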

CMA tags also need to be determined in advance.  They can usually be organized in two ways: by infrastructure (OS, DB, application) and by user environment.  For example, you may have CMA tags like Windows_Base, Oracle_Canada, and Peoplesoft_HR.  The less overlap between tags, the easier they are to use and maintain.  Be careful when changing existing CMA tags after the implementation is completed, because all PATROL agents carrying the same tag names will have their pconfig updated automatically.

Monday, November 17, 2014

Total Cost of Ownership of BPPM - Part 6: Stop building Winchester House

Does your BPPM customization process look like building the Winchester House?

For those who are not familiar with it, the Winchester House is a famous house in San Jose, CA.  This giant 160-room house was built gradually over a period of 38 years with no overall planning or consistency.  Staircases lead to the ceiling and doors open onto solid walls.  Visitors are warned not to wander away from the tour group; otherwise they can be lost for hours.

All of my "rescue mission" projects involved substantial reverse engineering, re-design, and re-implementation of previous customization that suffered from "Winchester House" syndrome, which substantially increases the cost of maintaining and extending BPPM customization.

Here are some typical examples of "Winchester House" syndrome:

1) In custom PATROL knowledge modules, some KMs read data input from pconfig while others read it from external files, and there is no standard location for the external files.

2) In a custom BPPM cell knowledge base, some events use the mc_object slot to determine their Remedy support group, while others parse the msg string to determine it.

3) In a custom BPPM cell knowledge base, different events use different update rules even when their requirements are the same.

The root cause of "Winchester House" syndrome is lack of development experience.  As stated in the previous post, many BMC customers use the same resource for installation and for customization.  Maintaining accurate and up-to-date documentation can help reduce the syndrome, but the key to eliminating it is to have an experienced developer set up a well-defined framework, naming convention, and best practices at the beginning of the BPPM customization.

Eliminating "Winchester House" syndrome can dramatically lower your total cost of ownership.  We will discuss best practices for BPPM customization in the next few posts.

Monday, November 10, 2014

Total Cost of Ownership of BPPM - Part 5: Have separate resources for customization and operations

I have been on several rescue missions where a customization or integration stopped working on a different type or version of input and the client had no idea how to modify it.  Because there was very little accurate documentation about the customization or integration, I had to reverse engineer the whole thing from various source code, configuration files, and test cases to understand how it was done.  Then I made the necessary modifications to make it work again.  After that, it took more time to write the complete documents that should have been there in the first place, including release notes, a deployment guide, and a troubleshooting guide.

All the time spent on reverse engineering and re-documentation increased the total cost of ownership for the client.  Sometimes the customization or integration had to be completely rewritten because the original was not architected right; in that case, the total cost of ownership increased even more.  This cost, along with the headache, could have been completely avoided had the customization or integration been done right in the first place.

How does this happen?  A typical story goes like this: the IT organization hires an employee or long-term consultant responsible for day-to-day operations support, including installation, configuration, reporting, user support, and troubleshooting.  At the same time, the same person is required to develop customized solutions, program PSL and MRL, and integrate 3rd-party monitoring software into BPPM.

We all know operations and development require different skill sets.  A person who is good at the repetitive work of day-to-day operations support often does not have enough experience to develop robust and maintainable solutions for customization and integration.  A person who is good at developing customization and integration is often bored by day-to-day operations.  Requiring one person to do both operations and development is often the root cause of many troubled BPPM operations.

As an example, here is the job description from a recent job ad:
- Administer and configure Enterprise Systems Management/Monitoring systems such as BMC ProactiveNet Performance Management (BPPM), BMC Event Manager (BEM), and TMART.
- Integrate end-to-end infrastructure solutions with BEM for Unified Event Management
- Integrate enterprise solutions such as Netcool, Omegamon, and CA Wily with BEM/BPPM

Because the hourly rate of operations support staff is lower than that of a developer, and there are many more people with operations support skills than with development skills, most IT organizations end up hiring operations support staff and also requiring them to develop customization and integration.  On the surface, the lower hourly rate appears to lower the cost.  In reality, the opposite is true.  Due to lack of experience and skill, it takes an operations support person considerably longer to complete the development work for customization and integration.  And the solution is often hard to maintain, extend, or troubleshoot.  The person who developed the solution is often the only one who understands and can maintain it, and the entire solution falls apart when that person leaves the organization.

To avoid this mistake, it is highly recommended to hire an experienced developer to build a robust and well-documented solution for your customization and integration requirements.  A good developer can develop a robust solution quickly and hand the finished work over to your operations support staff for maintenance.  The hourly rate of a good developer may be higher than that of an operations support person, but the time required to complete the development is much less.  Development work is a one-time cost; when it is completed, there is no need to keep the developer around.  Do the math on your total budget, not just the hourly rate.  In the end, hiring a good developer should cost you less in total.

We have been doing customized BPPM development for the last 12 years.  To further reduce client risk, we offer fixed-price quotes on well-defined requirements that fit your total budget.  For example, if you estimate that it will take your operations support staff 6 weeks at $75/hr to finish an integration project, we can take the project on for your total estimated budget ($75 x 240 hours = $18,000).  We can afford to do this because we have developed a time-proven methodology for integration and most of the source code has already been developed.  And of course, we will deliver the solution in far less than 6 weeks.

Monday, November 3, 2014

Total Cost of Ownership of BPPM - Part 4: Customization in BPPM operations

In the previous posts, we discussed the first two types of BPPM customization: adding new features and integration.  In this post, we will focus on the third type of BPPM customization: automation in BPPM operations.

When you build a house, the time you will live in it is much longer than the time you spend building it.  It makes sense to add some extra features during construction to make your life easier after you move in.  For example, you may want to wire for surround sound speakers and Ethernet during construction.

The same applies to a BPPM implementation.  Every BPPM implementation costs more over many years of operations than over a few months of implementation.  It makes sense to add, during implementation, the customization that will cut down the cost of operations.  Every full-time staff position you save in operations is roughly $100K in savings, year after year.

Although some automation for BPPM operations can be added after the implementation is completed, the majority of the automation should be planned as part of the implementation.  As an example, I am going to share how our 'cell extension' product has kept a large BPPM operation manageable with a small support team.

A BPPM support team spends the majority of its time on the following: 1) Gathering user requirements for monitoring, email notification, and ticketing assignments; 2) Configuring or customizing BPPM to meet user requirements; 3) Optionally developing customization in the dev environment; 4) Testing configuration and customization in the UAT environment; 5) Deploying configuration and customization to the production environment under change control; 6) Setting up blackout periods for scheduled and on-demand maintenance windows; 7) Troubleshooting when things don't work as they should.

Our 'cell extension' product is a BPPM cell knowledge base applied on top of BMC's out-of-box knowledge base.  It includes all the common features of BPPM operations such as event updates, repeats, aggregation, delay, blackout, rewording, email notification, and Remedy ticketing.  For PATROL and Portal events, no cell policy configuration or rule programming is required.  For 3rd-party events, you only need to provide a slot mapping from the 3rd-party events, following our sample code, as a one-time setup.  As an added bonus, it also provides plug-and-play interfaces for 2-way integration with any 3rd-party monitoring software.

Our 'cell extension' product is entirely data driven, so it works for every BMC customer.  It uses a few simple forms for user input.  All of these forms can be viewed and updated offline, without BPPM access.  An end user can fill out the forms with help from the BPPM support team initially.  Many end users even choose to own the forms and put them under a source control system.  Through these forms, an end user can tell BPPM which events they are interested in, how they want the message formatted, who to email, which group a ticket should be assigned to, and so on.  The BPPM support team only needs to help end users fill out the forms (or fill them out for them), check the forms for errors, and load them directly into BPPM.  The same forms are used for the dev, UAT, and production environments.

Since data input is dramatically reduced and there is no need to configure cell policies or develop cell rules, you only need a small BPPM support team to run a large BPPM operation.  Troubleshooting is much easier since all events follow the same flow.  Human input error is minimized when data is entered only once.  BPPM cells process events faster since no event policy is used.  A user requirement can be completed in minutes instead of days.

If you are interested in finding out how our 'cell extension' product can work for you, please feel free to contact us via the link at the top of this page.

Monday, October 27, 2014

Total Cost of Ownership of BPPM - Part 3: What to customize in BPPM

BPPM customization is essentially development work.  BPPM provides a well-structured development framework; all you need to do in customization is fill in the gaps between the frames with code modules.

BPPM uses proprietary programming languages for development.  PSL (PATROL Scripting Language) is used to develop customization in its PATROL component.  MRL (Master Rule Language) is used to develop customization in its BPPM/BEM cell component. We provide training for both PSL and MRL development.

There are three areas you can customize in BPPM to fully realize the ROI from BPPM.

1) Adding new features that are not available out of the box

Your business may require you to monitor an application that BPPM does not monitor out of the box.  This requires you to develop a custom knowledge module in PATROL.  Having personally developed more than 100 custom PATROL knowledge modules, I can tell you how much every enterprise can benefit from custom-developed monitoring features.

Your business may require you to aggregate multiple alerts into one incident ticket when they all refer to the same incident.  This requires you to develop event processing rules in the BPPM/BEM cell that are not available out of the box.  Without this customization, each alert results in a separate ticket, and someone has to analyze them and manually merge multiple tickets into one.  This customization helps your IT organization cut down its cost.

2) Integration with non-BMC products

In my opinion, integration gives you the best bang for the buck.  If you are like most enterprise IT organizations, you are already running all kinds of best-of-breed monitoring software, such as SolarWinds for the network, Oracle OEM for Oracle databases, and Microsoft SCOM for Windows servers.  How about open source tools such as Nagios and OpenNMS, security monitoring tools such as ArcSight, your home-grown cron-job scripts, and self-monitoring devices such as APC UPSes?

When you integrate these non-BMC products into the BPPM/BEM cell, not only do you get one GUI to view all alerts together, you can also take advantage of the back-end incident management system connected to BPPM for automated incident ticket creation.

After I finished an IBM Netcool integration with BPPM for a client, they realized how much it saved them from paying for an expensive Netcool-Remedy gateway license, and how much more reliable an automatic BPPM integration failover is compared to the manual Netcool-Remedy gateway failover.  They then decided to integrate every non-BMC monitoring product into BPPM for automated Remedy ticketing.

Understandably, this is an area BMC doesn't like to talk about much, because it could lower their PATROL license sales.

We have been big believers in best-of-breed tools and integration for the last decade.  We have developed a proprietary methodology to seamlessly integrate any non-BMC monitoring software into the BPPM cell.  Our methodology is so reliable, repeatable, and universally adaptable that we are willing to take fixed-price contracts for your integration projects (as well as your other development projects).  Please contact us for pricing details.

3) Automation in BPPM Operations

This is a lesser-known area compared to the other two.  I will save the detailed discussion for the next post.  Basically, this is the secret to how my client can run a large IT monitoring operation with just a 3-person team, compared to at least 6 people in similarly sized organizations.

Monday, October 20, 2014

Total Cost of Ownership of BPPM - Part 2: Customization

The BPPM out-of-box solution provides basic monitoring features, as do many other low-priced and open-source monitoring solutions.  The reason you paid a premium price for BPPM is not just the brand name; it is the flexibility to customize the solution to fit your unique business requirements exactly.  That is how you fully realize the ROI from BPPM.

Your implementation is not just installation and configuration.  Every enterprise IT organization ends up customizing BPPM - some with careful planning, some as a rescue mission.  I have been on enough rescue missions to tell you that well-planned customization will cost you far less time, money, and headache.

Customization is one of the major factors affecting total cost of ownership.  How can it affect your total cost of ownership?

1) Negatively, it costs you extra time and money in your implementation project.

2) Positively, it adds the functionality that your business requires to improve your bottom line.

3) More positively, it reduces the work in maintenance/operations through automation and consistency so you need a smaller operations team - that is a big cost saving year after year.

Many people might think customization increases maintenance/operations work.  That would be the case if the customization were not done right.  I have seen clean customization that streamlined maintenance/operations work, and I have seen messy customization that became a maintenance nightmare.  As with customizing your house, if the customization follows standards and consistency, it will reduce maintenance/operations work.  A simple example would be customizing the light switches in your house so you can control all lights remotely through a mobile app.  We will talk about standards and consistency in later posts.

Not only do you need to budget extra time and money for customization in your implementation project, you also need to hire the right developer for it, because customization is primarily code development and requires a totally different skill set from installation.

Unfortunately, customization is often under-addressed by both BMC and BMC customers.  BMC just tells you that BPPM is totally customizable without providing the necessary training and best practices for customization.  BMC customers are aware of the need for customization but have no solid understanding of what detailed features/automation they need and what it takes to get there.

As a consulting practice, we have specialized in BPPM customization for the last 12 years.  Please feel free to contact us if we can be of any assistance with your BPPM customization.  We also provide training in BPPM customization, including advanced MRL programming and PSL programming, that is not available from BMC.

Tuesday, October 14, 2014

Total Cost of Ownership of BPPM - Part 1: Introduction

The total cost of ownership of BPPM includes the cost of hardware, software license, implementation, day-to-day operations and maintenance.  The total cost of ownership must be kept in mind starting in the planning phase and through the entire implementation phase as well as the operations phase.  But most importantly, how BPPM is implemented makes a huge difference not just on your implementation cost but also your operations cost year after year.

Implementing BPPM is like building a house.  Not only do you care how fast the house can be built, you also care whether the house has enough features and options to fit your needs.  And more importantly, you care how much it will cost you in maintenance, repair, and utilities year after year.

Several years ago, when I bought my first newly-built house, I knew very little about total cost of ownership.  Because the house was poorly insulated during construction, I have been getting costly air conditioning bills every summer.  Had I known what I know now, I would have been more selective when I picked a builder.

Similarly, if you are responsible for your enterprise system monitoring, you may want to constantly ask these questions during implementation: How many people do I need for operations?  How much can be automated, and how much must be manual work?  What customization do I need?  Can this customization be easily extended in the future?  Are the different components in the implementation consistent enough that knowledge can be transferred easily?  Can most of the operations procedures be created and understood before implementation is complete?

If you hire consultants for BPPM implementation, you need to work closely with them to make sure the total cost of ownership is part of the objectives of the implementation.  Don't assume that a consultant will automatically keep the total cost of ownership in mind, since doing so may require more work during implementation.  The difference between a good consultant and a mediocre consultant is not just how fast the implementation can be done, but how well the implementation meets your needs and, most importantly, how much it can save you in operations.

What is a good size for your operations team?  As an example, one of my clients with 7 monitoring solutions (BMC and other vendors) has needed only a 3-person team for the last 5 years to keep mission-critical multi-hospital operations running without experiencing any downtime.  And no one needs to work overtime.  I have seen many similar-sized IT operations that need 6-10 people just to keep up with operations support.  If you can cut just one operations position from the team with the right implementation, that is a minimum $100K saving year after year.

In the next few posts, we will go through some best practices on how to reduce the total cost of ownership.


Monday, October 6, 2014

PatrolCli - Part 7: Dynamically activate diagnostic data collection in PATROL

If you monitor web applications using BMC TM-ART or a similar product, you will receive a critical event in the BPPM cell when the response time exceeds your threshold.  When this happens, you may wish you could collect detailed performance data at the EJB level to help diagnose the poor response time.  So you probably end up developing a custom PATROL knowledge module to collect detailed EJB data from your J2EE application server.

However, you don't want your EJB KM running all the time, since it is resource intensive.  It would make sense if you could activate and deactivate your EJB KM dynamically.  When your BPPM cell receives a critical event about slow web application response time, your EJB KM would be activated automatically.  When the same web application response time goes back to normal, your EJB KM would be deactivated automatically.

As you can see, the BPPM cell is the only component that knows when to activate and deactivate your EJB KM, because it receives and clears the slow web application response time alert.

Since PATROL agent is normally installed on BPPM cell server, you can use PatrolCli to activate and deactivate your EJB KM.

For example, if you have a custom PATROL KM called EJB_KM, you can activate it by including the following PatrolCli command in an OS script called activate_ejb_km.cmd located on the BPPM cell server:

execpsl "set(\"/EJB_KM/active\", 2);"

Then simply invoke activate_ejb_km.cmd in the execute() function of an MRL rule when the BPPM cell receives a critical event for slow web application response time.

Similarly, you can deactivate EJB_KM by including the following PatrolCli command in an OS script called deactivate_ejb_km.cmd located on BPPM cell server:

execpsl "set(\"/EJB_KM/active\", 0);"

Then simply invoke deactivate_ejb_km.cmd in the execute() function of an MRL rule when the previous critical event is cleared in the BPPM cell.

Monday, September 29, 2014

PatrolCli - Part 6: Re-sync parameter status with PATROL agents upon exiting blackout in BPPM cell

Event blackout rules or event blackout policies in the BPPM cell are something we all rely on to suppress alerts during regularly scheduled maintenance windows.  Upon exiting the blackout period, if a PATROL parameter alert (e.g. a process-down alert) is still present, what should you do?

If you choose to ignore it and the process is still down, no one will be notified.  A PATROL agent generates an alarm event only once when a process goes down.  If the process went down during the blackout period with no notification sent from the BPPM cell, the PATROL agent will never generate another alarm event, even if the process remains down after the blackout period ends.

If you choose to send a notification for every suppressed alert in the BPPM cell upon exiting blackout, you may send out lots of false alarms.  During the maintenance window, many PATROL agents may be restarted as a result of server reboots or PATROL configuration changes.  A process that was previously down may have been brought back up by the PATROL agent or server restart.  However, a newly started PATROL agent will not generate an OK event, since there is no state change on the PATROL parameter.

Either way, we have a problem.  The best solution is for the BPPM cell to re-check the PATROL parameter status for each outstanding alert upon exiting blackout.  From all the PATROL users I have talked to, this is one of the most-wanted features for event blackout.  Although this feature doesn't come out of the box, you can write your own code using PatrolCli.

For example, you can use the following PatrolCli command to check 'mcell' process status:

PCli% execpsl get("/NT_PROCESS/mcell/PROCStatus/status");
OK

It does require some advanced MRL programming skill to tie everything together.  If you need more help, please feel free to contact us for consulting services.  We have developed a proprietary extension for the BPPM cell that addresses many out-of-box limitations, including event blackout.
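As a rough illustration, the decision logic around the re-check can be sketched as a shell script wrapped around the PatrolCli call above.  This is only a sketch: check_param is a hypothetical stand-in for the 'execpsl get(.../status)' call, stubbed here so the sketch runs on its own.

```shell
#!/bin/sh
# Hedged sketch: decide what to do with an outstanding alert after blackout,
# based on the parameter's live status.  check_param stands in for the
# PatrolCli 'execpsl get(.../status)' call shown above; it is stubbed so
# the sketch is self-contained.
check_param() {
  # Production version would invoke PatrolCli against the agent that owns $1.
  echo "OK"
}

status=$(check_param "/NT_PROCESS/mcell/PROCStatus")
if [ "$status" = "OK" ]; then
  echo "Parameter recovered during blackout - close the suppressed event"
else
  echo "Parameter still in $status - send the deferred notification"
fi
```

In a real deployment, the two branches would drive the cell (e.g. close the event or release the notification) rather than just echoing.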

Monday, September 22, 2014

PatrolCli - Part 5: Remote control PATROL agent recovery actions from BPPM cell

PATROL agent recovery action is a script triggered by a parameter state change.  It is often used to restart a process when it is down or dump diagnostic messages when a performance indicator shows an abnormal value.

A PATROL agent recovery action is a powerful feature.  The drawback is that the triggering parameter value can only be set by the local PATROL agent.  In today's complex enterprise IT, it sometimes takes data/events from multiple servers or multiple monitoring products to determine whether a recovery action is necessary.  Sometimes it takes BPPM Analytics to determine that a performance indicator is out of its normal range.

When multiple servers, multiple monitoring products, or BPPM Analytics are involved, the BPPM cell is the only component capable of knowing that a PATROL agent recovery action is needed.

So how can BPPM cell communicate back to PATROL agent and trigger PATROL agent recovery action?

Since a PATROL agent is normally installed on the BPPM cell server, you can use PatrolCli to change the state of a PATROL parameter remotely from the BPPM cell server.  The state change of the PATROL parameter will trigger the attached recovery action immediately.

For example, if you have a recovery action attached to parameter /NT_OS/NT_OS/_CollectionStatus, you can change its state to ALARM by including the following PatrolCli command in an OS script called trigger_recovery.cmd located on BPPM cell server:

execpsl "set(\"/NT_OS/NT_OS/_CollectionStatus/status\", ALARM);"

Then simply invoke trigger_recovery.cmd in the execute() function of an MRL rule when the BPPM cell determines that it is time to trigger a PATROL recovery action.  This determination can be made by correlating events sent from multiple servers by multiple monitoring products, or by receiving an intelligent event generated by BPPM Analytics.

Monday, September 15, 2014

PatrolCli - Part 4: Event reporting to help troubleshoot missing events

This happens to all of us: you are expecting a critical event from a PATROL agent, but for some reason you don't see it in the BPPM/BEM GUI.  Now you need to troubleshoot.

First, you need to determine whether the PATROL agent never sent the event or a cell rule/policy dropped it.  To check whether the PATROL agent sent the event, you can use the PATROL console and look through the events in its event manager GUI.  However, this method is not very reliable, especially when your PATROL agent generates a lot of events.  If you cannot find the event in the PATROL console, it doesn't necessarily mean the PATROL agent didn't send it.

A quicker and more reliable way to determine whether the PATROL agent sent the event is to run an event report using PatrolCli.  Not only does it save you the effort of bringing up and logging into the PATROL console, it also saves all events matching your criteria to a text file so you can search through it over and over for multiple events.

Here is an example of generating an event report.  Assume you are expecting the PATROL agent running on server1 to send an ALARM/Critical event from parameter /NT_LOGICAL_DISKS/C:/LDldDiskTimePercent, but for some reason you don't see it in the BPPM/BEM GUI.  You would like to find out whether the PATROL agent ever sent that event to the BPPM cell.

To do so, you can simply start PatrolCli from the server you are currently working on, as long as it has a PATROL agent running and permission to connect to the PATROL agent on server1.  Then you just set up an event filter and run an event dump command to save your event report to a file on your current server.

Myserver> PatrolCli
PCli% open server1 3181
Username: patrol
Password:
PCli% event setfilter 091520002014 "" "" A "" "" "" "" ""
OK
PCli% event dump C:\tmp\events.txt W
OK
 
In this example, the setfilter command sets the event filter to all ALARM events from Sep 15 20:00:00 2014 until now.  The event dump command specifies the location of the event report output file.  Mode 'W' is for write; mode 'A' is for append.

Once the command has finished, you can open the event report file with a text editor and search for the events generated by parameter /NT_LOGICAL_DISKS/C:/LDldDiskTimePercent.  For example, you may see your event in the report like this:

Id          : 2118531
Status      : OPEN
Type        : ALARM
Severity    : 4
Time        : Mon Sep 15 22:46:40 2014
Node        : server1
Origin      : NT_LOGICAL_DISKS.C:.LDldDiskTimePercent
Catalog     : 0
Class       : 11
Description : Alarm #2 of global parameter 'LDldDiskTimePercent' triggered on 'NT_LOGICAL_DISKS.C:'.  69 <= 100.00 <= 100

Monday, September 8, 2014

PatrolCli - Part 3: Run PatrolCli commands from an OS script

In order to execute PatrolCli commands automatically on a schedule, you will need to put them in an OS script - e.g., a shell script on UNIX or a batch script on Windows.  Once the PatrolCli commands are in an OS script, you can schedule the script to run periodically using any vendor's scheduling utility or a native OS scheduler (such as cron on UNIX or Task Scheduler on Windows).

PatrolCli has a '-f' option to execute commands in batch mode.  You can use the following format to supply all PatrolCli commands in an input text file instead of typing them interactively at 'PCli%' prompt:

PatrolCli -f input.txt

Often we want to capture the output of the above command and send it out as an email, or use it as input to the 'msend' command to generate an event in the BPPM cell.  In the following Windows example, we want to detect whether the VSM_VC KM has stopped collecting data for at least 15 minutes on host 'server1'.  If so, an email will be sent from PatrolCli@yourcompany.com to bppm_admin@yourcompany.com.

Step 1: Create a text file called input.txt as follows:

user patrol password
connect server1 3181
execpsl int(time()-get("/VSM_VC/crpvms1vcenter2.yourcompany.com/Comm_Status/time"));
exit

Step 2: Create a batch script called VSM_VC_Check.cmd as follows:

@ECHO OFF
REM Capture the PatrolCli output (seconds since the last VSM_VC collection)
PatrolCli -f input.txt > output.txt
set /p result= <output.txt
del output.txt

REM Alert if no data has been collected for 15 minutes (900 seconds) or more
if %result% GEQ 900 (
echo No VSM_VC data collected for %result% seconds > tmp.txt
smail.exe tmp.txt -t bppm_admin@yourcompany.com -s "VSM_VC data collection stopped" -server <SMTP_Server> -f PatrolCli@yourcompany.com
del tmp.txt
)

Step 3: Schedule script VSM_VC_Check.cmd with Windows admin tool to run it periodically (e.g. every 15 minutes).
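On UNIX, the same check could be sketched as a shell script.  This is a minimal sketch rather than a tested equivalent: the PatrolCli result is hard-coded where the real script would capture it, and the mail step is only indicated in a comment because mailer commands vary by site.

```shell
#!/bin/sh
# Hedged UNIX sketch of the VSM_VC check (threshold and message mirror the
# batch example above).

# Print an alert message when the collection gap exceeds the threshold.
# $1 = seconds since last collection, $2 = threshold in seconds
alert_if_stale() {
  if [ "$1" -ge "$2" ]; then
    echo "No VSM_VC data collected for $1 seconds"
  fi
}

# In production, capture the PatrolCli result instead of hard-coding it:
#   result=$(PatrolCli -f input.txt)
result=1000
msg=$(alert_if_stale "$result" 900)
if [ -n "$msg" ]; then
  echo "$msg"
  # e.g. pipe to: mailx -s "VSM_VC data collection stopped" bppm_admin@yourcompany.com
fi
```

Schedule it with cron just as the Windows version is scheduled with Task Scheduler.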

Monday, September 1, 2014

PatrolCli - Part 2: Monitor PATROL agent health outside of PATROL/BPPM framework

Sometimes, you have a need to monitor PATROL agent health outside of PATROL/BPPM framework either manually or automatically using a different tool.  For example, during the upgrade of BPPM server or PATROL console, you may not be able to access PATROL agents using BPPM console or PATROL console.

Using PatrolCli, you can check PATROL agent health running on a remote server without BPPM console or PATROL console.  Based on the output you have received, you will be able to tell if the PATROL agent running on a remote server is experiencing a problem.

If PATROL agent is down, you will receive an error message similar to the following:

Myserver> PatrolCli
PCli% open RemoteServer 3181
Username: patrol
Password:
Can't connect to RemoteServer (TCP/3181) as patrol : connecting to agent RemoteServer ...

If PATROL agent is running but stopped collecting data, you can use PatrolCli to check the latest timestamp of a common parameter, e.g., CPUprcrProcessorTimePercent on Windows or CPUCpuUtil on UNIX.

Here is an example to retrieve the latest timestamp of CPUprcrProcessorTimePercent. The timestamp is displayed as epoch time.

PCli% execpsl get("/NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time");
1409639472

Comparing the latest timestamp for CPU data collection with current time:

PCli% execpsl time();
1409639562

If the difference between the current time and the latest data collection timestamp is too long (say, more than 10 minutes), you can reasonably conclude that the PATROL agent has stopped collecting data.  In our example, 1409639562 - 1409639472 = 90 seconds, so data collection looks good.
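The comparison itself is trivial to script.  Here is a quick sanity-check sketch using the two sample timestamps from the session above (the 10-minute threshold and the messages are illustrative):

```shell
#!/bin/sh
# Staleness check using the sample timestamps shown above.
now=1409639562    # from: execpsl time();
last=1409639472   # from: execpsl get(".../CPUprcrProcessorTimePercent/time");
gap=$((now - last))

if [ "$gap" -gt 600 ]; then   # more than 10 minutes with no collection
  echo "PATROL agent appears to have stopped collecting data (gap: ${gap}s)"
else
  echo "Data collection looks good (gap: ${gap}s)"
fi
```

With the sample values, the gap is 90 seconds and the script prints "Data collection looks good (gap: 90s)".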

In the next post, we will discuss how to run PatrolCli from a script so you can use another scheduling tool such as UNIX cron or Windows admin program to periodically check PATROL agent health automatically.

Monday, August 25, 2014

PatrolCli - Part 1: Introduction

PatrolCli is a command-line program based on the PATROL API.  It runs on all PATROL agent platforms and provides access to the functions of a PATROL agent.

Unlike the ProactiveNet CLI that you use to manage ProactiveNet every day, PatrolCli is not used in a standard PATROL/BPPM operations environment.  That is why many people are not even aware of its existence.  However, PatrolCli is a powerful utility.  Understanding when and how to use it can save you time, lower your operations cost, and make your life a lot easier.

You can use PatrolCli to:

1. Access PATROL agents on all platforms.
2. Read PATROL objects from a PATROL agent.
3. Operate on PATROL events.
4. Execute PSL scripts.

In the next few posts, we will go through some examples on how to take advantage of PatrolCli.

PatrolCli is an executable program located in the $PATROL_HOME/bin directory.  It is installed automatically with the PATROL agent installation.  You can use PatrolCli to access another PATROL agent in your network without a PATROL console.

Before you start PatrolCli, you need to set up your environment.  On Windows, you need to set up environment variable %PATROL_HOME%.  On UNIX, run '. ./patrolrc.sh' for korn shell or 'source .patrolrc' for C shell.

Here is an example on how to run PatrolCli interactively:

Myserver> PatrolCli
PCli% open RemoteServer 3181
Username: patrol
Password:
PCli% help

Here 'RemoteServer' is the host name of another server with a PATROL agent running on port 3181.  When you type 'help' in PatrolCli, it displays all valid commands: user, connect, open, list, listvars, get, getvar, event, echo, execpsl, help, source, ping, patrol, !!, exit, and quit.  The '!!' command repeats the last command; the 'exit' command exits PatrolCli.  For each command, you can get more information by typing 'help' followed by that command.  For example, 'help get' gives you more information on the get command.

Here is an example on how you can use 'get' command in PatrolCli to display a parameter value (99.904922) and state (OK) without a PATROL console:

PCli% get /NT_CACHE/NT_CACHE/CACcachCopyReadHitsPercent
Mon Aug 25 23:42:58 2014
-1
0
99.904922
OK
3
0
0
100


Monday, August 18, 2014

PATROL Agent 9.5 - Part 7: How to configure a PATROL recovery action to run as a different user?

The information included in this post applies to all versions of PATROL agent, not just version 9.5. 

By default, if you configure a PATROL recovery action to run when a parameter is in ALARM or WARNING state, the recovery action script will be executed by PATROL default account.  What if you want to execute some commands within the recovery action script as a different account?  For example, what if you want to start SQL*Net listener within your recovery action as account 'oracle' instead of 'patrol'?

There is a PSL built-in function called execute.  Its format looks like this:

execute(type, text, [instance], [username], [password]);

For example, you can add the following PSL statement in your recovery action to start the SQL*Net listener automatically when the SQL*Net listener status parameter is in ALARM state:

execute("OS", "lsnrctl start", "", "oracle", "password");

where "password" can be either a clear text or an encrypted password.

To protect the password, you can encrypt it offline and save the encrypted password in a pconfig variable.  Prior to calling execute(), you simply use get([variable_name]) to read the password back.

For example, if you have saved your oracle encrypted password in a pconfig variable called "/ORACLE/Encrypted_Password", you can include the following commands in your recovery action PSL script to automatically start SQL*Net when it is down.

pass = get("/ORACLE/Encrypted_Password");
execute("OS", "lsnrctl start", "", "oracle", pass); 

The reason you can use get("/ORACLE/Encrypted_Password") instead of pconfig(GET, "/ORACLE/Encrypted_Password") is that all pconfig variables are loaded into the agent namespace during PATROL agent start-up.

Monday, August 11, 2014

PATROL Agent 9.5 - Part 6: Do I still need PATROL Event Management KM?

You have heard from BMC for a while that PATROL Event Management KM is being phased out.  You wonder if you still need to run PATROL Event Management KM with PATROL Agent 9.5.

What does PATROL Event Management KM do?  Here are five features offered by PATROL Event Management KM:

1) Check if a server is up and if PATROL agent is up through its AS_AVAILABILITY KM;
2) Manage event generation and format through PATROL console or pconfig variable editing with or without PATROL Notification Server;
3) Set parameter thresholds through PATROL console or pconfig variable editing;
4) Turn off certain parameters through PATROL console or pconfig variable editing;
5) Add recovery actions through PATROL console or pconfig variable editing.

In PATROL agent version 9.5, the above feature 1) is replaced by automatically generated  MC_CELL_HEARTBEAT_FAILURE and MC_CELL_HEARTBEAT_ON events in BPPM cell.  When a server or PATROL agent goes down, you will automatically receive a MC_CELL_HEARTBEAT_FAILURE event.  You no longer need AS_AVAILABILITY KM.

In PATROL agent version 9.5, the above feature 2) is replaced by a set of pconfig variables starting with /EventSetup and /EventRules.  These pconfig variables provide many different ways to control and format the events from PATROL agents.  You no longer need PATROL Notification Server or use PATROL Event Management KM to control and format your events.

In PATROL agent version 9.5, the above feature 3) is replaced by the "Agent Threshold Configuration" section of CMA (Central Monitoring Administration).  You no longer need to use PATROL Event Management KM to set parameter thresholds.

In PATROL agent version 9.5, the above feature 4) is partially replaced by "Monitoring Configuration" of CMA when you want to turn off the data collection for the entire monitoring type (application class).  However if you only want to turn off certain parameters in an application class, you still need to use PATROL Event Management KM.

In PATROL agent version 9.5, the above feature 5) still cannot be replaced by anything else.  If you want to add a recovery action in PATROL agent without modifying the KM source code, you still need to use PATROL Event Management KM.

In summary, whether you can drop PATROL Event Management KM from your PATROL agent 9.5 configuration depends on whether you need to turn off individual parameters and whether you need to add your own recovery actions.


Monday, August 4, 2014

PATROL Agent 9.5 - Part 5: Data and event filtering

In BPPM 9.5, you can add 'Filtering Configuration' in a monitoring policy using CMA (Central Monitoring Administration).  Would this feature replace the old way you used to configure data and event filtering?

The short answer is no, though you can use CMA to enhance the old way, as described at the end of this post.  The 'Filtering Configuration' section in a CMA policy controls whether data and events will be streamed to the BPPM server.  There are four options:
  • No Filtering - the default (send all data and events)
  • Filter out data (send no data, send all events)
  • Filter out events (send no events, send all data)
  • Filter out data and events (send no data, send no events) 

Even if you select 'Filter out data and events', all data and events will still be sent to the BPPM server on demand.  They are just not streamed.

So how do you stop sending data and events to the BPPM server entirely?  Let's take a closer look.

1) Filtering on parameter data

For data you don't need, you don't want the data being collected by the PATROL KM in the first place, so that you can reduce the PATROL agent's footprint and resource consumption.  Since CMA only allows you to control collection at the monitoring type (application class) level, you still need to use Event Management KM to control it at the parameter level.  For example, if you don't want to see parameter CACcachCopyReadsPerSec, your pconfig variable setting would look like this:

"/AS/EVENTSPRING/PARAM_SETTINGS/THRESHOLDS/NT_CACHE/NT_CACHE/CACcachCopyReadsPerSec" = { REPLACE = "0,0 0 0 0 0 0,0 0 0 0 0 0,0 0 0 0 0 0" }

2) Filtering on event forwarding

For events you are not interested in, such as information and login events, you don't want those events forwarded to a BPPM Integration Service or cell.  You need to create some pconfig variable settings to activate and deactivate event types.  For example, if you only want to forward type 11 events (value of parameter is in alarm zone) and Notify Events, your pconfig variable settings would look like this:

/EventSetup/Format/BiiP3/catalogs/0/active = { REPLACE = "0"},
/EventSetup/Format/BiiP3/catalogs/0/types/11/active = { REPLACE = "1"},
/EventSetup/Format/BiiP3/catalogs/0/types/11/KM/AS_EVENTSPRING/active = { REPLACE = "0"},
/EventSetup/Format/BiiP3/catalogs/AS_EVENTSPRING/types/NOTIFY_EVENT/active = { REPLACE = "1"}

In the old way, you would save the pconfig settings for the above two cases in PCM (PATROL Configuration Manager) and push them manually to each PATROL agent they apply to.  With CMA in BPPM 9.5, you can continue using PCM, or you can save them in the 'Configuration Variables' section of a CMA policy so that they automatically apply to all PATROL agents with matching tags.


Monday, July 28, 2014

PATROL Agent 9.5 - Part 4: How to include annotated data as part of the event?

The information included in this post actually applies to all versions of PATROL agent, not just version 9.5.  

Many PATROL KMs save important information to an annotated data point when changing a parameter status to ALARM or WARNING.  By default, this annotated data is not included in the event sent to BPPM cell.  What if you want to include annotated data in the event sent to BPPM cell so that it will be included in the notification email and incident ticket?

To achieve that, you will need to write a recovery action at the PATROL agent.  Although PATROL Event Management KM is no longer required for PATROL agent 9.5 from an architecture point of view, to avoid modifying the original KM code I still suggest using PATROL Event Management KM, so that you can add a recovery action by simply deploying some pconfig variable settings.

Here is an example of recovery action to include annotated data.  In this example, MS_HW_ENCLOSURE KM is used but you can apply the same concept to other KMs.  MS_HW_ENCLOSURE KM has a parameter named 'Status'.  When a blade goes down, this parameter's status becomes ALARM and an annotated data point is generated.


PATROL_CONFIG
"/AS/EVENTSPRING/MS_HW_ENCLOSURE/__ANYINST__/Status/arsAction" = { REPLACE = "6" },
"/AS/EVENTSPRING/MS_HW_ENCLOSURE/__ANYINST__/Status/arsCmdType" = { REPLACE = "PSL"},
"/AS/EVENTSPRING/MS_HW_ENCLOSURE/__ANYINST__/Status/arsCommand" = { REPLACE = "C:\\BMC\\Patrol3\\lib\\psl\\Annotate_Recovery.psl" }


Then write a PSL script C:\BMC\Patrol3\lib\psl\Annotate_Recovery.psl as follows:
sleep(1);
status = get("/MS_HW_ENCLOSURE/".__instance__."/Status/status");
txt = annotate_get("/MS_HW_ENCLOSURE/".__instance__."/Status");
txt = nthargf(txt, "2-", "\\");
origin = "MS_HW_ENCLOSURE.".__instance__.".Status_Annotate";
event_trigger2(origin, "STD", "41", status, "4", txt);

A few things to notice:
1) PATROL notification server is not required to run the above recovery action. 
2) The sleep(1) in the first line adds a delay so that the KM has enough time to finish writing the annotated data before annotate_get() is called.
3) The nthargf() call on the 4th line filters out the leading %Text %Infobox string included in every annotated data point.
4) I put 'Status_Annotate' as the last part in event origin so that this event's mc_parameter slot would become 'Status_Annotate' in the cell.  You can use any string you like.
5) The annotated data will be displayed in the event's mc_parameter_value slot.

Monday, July 21, 2014

PATROL Agent 9.5 - Part 3: Do I still need PATROL console with BPPM 9.5?

In BPPM 9.5, many features in PATROL console have been added to BPPM operations console.  I have heard this question over and over again: Do I still need PATROL console?

The answer is that you should still install PATROL console but you may not need to use it all the time and on all PATROL agents.

Let's take a look on what you use PATROL console for.

1) To develop PATROL custom knowledge modules:

Since there is no other way to develop a custom PATROL knowledge module in BPPM 9.5, you still need to use the PATROL classical console.

2) To configure PATROL knowledge modules:

With the newly available menu commands and CMA (Central Monitoring Administration) options, you can configure many PATROL knowledge modules in the BPPM 9.5 operations console and CMA.  However, for some PATROL knowledge modules, configuration via the BPPM 9.5 operations console is still very difficult or impossible.  Some features available in the PATROL Central console, such as disabling parameter data collection, are still not available in the BPPM operations console except by directly typing the pconfig variable values into CMA.

To avoid frustration, it is recommended to configure non-trivial PATROL knowledge modules using either the PATROL Central console or the PATROL classical console.  You only need to configure one PATROL agent for each distinct configuration.  Carefully save all the related pconfig variables after each configuration of each PATROL KM.  You may then need to import some of them into either PCM (PATROL Configuration Manager) or CMA, depending on which one you use for mass configuration deployment.

3) To view collected data:

You should be able to view data collected by all PATROL agents in the BPPM 9.5 operations console.  Data that you configured as streamed is displayed in real time, since it is stored in the BPPM database.  Data that you configured as non-streamed is still viewable, because the BPPM 9.5 operations console retrieves it from PATROL agents on demand.  However, non-streamed data may be displayed with a delay.

4) Troubleshooting:

In theory, you should be able to perform most troubleshooting using BPPM 9.5 operations console.  However, due to the possible delay of non-streamed data and the fact that PATROL Central console Windows edition is many times faster than BPPM 9.5 operations console (a web GUI), you would most likely prefer to use PATROL console for troubleshooting.  If you need to verify or change some complicated KM configuration, you would have to use PATROL console.

In summary, with BPPM 9.5, you would most likely use your PATROL classical console as a developer console, your PATROL Central console as an admin console, and your BPPM console as an operations console.

Monday, June 30, 2014

PATROL Agent 9.5 - Part 2: When PATROL agent hangs

In the last post, we discussed how to detect when PATROL agent is unreachable, including network down, server down, and PATROL agent down.  One of my readers asked me whether the same method can be used to detect if PATROL agent hangs.

When PATROL agent hangs, the PATROL agent process is still up with a heartbeat.  Therefore the method we discussed in the last post won't be able to tell that the agent has hung.  When PATROL agent hangs, it simply stops collecting data.

The best way to detect when PATROL agent hangs is to write your own script to check when the last data collection happened.  Pick a common parameter with frequent data collection, e.g., /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent on Windows.  By default, CPUprcrProcessorTimePercent is collected every minute.  When you query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time, you will receive the last data collection time stamp as Epoch time.  By comparing this time with the current time stamp, you will know how long it has been since CPU data was last collected.  If it has been a while (e.g., longer than 10 minutes), you can reasonably conclude that PATROL agent is hung.

You can use PatrolCli to query /NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time.  PatrolCli is an executable program installed automatically with PATROL agent under the $PATROL_HOME/bin directory.  It allows you to connect to any PATROL agent in your network, given the right password and permission, and to run a subset of PSL commands from the command line.  Type 'PatrolCli' at the command line and use the 'help' command to see all the commands it supports.
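The check itself can be sketched in a short script.  This is a minimal sketch only: the PatrolCli command sequence, port, credentials, and output parsing below are assumptions you will need to adjust to what your agent actually returns (check 'PatrolCli help' on your system); the staleness comparison is the part that carries the logic.

```python
import subprocess

STALE_THRESHOLD = 600  # seconds; treat the agent as hung after 10 minutes of silence

def is_stale(last_collect_epoch, now_epoch, threshold=STALE_THRESHOLD):
    """Return True when the last data collection is older than the threshold."""
    return (now_epoch - last_collect_epoch) > threshold

def query_last_collect_time(host, port=3181, user="patrol", password="secret"):
    """Ask PatrolCli for the last collection time of the CPU parameter.

    The command sequence and output format here are assumptions; verify
    them against 'PatrolCli help' and adjust the parsing accordingly.
    """
    script = (
        "connect {0} {1} {2} {3}\n"
        'psl get("/NT_CPU/CPU__Total/CPUprcrProcessorTimePercent/time");\n'
        "quit\n"
    ).format(host, port, user, password)
    out = subprocess.run(["PatrolCli"], input=script,
                         capture_output=True, text=True).stdout
    # Assume the Epoch timestamp comes back as a line containing only digits.
    for line in reversed(out.splitlines()):
        if line.strip().isdigit():
            return int(line.strip())
    return None  # could not parse a timestamp; treat as a failed check

# Usage sketch (placeholder host name):
#   import time
#   last = query_last_collect_time("myagenthost")
#   if last is None or is_stale(last, int(time.time())):
#       ...send an alert...
```

If the timestamp is missing or older than the threshold, raise an alert; a missing timestamp usually means the agent did not answer the query at all, which is worth alerting on anyway.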

Because you are trying to detect whether PATROL agent hangs, it is best to use a mechanism outside of the PATROL framework to schedule your script and send out the alerts.  For example, you can use cron on UNIX or a scheduled task on Windows to run your script periodically.  When you detect that PATROL agent hangs, you can send alert emails or use the msend command to report your alerts directly to a BPPM/BEM cell.
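As a sketch, a crontab entry could run the check every 10 minutes, and the script could raise an event with msend when the check fails.  The script path, cell name, and event class below are placeholders for your environment; verify the msend options against your cell documentation:

```
# crontab entry: run the hang check every 10 minutes
*/10 * * * * /opt/patrol/scripts/check_agent_hang.sh

# inside check_agent_hang.sh, on failure (cell name and class are placeholders):
msend -n pncell_bppm01 -a PATROL_AGENT_HANG -r CRITICAL -m "PATROL agent appears hung: no CPU data for over 10 minutes"
```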

I will be out of country and won't be able to post any blogs for the next two weeks.  This blog will resume on July 21, 2014.  Have a great summer!

Monday, June 23, 2014

PATROL Agent 9.5 - Part 1: When PATROL agent is unreachable

I got a question recently regarding the best way to monitor PATROL 9.5 agent health.  In one of my older posts, "Monitoring PATROL Agent 9.x Status: Do I need AS_AVAILABILITY KM?", I discussed using the MC_CELL_HEARTBEAT_FAILURE event instead of AS_AVAILABILITY KM to detect the condition when a PATROL agent is not responding.  This still applies to PATROL agent version 9.5.

Similar to AS_AVAILABILITY KM, there are 3 possible causes when a MC_CELL_HEARTBEAT_FAILURE event is received: PATROL server down, PATROL agent down, or network down. 

MC_CELL_HEARTBEAT_FAILURE events are triggered and reported automatically.  You don't need to configure anything as long as you have the following pconfig variables set in each PATROL agent:

"/EventSetup/Configuration/Format" = { REPLACE = "BiiP3" },
"/EventSetup/Configuration/Key" = { REPLACE = "mc" }
"/AgentSetup/integration/integrationServices"={REPLACE="tcp:ISN1:3183,tcp:ISN2:3183"}

and either

"/EventSetup/Configuration/ForwardEventsToIS" = { REPLACE = "Yes" }
or

"/EventSetup/Configuration/EventCells" = { REPLACE = "ISN1/1828,ISN2/1828" }

When PATROL agent is unreachable, you will receive a MC_CELL_HEARTBEAT_FAILURE event.  When PATROL agent is reachable again, you will receive a MC_CELL_HEARTBEAT_ON event.  You do need to reword the msg event slot to make the message more meaningful.
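For example, a cell rule along the following lines could reword the slot.  This is only a sketch: the rule name and message text are made up, and the exact MRL syntax should be verified against your cell's Knowledge Base before compiling:

```
new reword_heartbeat_failure : MC_CELL_HEARTBEAT_FAILURE ($EV)
  where [ $EV.status == OPEN ]
{
  $EV.msg = "PATROL agent unreachable: check agent process, server, and network";
}
END
```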

We will discuss how to detect when PATROL agent hangs in the next post.


Monday, June 16, 2014

BPPM 9.5 Quick Start - Part 11: High availability consideration

As the final post in 'BPPM 9.5 Quick Start' series, let's look at high availability.  Without a doubt, BPPM 9.5 has made some great progress in high availability by introducing integration service clusters. A BPPM 9.5 integration service node (ISN) can fail over to the other ISN in the cluster seamlessly with no data loss. PATROL agents connected to the ISN cluster can buffer and resend data for up to 30 minutes.

However, no progress has been made in the high availability of BPPM server.  Your only option is still a disk-based cluster provided by the operating system, such as Microsoft Windows Cluster.  There are a couple of drawbacks to a disk-based BPPM cluster: 1) up to 10 minutes of downtime after the primary server goes down and before the secondary server comes up; 2) high cost - the cost of the two servers in the cluster plus the software is usually about the cost of three servers.

If your business cannot justify implementing a disk-based BPPM server cluster, especially if you are also required to implement a DR BPPM server in another data center, you may wonder whether there is something else you can do to improve the high availability of your overall solution without incurring the cost and complexity of a disk-based server cluster.

In a business, the most critical incidents that violate service level agreements are availability alerts.  Without high availability of BPPM server, you will need to use a pair of high-availability remote BPPM cells, instead of the embedded cells on BPPM server, to send notifications and initiate incident ticket creation.  Actually, as long as those availability alerts do not come out of service models, a pair of H/A BPPM remote cells usually works better than the embedded cells in a disk-based BPPM server cluster because it is a native application cluster.

In order to initiate incident ticket creation on BPPM remote cells, you will need to install IBRSD on these cells.  This step is in addition to installing CMDB integration module on BPPM server.  CMDB integration module automatically includes IBRSD.  Now you can let BPPM server initiate service model related incident ticket creation and let BPPM remote cells initiate incident ticket creation not related to service models.

To address lack of GUI access when BPPM server goes down, you can install BEM version 7.x login server (also called admin server) and BMC Impact Explorer (IX) as an emergency GUI.  Register all your BPPM cells with the BEM login server.  When BPPM server goes down, you can still see all of your BPPM remote cells from IX.  All BPPM 9.x cells display well in BEM 7.x IX.

In BPPM 9.5, the integration service has been made totally stateless so that PATROL data travels through it to BPPM server without stopping.  If BPPM server goes down, PATROL data will be buffered at the PATROL agent for up to 30 minutes.  This means no data loss as long as BPPM server is up again within 30 minutes.  Without data, however, you will lose intelligent events such as anomalies and trend predictions.  If these are not critical to your business for a short period of time, you can save some cost right now by postponing BPPM server H/A implementation until BMC comes up with an application-level BPPM server H/A solution.

Monday, June 9, 2014

BPPM 9.5 Quick Start - Part 10: PATROL agent installation and configuration

There are several ways to deploy a large number of PATROL agents, including creating a silent installation package using CMA, using RSCD agents in BMC Server Automation (BSA), and using the old PATROL Distribution Server.  But if you just need to install a few PATROL agents in a development environment to understand how BPPM 9.5 works, you can use the PATROL installer to install each PATROL agent interactively.

To get PATROL data and events into BPPM server, the first step in PATROL configuration is to specify its data and event destinations.  Since BPPM 9.5 allows you to send both PATROL data and events to the same port on integration service node (ISN), your typical pconfig setting would be:

"/EventSetup/Configuration/Format" = { REPLACE = "BiiP3" },
"/EventSetup/Configuration/Key" = { REPLACE = "mc" }
"/EventSetup/Configuration/ForwardEventsToIS" = { REPLACE = "Yes" }
"/AgentSetup/integration/integrationServices"={REPLACE="tcp:ISN1:3183,tcp:ISN2:3183"} 

where ISN1 and ISN2 are the host names of the first and second ISN in the high-availability cluster.

If you don't want to send PATROL events to the same port as PATROL data, you can still use what you had in BPPM 9.0 by replacing
"/EventSetup/Configuration/ForwardEventsToIS" = { REPLACE = "Yes" }
with
"/EventSetup/Configuration/EventCells" = { REPLACE = "ISN1/1828,ISN2/1828" }

Once the above configuration is in PATROL agent, you will see all PATROL agents linked with their ISNs in the CMA GUI.

Next, you will need to add tags to PATROL agent configuration.  Here is the format for a tag in pconfig:

"/AgentSetup/Identification/Tags/TAG/<tag_name>" = {REPLACE = "Tag description"}

For example, you may want to define a tag for Windows OS base monitoring.  You will have the following variable in pconfig:

"/AgentSetup/Identification/Tags/TAG/WinBase" = {REPLACE = "Windows OS Base Monitoring"}

where 'WinBase' is the tag name you created; it has to match a tag specified in a CMA policy in order to receive the configurations contained in that policy.  "Windows OS Base Monitoring" is a description you give to this tag, and it does not have to match anything.

When the tag in PATROL agent pconfig matches the tag in a CMA policy, CMA will immediately push all the configurations contained in that policy to the PATROL agent.  Whenever any configuration is updated in a policy, all PATROL agents with matching tags will receive the same update immediately.  No PATROL agent restart is needed.