Willa Ou's BMC TrueSight/BPPM Blog

BPPM (BMC ProactiveNet Performance Management), rebranded as TrueSight Operations Management, is the latest enterprise system management suite from BMC Software. It combines the data analytics engine from ProactiveNet, the event processing engine from BMC Event Manager (BEM), and the server/application monitor from PATROL into one product. This blog is intended to share information and experience on TrueSight/BPPM implementation, customization, and integration.

Tuesday, January 27, 2015

Total Cost of Ownership of BPPM - Part 16: Best practice - Consistency

In my last post, I talked about documentation. As important as it is, it is impossible to document every little thing in your BPPM implementation. Being consistent in your BPPM implementation can reduce the ongoing maintenance and support cost without excessive documentation. A consistent approach can go a long way to reduce the total cost of ownership of BPPM.

Recently I talked to a consulting firm that had just won a BPPM migration project. They decided to use their budget to hire an intermediate-level consultant and an entry-level consultant and send them both to work on the project at the same time. I personally would rather use the same budget to hire one expert-level consultant, not only to reduce communication channels, but also to enforce consistency across the entire project.

What kind of consistency are we talking about? Here are just a few examples to get you thinking.

1) Design consistency: Do you have a UAT system that is set up as close as possible to your production system? If you encounter errors in your production system, can you quickly reproduce, debug, fix, and test them in your UAT system? Do you have a consistent naming convention for the various components (cell names, file names, locations, etc.) in your environment, so your maintenance staff know exactly what and where each component is without checking the documentation every time?

2) Configuration consistency: When you create monitoring policies in CMA, do you always follow the same approach, such as a base policy followed by special-case overrides? Do you limit each policy to one monitoring solution? Do you name all your policies consistently so that your maintenance staff can tell what each policy is for without taking 5+ minutes to read it? Do you keep the same copy of mcell.dir for all your BPPM cells? Do you try to control all your event blackouts in one place instead of everywhere in your system?

3) Coding consistency: When you develop MRL rules for a BPPM cell, do you limit each MRL file to one rule phase? Do you always prefix your custom files to distinguish them from BPPM out-of-box files? For events you no longer need, do you choose consistently among a filter rule, a refine rule with 'drop_new', and a refine rule with '$EV.status = CLOSED'?

4) Documentation consistency: Do you use the same templates when writing documents? Do you always provide release notes and a deployment guide? Do you use screenshots consistently? Do you always provide verification steps?

Monday, January 19, 2015

Total Cost of Ownership of BPPM - Part 15: Best practice - Documentation

If you are the one who implements BPPM for your organization or for your client, do you take your time to write accurate and detailed documents for every project? If you are the one who manages BPPM implementation for your organization, do you require your employees or consultants to deliver accurate and detailed documents? Most importantly, do you budget enough time for your employees or consultants to write documents?

Documentation is one of the most overlooked parts of a BPPM implementation, especially when you are under pressure to meet an aggressive deadline. I have been on several 'rescue' missions where things fell apart after the person who did the implementation left the organization. With very little documentation, I had to reverse engineer the implementation by going through the entire source code, configuration files, and trace files, and by running my own test cases, just to understand how things were implemented. After that, I would write the missing documents based on my understanding.

Would it be worth taking a few days writing accurate and detailed documents to avoid weeks and months of headache and reverse engineering work?

So what documents do we need in each BPPM project?

1) Architecture Diagrams: For a new BPPM implementation, draw architecture diagrams showing which BPPM components are located on each server. For an upgrade or any later project, you only need to redraw them if the project changed the architecture.

2) Release Notes: No matter how small or how big your project is, you always need to write release notes. Release notes should cover the why (business requirements) and the what (features added, enhanced, or deleted) of the project. You can also include a high-level how to help readers understand the features involved. Include assumptions, limitations, and supported environments if applicable. Release notes are usually short, between one and three pages.

3) Deployment Guide: Another mandatory document is the deployment guide. How can another person deploy the new features you just added to a new system when you are not around? Where are all your files located? What are the prerequisites? What are the verification steps that show the deployment was successful? Be precise about the order of deployment steps. Be as detailed as possible. I personally like to include screenshots and provide examples. Some of my deployment guides are as long as 250 pages. A deployment guide ensures all deployments are repeatable.

4) Design Specification: If your project involves development, such as custom PATROL knowledge modules or BPPM cell rules/policies, it is a good idea to include a design specification. You don't need to go down to the pseudocode level, but do include clear comments in your code. Highlighting important formulas and the communication between different components can make code maintenance by another person a lot easier.

5) Operations Guide: If you expect operations or administration staff to perform tasks that are not covered in the BPPM documents from BMC, you need to document how to perform those tasks in an operations guide. For example, how do you re-assign the integration service for each PATROL agent to re-balance the load? How do you re-sync mcdb between primary and secondary cells if they are out of sync? An operations guide can also include troubleshooting procedures and frequently asked questions.

Monday, January 12, 2015

Total Cost of Ownership of BPPM - Part 14: Best practice - Capture PATROL agent up/down events

One important aspect of reducing the total cost of ownership of BPPM is being able to monitor its own components. One of the most important components in the BPPM framework is the PATROL agent. If a PATROL agent goes down, you want to be alerted immediately.

There are many ways to monitor PATROL agents, both inside and outside the BPPM framework. Inside the BPPM framework, an event is automatically generated when a PATROL agent goes down or starts up.

The good news is that this is out-of-box behavior. You don't have to perform any extra configuration. The bad news is that the generated events differ considerably from one BPPM version to another. To make it worse, in BPPM both the severity and the message require some modification before they can be understood in an alert.

Let's take a close look at what the PATROL agent up/down events are in each BPPM version, and how we should modify them.

If you are still using the old bii4P3 to forward your PATROL agent events to a BEM/BPPM cell, then regardless of which version of the cell you use, you can send the alert directly without any modification, unless you want to add the port number to the message. In addition, the default cell policy will close the corresponding agent down event upon receiving an agent up event.

Your PATROL agent down event with bii4P3:
MC_ADAPTER_CONTROL;
      severity=WARNING;
      mc_object='server1:3181';
      msg='Agent Connection -server1- down.';
END

Your PATROL agent up event with bii4P3:
MC_ADAPTER_CONTROL;
      severity=OK;
      mc_object='server1:3181';
      msg='Agent Connection -server1- open.';
END

If you are running BPPM 9.0 and sending PATROL events directly to BPPM cell, you will need to modify the severity and message. In addition, there is no default cell rule or policy to close the corresponding agent down event upon receiving an agent up event. You will need to write your own.

Your PATROL agent down event in BPPM 9.0:
MC_CELL_HEARTBEAT_FAILURE;
      severity=WARNING;
      cell='PatrolAgent@server1@192.168.2.12:3181';
      msg='Monitored Cell is no longer responding';
END

You need to change the msg to 'PatrolAgent@server1@192.168.2.12:3181 is no longer responding' because it is actually the PATROL agent not responding, not the cell.

Your PATROL agent up event in BPPM 9.0:
MC_CELL_HEARTBEAT_ON;
      severity=INFO;
      cell='PatrolAgent@server1@192.168.2.12:3181';
      msg='Monitored Cell is up again';
END

You need to change the msg to 'PatrolAgent@server1@192.168.2.12:3181 is up again' because it is actually the PATROL agent up again, not the cell. In addition, you need to change severity to OK in order to write your own 'up event closing down event' rule in BPPM cell.
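The two BPPM 9.0 changes described above can be sketched in a few lines of Python. This is only an illustration of the logic, not cell code: in a real cell the change would be made in your own rule or policy. The dictionary layout, the "class" key, the function name, and the check for a 'PatrolAgent@' prefix (to leave genuine cell heartbeat events alone) are all assumptions made for this example.

```python
def normalize_heartbeat_event(event):
    """Return a copy of a BPPM 9.0 agent heartbeat event with a clearer msg/severity."""
    fixed = dict(event)
    agent = event.get("cell", "")
    # Assumption: only events whose cell slot names a PATROL agent are rewritten.
    if not agent.startswith("PatrolAgent@"):
        return fixed
    if event.get("class") == "MC_CELL_HEARTBEAT_FAILURE":
        fixed["msg"] = f"{agent} is no longer responding"
    elif event.get("class") == "MC_CELL_HEARTBEAT_ON":
        fixed["msg"] = f"{agent} is up again"
        # OK severity lets an 'up closes down' rule recognize the up event.
        fixed["severity"] = "OK"
    return fixed


up = {
    "class": "MC_CELL_HEARTBEAT_ON",
    "severity": "INFO",
    "cell": "PatrolAgent@server1@192.168.2.12:3181",
    "msg": "Monitored Cell is up again",
}
print(normalize_heartbeat_event(up)["msg"])
```

The down event keeps its WARNING severity here because the post only asks for its message to change.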

If you are running BPPM 9.5/9.6 and sending PATROL events either through integration service or directly to BPPM cell, you will need to modify the severity and message. In addition, there is no default cell rule or policy to close the corresponding agent down event upon receiving an agent up event. You will need to write your own.

Your PATROL agent down event in BPPM 9.5/9.6:
PATROL_EV;
      severity=INFO;
      mc_origin='server1:3181';
      msg='Start/stop status of agent 'server1' is '0'. Restart flag (0)';
END

This looks worse than in previous versions. :-( You need to change the msg to 'PATROL agent on server1:3181 stopped'. You also need to change the severity to CRITICAL or WARNING.

Your PATROL agent up event in BPPM 9.5/9.6:
PATROL_EV;
      severity=INFO;
      mc_origin='server1:3181';
      msg='Start/stop status of agent 'server1' is '1'. Restart flag (0)';
END

As with the down event, you need to rewrite the message: change the msg to 'PATROL agent on server1:3181 started'. In addition, you need to change the severity to OK in order to write your own 'up event closing down event' rule in the BPPM cell.
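For the BPPM 9.5/9.6 events, the rewrite depends on the status digit embedded in the original message. The following Python sketch shows that logic together with a simple 'up event closes down event' step. It is not MRL and not BMC's implementation; the dictionary layout, the "class" and "status" keys, the function names, and the choice of CRITICAL as the default down severity are assumptions for illustration.

```python
import re

# Matches the out-of-box 9.5/9.6 message and captures the start/stop digit.
STATUS_RE = re.compile(r"Start/stop status of agent '[^']+' is '(?P<status>[01])'")


def normalize_patrol_ev(event, down_severity="CRITICAL"):
    """Return a copy of the event with a readable msg and a usable severity."""
    fixed = dict(event)
    match = STATUS_RE.search(event.get("msg", ""))
    if event.get("class") != "PATROL_EV" or match is None:
        return fixed  # not an agent start/stop event, leave it untouched
    origin = event["mc_origin"]
    if match.group("status") == "0":
        fixed["msg"] = f"PATROL agent on {origin} stopped"
        fixed["severity"] = down_severity
    else:
        fixed["msg"] = f"PATROL agent on {origin} started"
        fixed["severity"] = "OK"
    return fixed


def close_matching_down_events(open_events, up_event):
    """Mark open 'stopped' events from the same agent as CLOSED; return how many."""
    closed = 0
    for ev in open_events:
        same_agent = ev.get("mc_origin") == up_event.get("mc_origin")
        is_down = ev.get("msg", "").endswith("stopped")
        if same_agent and is_down and ev.get("status") == "OPEN":
            ev["status"] = "CLOSED"
            closed += 1
    return closed
```

Matching on mc_origin (host and port) rather than on the hostname alone keeps agents running on different ports of the same server separate.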

Monday, January 5, 2015

Total Cost of Ownership of BPPM - Part 13: Best practice - When not to use out-of-box default values

Happy New Year to everyone!

What an exciting year 2014 was with both BMC BPPM 9.5 and BMC TrueSight 9.6 going strong! I want to take this opportunity to thank you for your continuous trust and support. May 2015 bring you more success and happiness.

To continue our discussion of best practices regarding the total cost of ownership of BPPM, we are going to talk about when you should not use out-of-box default values in your BPPM implementation. Since there are almost endless situations where using the default values is not a good idea, let's take a look at three different examples.

The first example is the installation directory path. The default installation directory path for BPPM is C:\Program Files\BMC Software\. The spaces in this path can make your customization scripts fail. The long path name can cause file names in deeply nested sub-directories to be truncated. In addition, you should use a disk drive other than C to lower BPPM's performance impact on the operating system. Assuming you have an E drive available, a much better installation directory path for BPPM would be: E:\BMC\.
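A quick way to catch the first and third issues before an install is a small check like the Python sketch below. The function name and the wording of the warnings are made up for this example; it only looks at spaces and the drive letter, and it does not try to judge path length.

```python
from pathlib import PureWindowsPath


def install_path_warnings(path):
    """Return a list of warnings for a proposed Windows installation path."""
    warnings = []
    if " " in path:
        warnings.append("path contains spaces")
    # PureWindowsPath parses Windows paths on any platform.
    if PureWindowsPath(path).drive.upper() == "C:":
        warnings.append("path is on the system drive C:")
    return warnings


print(install_path_warnings(r"C:\Program Files\BMC Software"))
print(install_path_warnings(r"E:\BMC"))
```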

The second example is the remote cell name. The default remote cell name supplied by the BPPM installer is the hostname of the server where the cell is being installed. If you are setting up your remote cells in an application cluster, you need to use the same cell name for the backup cell running on another host. For example, when you install remote cells in an application cluster on servers A and B and take the default cell name, the cell is named 'A' on both server A and server B. This causes a lot of confusion when configuring the event sources that send events to this cell, not to mention that most hostnames are very hard to remember. A better way is to name your remote cells based on environment, function, and/or H/A type. For example, I would name a remote cell that receives Entuity events in an application cluster in the production environment PRD_ENTUITY_HA.
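One way to keep such names consistent is to generate them instead of typing them by hand. The Python helper below is a hypothetical sketch of the environment/function/H/A pattern from the example above; the function name and its parameters are assumptions, not part of any BPPM tool.

```python
def remote_cell_name(environment, function, ha=False):
    """Build a cell name such as PRD_ENTUITY_HA from its parts."""
    parts = [environment, function]
    if ha:
        parts.append("HA")
    # Upper-case and join with underscores so every name follows one pattern.
    return "_".join(part.strip().upper() for part in parts)


print(remote_cell_name("prd", "Entuity", ha=True))
```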

The third example is the event database size for a remote cell. By default, the event database size in mcell.conf is set to 360,000 events (EventDBSize=360000). Each PATROL event is about 4 KB. With 360,000 events, the maximum event DB size would be 360,000 * 4 KB = 1.44 GB. If this size is more than half of the RAM on the server, you will most likely see the cell's state builder run for a long time, and the cell will be unresponsive while its state builder is running. Make sure the maximum event DB size (EventDBSize * 4 KB) is less than 50% of the RAM size. If you are running multiple cells or sharing the server with the BPPM integration service, you need to be even more conservative.
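The arithmetic above is easy to script when you size several cells. The Python sketch below assumes roughly 4,000 bytes per event, as estimated in this post, and uses decimal gigabytes; the function names and the ram_fraction parameter are made up for this example.

```python
def max_event_db_bytes(event_db_size, avg_event_bytes=4000):
    """Worst-case event DB size in bytes for a given EventDBSize."""
    return event_db_size * avg_event_bytes


def event_db_size_ceiling(ram_bytes, avg_event_bytes=4000, ram_fraction=0.5):
    """EventDBSize at which the event DB would reach ram_fraction of RAM."""
    return int(ram_bytes * ram_fraction) // avg_event_bytes


default_bytes = max_event_db_bytes(360_000)
print(default_bytes / 1e9, "GB")  # size with the default EventDBSize
print(event_db_size_ceiling(2_000_000_000))  # ceiling for a server with 2 GB RAM
```

Because the guideline is to stay below 50% of RAM, treat the returned ceiling as an upper bound and pick a smaller value, especially when the server hosts more than one cell.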
