Willa Ou's BMC TrueSight/BPPM Blog

BPPM (BMC ProactiveNet Performance Management) or TrueSight Operations Management (the rebranded name) suite is the latest solution from BMC Software for enterprise system management. It combines the data analytic engine from ProactiveNet, the event processing engine from BMC Event Manager (BEM), and the server/application monitor from PATROL into one product. This blog is intended to share information and experience on TrueSight/BPPM implementation, customization, and integration.

Monday, May 18, 2015

Understand BPPM As A Decision Maker - Part 8: Implementation - combination of static and dynamic thresholds

We have gone through the details of static and dynamic thresholds in the last two posts. In addition to setting static thresholds and dynamic thresholds separately, you can also combine them on a BPPM server to add more flexibility to your threshold settings.

The first option is to add a dynamic adjustment to a static threshold. To do that, you must set your static threshold at the BPPM server, not at each PATROL agent. In addition to the severity, duration, and threshold value that are included in a normal static threshold, you can add a dynamic adjustment by specifying that the threshold violation must also fall outside a baseline. You can select the auto, hourly, daily, weekly, hourly+daily, or all baselines.

An example of the first option would be to set up a threshold for the number of login errors in the last data collection. If you want to set a static threshold as 3, you may want to add 'outside auto baseline' as a dynamic adjustment so that the alert won't be raised if the baseline during that time of day (such as 9am) is 4.
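The decision logic in this example can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not BPPM's actual implementation: the function name is hypothetical, and representing the baseline as a single upper value for the time of day is a simplifying assumption.

```python
def static_with_baseline_alert(value, static_threshold, baseline_high):
    """Alert only when the value violates the static threshold AND
    is also above the (assumed) upper baseline value for this time of day."""
    violates_static = value > static_threshold
    outside_baseline = value > baseline_high
    return violates_static and outside_baseline


# Static threshold of 3 login errors, baseline of 4 at 9am:
print(static_with_baseline_alert(4, 3, 4))  # False: above 3, but within the baseline
print(static_with_baseline_alert(6, 3, 4))  # True: above 3 and outside the baseline
```

With 4 login errors the static threshold alone would have fired, but the baseline adjustment suppresses the alert because 4 is normal for that time of day.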

The second option is to add a static adjustment to a dynamic threshold. In addition to severity, duration, baseline, sampling window, absolute deviation, and percent deviation, you can add a static adjustment by specifying that the threshold violation must also violate a static threshold value.

An example of the second option would be to set up a threshold for CPU utilization. If you want to set a dynamic threshold as outside of auto baseline for 10 minutes with a percent deviation of 15%, you may want to add a threshold value of 50 as a static adjustment so that the alert won't be raised when the CPU utilization is 45% for 10 minutes even though the baseline is 35%.
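The CPU example can be sketched the same way. This is a simplified illustration, not BPPM's algorithm: the function name is hypothetical, the percent deviation is assumed to be applied relative to the baseline value, and the 10-minute duration is left out to keep the focus on the two conditions.

```python
def dynamic_with_static_alert(value, baseline, percent_deviation, static_threshold):
    """Alert only when the value is outside the baseline by more than the
    percent deviation AND also above the static threshold value."""
    # Integer-friendly form of: value > baseline * (1 + percent_deviation / 100)
    outside_baseline = value * 100 > baseline * (100 + percent_deviation)
    violates_static = value > static_threshold
    return outside_baseline and violates_static


# Baseline 35%, percent deviation 15%, static adjustment 50:
print(dynamic_with_static_alert(45, 35, 15, 50))  # False: outside baseline, but below 50
print(dynamic_with_static_alert(55, 35, 15, 50))  # True: outside baseline and above 50
```

At 45% the dynamic condition is met (45 is more than 15% above 35), but the static adjustment of 50 keeps the alert from being raised.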

You may wonder what the difference is between the first option and the second option. When should you use static threshold with dynamic adjustment and when should you use dynamic threshold with static adjustment?

Dynamic threshold with static adjustment contains deviation in absolute value and in percent value. This feature is not available with static thresholds. Using deviation in a dynamic threshold gives you a cushion or buffer when comparing to a baseline. I personally find this feature very useful and I use deviation in most of my dynamic thresholds with or without static adjustments.

Static threshold with dynamic adjustment contains a 'predict' feature. This feature is not available with dynamic thresholds. Using the 'predict' feature in a static threshold allows you to receive a predictive alert when an attribute with a fixed capacity is approaching its limit. This is very useful for attributes such as disk space utilization.

As a decision maker, you will need to determine if you need to combine static thresholds and dynamic thresholds to add more flexibility to your thresholds. If so, you will also need to decide which way to go: to add dynamic adjustment to a static threshold, or to add static adjustment to a dynamic threshold.

Tuesday, May 12, 2015

Understand BPPM As A Decision Maker - Part 7: Implementation - dynamic thresholds

As mentioned previously, a dynamic threshold doesn't have an absolute value by itself. The threshold value is calculated on the fly based on historical data values from a specified time period (also called baseline). A dynamic threshold needs to contain the following details:

1) Duration: How long does the threshold need to be violated before an alert will be raised? By default, the duration is 0, meaning as soon as the threshold is violated an alert will be raised immediately.

2) Baseline: You can choose hourly, daily, weekly, hourly & daily, and all baselines. The default is auto baseline, meaning that the BPPM server will automatically choose the best baseline for you.

3) Sampling Window: How long must a parameter/attribute value be collected before an alert can be raised? The default is 10 minutes or 5 data points, whichever is longer.

4) Absolute Deviation: By how much, in absolute value, must the parameter/attribute value be above or below the threshold before an alert can be raised? The default is 1.

5) Percent Deviation: By how much, in percentage, must the parameter/attribute value be above or below the threshold before an alert can be raised? The default is 5%.

For example, you may want to set a dynamic threshold for your web transaction response time as follows: 1) Duration = 5 minutes; 2) Auto baseline; 3) Sampling window = 10 minutes; 4) Absolute Deviation = 1; 5) Percent Deviation = 40%. If it normally takes 5 seconds to complete a web transaction during the same time of the day, but now it takes 7 seconds (40% more than 5 seconds) consistently for the last 5 minutes, an alert will be raised.
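To make the response time example concrete, here is a minimal Python sketch of how duration and deviation could work together. It is an illustration under stated assumptions, not BPPM's implementation: the function names are hypothetical, the baseline is a single fixed number, both deviation settings are assumed to be required at the same time, the comparisons are inclusive so that exactly 7 seconds counts as in the example, and the sampling window is not modeled.

```python
def violates(value, baseline, abs_deviation, percent_deviation):
    """True if the value is above the baseline by at least both deviations."""
    enough_absolute = value - baseline >= abs_deviation
    enough_percent = value * 100 >= baseline * (100 + percent_deviation)
    return enough_absolute and enough_percent


def should_alert(samples, baseline, abs_deviation, percent_deviation, duration_min):
    """samples is a time-ordered list of (minute, value) pairs.
    Alert when the most recent unbroken run of violations spans the duration."""
    if not samples:
        return False
    latest_minute = samples[-1][0]
    run_start = None
    for minute, value in reversed(samples):
        if not violates(value, baseline, abs_deviation, percent_deviation):
            break
        run_start = minute
    if run_start is None:
        return False
    return latest_minute - run_start >= duration_min


# Baseline 5 seconds, absolute deviation 1, percent deviation 40%, duration 5 minutes
slow = [(0, 5), (1, 7), (2, 7), (3, 7), (4, 7), (5, 7), (6, 7)]
print(should_alert(slow, 5, 1, 40, 5))  # True: 7 seconds from minute 1 through minute 6
```

A short spike, or a run of slow transactions interrupted by a normal sample, does not raise an alert in this sketch because the violation has not lasted the full duration.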

As with a static threshold, a dynamic threshold can also have three different scopes: global, local, and instance.

Dynamic thresholds can only be set at BPPM server. You can choose to use either BPPM operations console or CMA to set a dynamic threshold. In BPPM operations console, you can use either options menu or tools menu. In CMA, you can use global thresholds method or CMA policies. If you use both BPPM operations console and CMA, the thresholds will be combined. In case of conflict, the thresholds set by CMA will override the thresholds set by BPPM operations console.

In order to set a dynamic threshold in BPPM server, the parameter/attribute value must be stored in BPPM server database, meaning that the data must be streamed. By default, all PATROL data are streamed to BPPM server database. But you may want to filter out some data so that you do not exceed the capacity of 1.7 million attributes per BPPM server.

As a decision maker, you will need to come up with detailed specification (duration, baseline, sampling window, absolute and percentage deviation) after you decide on using dynamic thresholds for some data. You will need to decide if you need local or instance thresholds in addition to global thresholds. You will also need to decide which method to use - BPPM operations console or CMA.

Tuesday, May 5, 2015

Understand BPPM As A Decision Maker - Part 6: Implementation - static thresholds

In the previous post, we discussed static thresholds and dynamic thresholds in general. Since there are many different variations of static thresholds, we are going to look into the details.

A static threshold can have three different scopes: global, local, and instance.

A static threshold with global scope applies to all servers and all instances in your environment. For example, a global critical threshold with service status = 3 means that if the parameter status is equal to 3 for any service running on any server, a critical alert will be raised.

A static threshold with local scope applies to one particular server. For example, a local critical threshold with free disk space percentage <15 means that if the parameter 'free disk space percentage' is below 15% for any disk running on this particular server, a critical alert will be raised. A local threshold will always override the global threshold. In this example, the global threshold could be free disk space percentage <10. But because the applications running on this particular server tend to fill up disk space much faster than other servers, you may want to use a more conservative local threshold.

A static threshold with instance scope applies to one particular instance. For example, an instance critical threshold with free disk space percentage <20 means that if the parameter 'free disk space percentage' is below 20% for one particular disk (e.g. C drive) running on any server, a critical alert will be raised. An instance threshold will always override the global threshold. In this example, the global threshold could be free disk space percentage <10. But because C drive is usually smaller and more critical to keep the server up than other drives, you may want to use a more conservative instance threshold.

As we mentioned in the previous post, a static threshold can be configured at each PATROL agent or at BPPM server or at both places. And BPPM does not relate the static thresholds configured at each PATROL agent with the ones at BPPM server. If you decide to configure static thresholds at both PATROL agents and BPPM server, you need to keep track of them manually so there won't be any gap or overlap.
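The three scopes can be pictured as a simple lookup. The Python sketch below is an illustration only: the function, server, and disk names are hypothetical, and the order between instance and local scope (instance checked first) is my assumption, since the rule stated above only says that local and instance thresholds each override the global one.

```python
def effective_threshold(server, instance, global_value, local_overrides, instance_overrides):
    """Return the free-disk-space threshold (in percent) that applies.
    Assumption: the most specific scope wins (instance, then local, then global)."""
    if instance in instance_overrides:
        return instance_overrides[instance]
    if server in local_overrides:
        return local_overrides[server]
    return global_value


GLOBAL_FREE_DISK = 10                 # global: alert below 10% free
LOCAL = {"appserver01": 15}           # hypothetical server that fills up quickly
INSTANCE = {"C:": 20}                 # C drive on any server

print(effective_threshold("web01", "D:", GLOBAL_FREE_DISK, LOCAL, INSTANCE))        # 10
print(effective_threshold("appserver01", "D:", GLOBAL_FREE_DISK, LOCAL, INSTANCE))  # 15
print(effective_threshold("web01", "C:", GLOBAL_FREE_DISK, LOCAL, INSTANCE))        # 20
```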

You may want to ask: Why not just configure all static thresholds in BPPM server? There are two major limitations for this approach.

The first limitation is that each BPPM server can only store 1,700,000 attributes/parameters in its database. If you have a large environment, you can only store a small subset of your parameters in BPPM server database. In order to configure a static threshold for a parameter in BPPM server, this parameter must be stored in BPPM server database.

The second limitation is that BPPM server still doesn't have an application-level quick fail-over architecture. If BPPM server becomes unavailable, no threshold can be applied and thus no alert can be raised until the OS-based secondary BPPM server is up - which usually takes 10 minutes or longer.

Some BMC customers with small environment in non-critical business did choose to configure all static thresholds in BPPM server. So if that is doable in your environment, you can absolutely configure all static thresholds in BPPM server.

There is another aspect of static thresholds that you can set: duration - how long the threshold has to be violated before raising the alert.

If you set static thresholds at each PATROL agent, the duration is represented by the number of polling cycles. To set your desired duration for a particular parameter, you must know the default polling cycle for that parameter and reset the polling cycle if the default one does not meet your needs. The polling cycle for a parameter determines how often (in seconds) the parameter value will be collected. The combination of polling cycle and the number of polling cycles determines the threshold duration in seconds.

If you set static thresholds at BPPM server, the duration is represented by the number of minutes, so the polling cycle is not needed.
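The agent-side arithmetic is simple enough to sketch in Python. The function names and the 300-second polling cycle are hypothetical values chosen for illustration; check the actual polling cycle of each parameter in your environment.

```python
def agent_duration_seconds(polling_cycle_seconds, number_of_cycles):
    """Threshold duration at a PATROL agent = polling cycle x number of cycles."""
    return polling_cycle_seconds * number_of_cycles


def cycles_for_duration(polling_cycle_seconds, desired_seconds):
    """Smallest number of polling cycles that covers the desired duration."""
    return -(-desired_seconds // polling_cycle_seconds)  # integer ceiling division


print(agent_duration_seconds(300, 3))  # 900 seconds, i.e. 15 minutes
print(cycles_for_duration(300, 600))   # 2 cycles for a 10-minute duration
```

If the desired duration is not a multiple of the polling cycle, you either round up the number of cycles or change the polling cycle itself, which is the reason you need to know the default polling cycle first.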

Finally if you set static thresholds at each PATROL agent, you can choose to use either pconfig method or CMA method. In pconfig method, you use either PCM (PATROL Configuration Manager) or pconfig scripts. In CMA method, you use CMA policies. If you use both, the thresholds will be combined. In case of conflict, the thresholds set by CMA method will override the thresholds set by pconfig method.

If you set dynamic thresholds at BPPM server, you can choose to use either BPPM operations console or CMA. In BPPM operations console, you can use either options menu or tools menu. In CMA, you can use global thresholds method or CMA policies. If you use both BPPM operations console and CMA, the thresholds will be combined. In case of conflict, the thresholds set by CMA will override the thresholds set by BPPM operations console.

As a decision maker, you can tell by now that there are a lot more decisions to make after you decide on using static thresholds for some data. You will need to decide if you need local or instance thresholds in addition to global thresholds. You will need to decide where you want to set them - at each PATROL agent or at BPPM server. You will need to decide threshold durations. To set static thresholds in PATROL agents, you will need to decide which method to use - pconfig or CMA. To set dynamic thresholds in BPPM server, you will need to decide which method to use - BPPM operations console or CMA.

Tuesday, April 28, 2015

Understand BPPM As A Decision Maker - Part 5: Implementation - thresholds

Once you have decided what to do with installation, console, and data, your next decision is about thresholds. In my personal opinion, thresholds are the heart of enterprise system management because they determine what alerts and how many alerts you are going to receive.

A threshold is associated with a direction (above or below) and a severity. If a threshold's direction is above and the severity is critical, when the parameter/attribute value is above the threshold, the threshold is violated and a critical alert is generated. If a threshold's direction is below and the severity is warning, when the parameter/attribute value is below the threshold, the threshold is violated and a warning alert is generated.
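As a minimal sketch of this rule in Python (the function name and the sample values are hypothetical, and this is not how BPPM or PATROL is coded internally):

```python
def evaluate_threshold(value, threshold, direction, severity):
    """Return the alert severity if the threshold is violated, otherwise None."""
    if direction == "above":
        violated = value > threshold
    elif direction == "below":
        violated = value < threshold
    else:
        raise ValueError("direction must be 'above' or 'below'")
    return severity if violated else None


print(evaluate_threshold(95, 90, "above", "critical"))  # critical
print(evaluate_threshold(12, 15, "below", "warning"))   # warning
print(evaluate_threshold(50, 90, "above", "critical"))  # None
```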

If you are familiar with thresholds in general, keep in mind that thresholds in BPPM are more complicated than in most other enterprise system management software for historical reasons.

The data collection agent PATROL was initially architected in 1995 as a completely self-contained system with its own local storage, thresholds, and alert system. In other words, thresholds can be set in each PATROL agent.

When BMC merged PATROL, BMC Event Manager, and ProactiveNet into BPPM as one product, all or part of the data collected by PATROL agents can be stored in BPPM server database as a duplicate. Therefore, thresholds can also be set at BPPM server in addition to each PATROL agent.

There are two kinds of thresholds: static thresholds and dynamic thresholds. Static thresholds can be set in either PATROL agents or in BPPM server or both. Static thresholds set in PATROL agents and in BPPM server work independently. If you set static thresholds in both places, it is a manual effort to make sure there is no gap and no overlap between them. Dynamic thresholds can only be set in BPPM server.

A static threshold has an absolute value. For example, you can set your free disk percentage threshold at 10% so that you will receive an alert when you have less than 10% free disk space left. There are many different variations of static thresholds that we will discuss in detail in the next post.

A dynamic threshold doesn't have an absolute value by itself. The threshold value is calculated on the fly based on historical data values from a specified time period (for example, hourly, daily, weekly, etc.). A dynamic threshold is also called a baseline. For example, you can set your CPU utilization threshold as 10% above hourly baseline so that you will receive a high CPU alert when your current CPU utilization is more than 10% above the historical average value for the same period of the day. There are many different variations of dynamic thresholds, and a dynamic threshold can also be combined with a static threshold to make it more flexible. We will discuss dynamic thresholds in more detail in a later post.

As a decision maker, the first thing to do is to determine on what types of data you want to set static thresholds and on what types of data you want to set dynamic thresholds. Examples of data types include: availability (status), number of errors/failures, percentage of errors/failures, percentage of capacity utilization, resource utilization per server, resource utilization per component, response time, and wait time. Hold a brainstorming session with your implementation team and your end users to list all types of data collected in your environment. Then make a decision on how you want to set their thresholds.
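The CPU example can be sketched in Python as follows. This is a deliberately simplified picture: the baseline here is just the plain average of past values for the same hour, and '10% above' is read as 10% relative to that average. Both are assumptions for illustration; the baseline calculation in BPPM itself is more involved, and the function names and sample data are hypothetical.

```python
from statistics import mean


def hourly_baseline(history, hour):
    """Average of the historical values recorded for the given hour of the day."""
    return mean(history[hour])


def is_high_cpu(current, history, hour, percent_above=10):
    """True if the current value is more than percent_above percent over the baseline."""
    baseline = hourly_baseline(history, hour)
    return current * 100 > baseline * (100 + percent_above)


# CPU utilization seen at 9am on previous days (hypothetical data)
history = {9: [40, 50, 60]}
print(is_high_cpu(56, history, 9))  # True: the baseline is 50, so the limit is 55
print(is_high_cpu(54, history, 9))  # False
```

The same 56% reading could be perfectly normal at another hour with a higher baseline, which is the point of using a dynamic threshold instead of a fixed value.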

Monday, April 20, 2015

Understand BPPM As A Decision Maker - Part 4: Implementation - console and data

Once you have decided what to do with installation of BPPM server(s), BPPM integration service(s), BPPM cells, and PATROL local/remote agents, your next decision is about console - the user interface that your operations support staff will use to interact with BPPM on a daily basis. And a related decision is about data - what data you want to save in BPPM server database.

When you install BPPM server, you automatically install BPPM operations console, a web interface that displays data and events. From a historical point of view, BPPM operations console evolved from the native ProactiveNet console, as BPPM is a product integrated from PATROL, BMC Event Manager, and ProactiveNet. What about native PATROL console? Do you need to install it?

It is BMC's intention to replace PATROL console with BPPM operations console. At this time, BPPM operations console still cannot completely replace all the features available in PATROL console, though your operations support staff can perform most of their work with BPPM operations console. However, menu commands that help you configure PATROL KMs interactively or diagnose system issues are only available with PATROL console.

A key differentiator here is where data are stored. PATROL was initially architected back in 1995 when networks were not as fast and reliable as they are today. All PATROL data are stored locally. When PATROL was integrated into BPPM, it became possible to save selected or all PATROL data in BPPM server database.

When you are using PATROL console, you are viewing PATROL data stored on each PATROL agent. When you are using BPPM operations console, you are viewing PATROL data stored in BPPM server database. You can use BPPM operations console to view PATROL data stored on each PATROL agent (but not in BPPM server database) as 'non-streamed' data on demand, but based on my observation the data are consistently about 10 minutes old.

Along with a decision on PATROL console, you must decide what PATROL data you want to save in BPPM server database. There are three major reasons to save PATROL data in BPPM server database: 1) To compute dynamic thresholds - Dynamic thresholds will be discussed in a later post; 2) To be included in reports - Data in report database are based on data in BPPM server database; 3) To see real-time (not 10-minute-old) data in BPPM operations console.

It would be nice to save all PATROL data in BPPM server database so that you don't have to decide which data to eliminate. If you have a small site, this is entirely possible. But if you have a large site, you must consider the cost and complexity of multiple BPPM servers as each BPPM server can only contain 1,700,000 attributes or 250,000 instances. It is a trade-off between having more PATROL data in BPPM server database and keeping the cost of BPPM servers under control.
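A rough first estimate of the number of BPPM servers can be sketched from these two limits. The Python below is back-of-the-envelope arithmetic only: the function name and the sample counts are hypothetical, and it ignores headroom for growth, high availability, and any other sizing factor your implementation team would consider.

```python
MAX_ATTRIBUTES_PER_SERVER = 1_700_000
MAX_INSTANCES_PER_SERVER = 250_000


def servers_needed(total_attributes, total_instances):
    """Minimum number of BPPM servers so that neither limit is exceeded."""
    by_attributes = -(-total_attributes // MAX_ATTRIBUTES_PER_SERVER)  # ceiling division
    by_instances = -(-total_instances // MAX_INSTANCES_PER_SERVER)
    return max(by_attributes, by_instances, 1)


print(servers_needed(1_000_000, 100_000))  # 1
print(servers_needed(3_000_000, 200_000))  # 2 (attribute limit is the constraint)
print(servers_needed(1_000_000, 600_000))  # 3 (instance limit is the constraint)
```

Running the numbers both ways matters because either limit can be the one you hit first, depending on how many attributes each monitored instance carries.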

In summary, in order to decide if PATROL console should be implemented, you need to have an in-depth discussion with your implementation team (consultants or employees) and operations staff regarding the requirement for PATROL console for PATROL KM configuration, their preferred tools for troubleshooting, and guidelines on what data should be stored in BPPM server database.

Monday, April 13, 2015

Understand BPPM As A Decision Maker - Part 3: Implementation - installation

Many people tend to believe that BPPM implementation is simply BPPM installation. In fact, installation is only a small part of implementation. We will discuss some key points about installation here and leave other parts of implementation to the next few posts.

Keep in mind that implementation doesn't start with installation. It starts with planning.

1) Do not start installation before capacity planning. You need to decide how many BPPM servers, how many BPPM cells, and how many PATROL agents you need to install. Not only do you need to consider your current business capability, you also need to consider future growth.

2) Do not start installation before high availability planning. You need to decide if you want to have fail-over capability for BPPM servers, BPPM cells, and BPPM remote monitoring agents. If high availability is desired, you also need to decide the level of high availability (application, OS, or VM level). You must understand what kind of protection each level of high availability offers you. High availability decision is first determined by your business requirements and then determined by your implementation budget and maintenance budget.

3) Do not start installation before remote monitoring planning. Although most monitoring happens locally, some monitoring can only be done remotely, such as VMware and ping monitoring, and some monitoring gives you the option between local and remote monitoring, such as OS monitoring. There are pros and cons between local and remote monitoring that you must understand.

4) Do not start installation before you decide how you want to install each PATROL agent. BPPM uses PATROL agent to collect data. Although some collection can happen remotely, the majority of data must be collected locally. Installing hundreds or thousands of PATROL agents is labor-intensive work, though different installation methods have different requirements. You can choose manual installation, CMA based installation, server duplication, using old PATROL Distribution Server, using BMC Client Management software, or using BMC BladeLogic client, etc. You need to understand how long each installation method will take you and if there is an additional software license you must purchase.

5) Do not start installation before you decide how you want to assign agent tags to each PATROL agent. An agent tag tells what information should be monitored on each server. For example, your AIX server with Oracle database running should have an agent tag for UNIX operating system and another agent tag for Oracle database. There are different ways to assign agent tags to each server, including as part of the installation package, as a post-installation script, or through CMA. But you need to have the decision made in advance - this is what I refer to as 'framework'.

6) Do not start installation before you decide on a naming convention for each BPPM component in your environment. Without a consistent way to name each server, cell, integration service, and configuration file in your dev, QA, and production environments, sooner or later you will find yourself in a big mess. This is another thing I refer to as 'framework'.

Your implementation team (consultants or employees) should be able to describe to you various options as well as pros and cons in the above decisions. You, as a decision maker, should make the decision together with your implementation team on whether or not and how you should go with BPPM from resource point of view.

Tuesday, April 7, 2015

Understand BPPM As A Decision Maker - Part 2: Required skills

In the previous post, I listed five aspects to be considered for any enterprise monitoring software. Because each aspect requires a different skill set, it can potentially require a separate resource, though some resources can be shared.

1) Implementation - requires designing, capacity planning, installation, and framework creation skills.
2) Development - requires designing, coding, and framework creation skills.
3) Integration - requires coding, 3rd-party software, and framework creation skills.
4) Administration - requires user interaction, configuration, and framework following skills.
5) Operation - requires user interaction, monitoring, and procedure following skills.

In general, aspects 1), 2) and 3) are the responsibilities of one (or more) implementation team(s). Aspects 4) and 5) are the responsibilities of separate (or combined) administration and operations teams. In this post, I want to spend some time addressing how they relate to each other. To make it more intuitive, I am using my previous 'home construction' example again. In home construction, here is what each aspect looks like:

1) Implementation - drawing blueprints and building a house
2) Development - replacing a standard jacuzzi bathtub with a wheelchair-accessible shower
3) Integration - installing solar panels on the roof
4) Administration - installing ceiling fans or repairing a leaking kitchen sink
5) Operation - cleaning the floor when it is dirty or turning porch lights on/off at dusk/dawn every day

For resources, here is what each aspect looks like:

1) Implementation - builder
2) Development - specialized crew from the same builder
3) Integration - specialized crew from the same builder or another company
4) Administration - handyman
5) Operation - homeowner

By the above comparison, here are some general points to help you understand BPPM as a decision maker:

1) Begin with the end in mind: How do you want to use enterprise monitoring software in your IT organization? For example, do you want your operations team to use this tool to perform root cause analysis, or do you want to leave root cause analysis to the assigned system administrator or DBA? The more you know about how you will be using the monitoring software, the easier your decision will be.

2) Good implementation is crucial: Implementation is like the plumbing of a house. It would be very expensive to add another bathroom if the plumbing wasn't there in the first place. If the implementation is not done correctly, you may have no choice but to redo the entire implementation. For example, one of my previous clients handed me a partially implemented environment where BPPM and Entuity were running on the same server. I could not find any way to repair it and had to re-do the entire implementation on two separate servers.

3) Not all BPPM experiences are the same: BPPM experience in implementation and BPPM experience in administration are two different kinds of experience though they share some common technical skill sets. An implementation team's goal is to create a consistent framework that administration team and operation team can adapt quickly without consulting them so they can move on to another implementation project. An administration team's goal is to use special knowledge in configuration so that they have 'job security' to stay forever.

4) One time vs repetitive work: Aspects 1), 2), and 3) are part of an implementation project. They are a one-time cost. Aspects 4) and 5) are repetitive operations. They are a recurring cost. If an implementation project is done right, it can cut down a tremendous amount of recurring operation cost because the amount of work required for administrators and operators is minimized. The recent trend from my observation seems to indicate that many organizations didn't have their BPPM implementation done right and now they are trying to hire the best full-time administrators to make up for it. Keep in mind, the supply of 'super handyman' is very limited. At some point, it may cost less to re-do the incorrect implementation than to rely on a permanent 'super handyman'.

Monday, March 30, 2015

Understand BPPM As A Decision Maker - Part 1: Introduction

Recently I have received several questions about high-level technical information that would help determine whether BPPM is the right monitoring tool for an IT organization and, if so, how resources should be planned.

Realizing that the majority of the information out there is for marketing purposes, I thought it would be beneficial to address some high-level technical key points in BPPM that marketing people don't know how to talk about and technical people don't want to talk about.

As you may know, BPPM is not one single product but several products integrated (or 'stitched') together. The core components of BPPM include PATROL, BMC Event Manager (BEM) & Service Impact Manager (SIM), and ProactiveNet. Each core component was a separate product acquired from a separate company at a separate time in 1994, 2003 and 2007 respectively.

BPPM made its first release 8.0 in 2010. In 2014, BPPM (BMC ProactiveNet Performance Management) was re-branded with a new name called TrueSight (BMC TrueSight Operations Management) after integrating with another acquired product 'TrueSight IT Data Analytics'.

Since BPPM 8.0, BMC has made several releases. The current release is 9.6. Between 8.0 and 9.6, BMC made major architecture and feature changes in releases 8.6, 9.0, and 9.5. When a new release contains major architecture changes, it means that you cannot take advantage of the new features available in the new release without re-implementing most of the existing BPPM implementation.

When trying to understand any monitoring software for your IT organization, there are primarily five aspects you need to look into:
1) Implementation (both new implementation and upgrade)
2) Development (adding or modifying features not available out-of-box)
3) Integration (integrating with other monitoring software and ticketing software)
4) Administration (configuration and troubleshooting on demand)
5) Operation (performing pre-configured actions repeatedly)

You need to consider all five aspects for technical information when making a decision on purchase and resource planning. Of course, you need to consider other non-technical information too such as support, documentation, license cost, etc.

In the next few posts, I will go into depth sharing my experience on these five technical aspects with you.

Monday, March 23, 2015

Project Management in BPPM Implementation - Part 8: Common mistakes in BPPM project management

As the final post in the project management series, let's talk about the common mistakes in BPPM implementation projects. These common mistakes cause BPPM implementation projects to fall behind schedule, run over budget, and fail to meet the business requirements.

1. No planning phase: This mistake is usually seen with new BPPM customers. Many new BPPM customers start BPPM implementation by immediately jumping into installation without architecture and capacity planning. 9 out of 10 times, whatever they had installed ended up being uninstalled and reinstalled after they eventually figured out the right architecture and capacity.

2. Not planning the initial implementation phase based on available budget: This mistake is usually caused by unrealistic expectation. For example, one BPPM customer I talked to wanted to complete the entire BPPM implementation on $23K budget. If you scale down the initial implementation phase on both features and deployed servers to fit available budget, you can have a small but working BPPM implementation in production. Once BPPM starts adding values in your business, you have a much better chance to receive more budget for the next phase. A BPPM implementation that had to be stopped half way when running out of budget is a big waste of time and money.

3. No WBS (Work Breakdown Structure): This is usually caused by lack of BPPM implementation experience. Without WBS, it is very difficult to track progress and forecast schedule/budget. Not only do you need to know what tasks each work package should include, you also need to know how to verify the completion of all tasks in each work package.

4. Hiring one external resource to implement multiple product lines: Each BMC product line has its unique best practices and complexity. It is rare for one consultant to have top expertise in both BPPM and another product line such as ITSM. Just as you see separate doctors for your dental and vision needs, it is much better to hire different resources to implement different product lines.

5. Hiring multiple external resources to implement single product line: Many people believe doubling the number of people in the project will cut the project time in half. That is not true in BPPM implementation. First, more resources increase the number of communication channels, as the number of communication channels = n*(n-1)/2. More importantly, due to rapidly added new features and the uniqueness of each IT environment, 'road bumps' are inevitable in your BPPM implementation. Each 'road bump' can delay your project by several days. When you have multiple external resources, each one of them needs to go through the 'road bumps' individually. Having a single external resource can minimize the delays caused by those 'road bumps', thus minimizing the total number of man-hours (cost) to complete the project. It is better to use internal resources for on-demand help such as PATROL agent deployment than to hire additional dedicated external resources.

6. Completely hands-off when outsourcing: Outsourcing is a great way to jump-start your BPPM implementation project. But outsourcing is not a hands-off process. You must be proactive in requirement analysis, risk management, and knowledge transfer. Remember that your outsourcing organization has different goals and priorities than yours, and your organization is ultimately responsible for seeing a positive ROI from your BPPM implementation. Perform a milestone inspection at each milestone before moving forward. If you don't have BPPM expertise in house, hire an expert from a 3rd party to help perform the milestone inspections.
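
The communication-channel formula quoted in mistake 5 can be illustrated with a short Python sketch. The function name and the sample team sizes are my own choices for illustration; only the formula n*(n-1)/2 comes from the text above.

```python
def communication_channels(n: int) -> int:
    """Pairwise communication channels among n people: n*(n-1)/2."""
    if n < 0:
        raise ValueError("team size must be non-negative")
    # n*(n-1) is always even, so integer division is exact.
    return n * (n - 1) // 2


# Hypothetical team sizes, chosen only to show how quickly channels grow.
for team_size in (1, 2, 4, 8):
    print(team_size, "people ->", communication_channels(team_size), "channels")
```

Going from 2 people to 4 doubles the head count but raises the number of channels from 1 to 6, which is why adding dedicated external resources rarely shortens the schedule proportionally.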

Monday, March 16, 2015

Project Management in BPPM Implementation - Part 7: Get an expert for milestone inspection

In today's economy, IT organizations are trying to do more with less. Even if you understand the complexity and magnitude of your BPPM implementation project, you may not be able to get as much budget as you requested. For example, one of the BPPM customers told me that they could only get $23K for the entire BPPM implementation project, compared to $230K.

With a limited budget, when it comes to hiring consulting resources for a BPPM implementation project, many IT organizations have to hire the ones with the lowest cost even if they are not confident about the resources' competencies. What is worse is that they are not aware of many problems in their BPPM implementation until the project is claimed to be "completed". Then they have to re-do part of the implementation if they are lucky enough to get additional budget, or live with the crippled implementation, as many IT organizations won't be able to get additional budget.

The #1 reason that many BPPM implementation projects are behind schedule and over budget is re-work. For example, hundreds of PATROL agents were deployed 2-3 times because the PATROL agent deployment packages had to be rebuilt after they were already deployed. If you can minimize re-work, you would have a much better chance to complete your BPPM implementation project on time, within budget, and with high quality.

The most effective way to minimize re-work is through milestone inspection. At each project milestone, have an expert inspect what has been completed and look for any issues or potential issues. For the above example, your inspector should perform a quick check for anything that may cause re-work, including product versions, product patches, installation accounts, installation directories, integration services, cells, CMA tags, etc.

It would be great if you have BPPM expertise in house. If not, it is best to hire an external expert who has been doing BPPM implementation for a long time. Because it doesn't take long to perform a milestone inspection, this would be a small investment. But this small investment will save you lots of major headaches in the long run.

This is very similar to building a home. Several years ago, when I was building my first home, I knew nothing about home construction. But I hired Bob as my home inspector. Bob worked for home builders for many years before becoming a home inspector. Based on his inspection results, I was able to get my builder to fix a long list of problems and potential problems before closing on my home. My small investment in hiring a home inspector has saved me from costly home repairs.

If your consulting resources are sent by a consulting company, having milestone inspections performed by an expert in house or from a 3rd party can often make the consulting company raise your priority and send you their best consultants.

Monday, March 9, 2015

Project Management in BPPM Implementation - Part 6: Risk management

As with all projects, a BPPM implementation project brings changes, and risk is a prominent aspect of change. Project risk management helps you identify, describe, prioritize, quantify, respond to, and manage risks.

A risk is a future event that, if it occurs, will affect one or more project objectives. It is much better to start risk management in the planning phase, so that you can anticipate problems instead of being surprised in the middle of the implementation phase, and have an action plan to solve problems or prevent them from happening.

The BPPM implementation project risk management discussed in this post is a simplified version for the purpose of illustration. In real life it can be much more complex.

1. Risk identification

In a BPPM implementation project, a risk can be product related, resource related, or environment related. Review all project documents and pay special attention to WBS (Work Breakdown Structure) and project assumptions. Identify all risks that may impact project scope, time, cost, and quality. Document causes and effects for each risk.

Because each version of BPPM introduces quite a lot of brand new features but not all BPPM documents are updated with these new features, product related risks are inevitable. One example of a product related risk is that you may not be able to configure a PATROL KM in CMA because the configuration steps described in the KM document are for the PATROL console only. Now your scheduled 1-day task in the WBS would take you 3 days to complete because you have to get BMC support involved. This will obviously impact your project completion date and cost.

Resource related risks can be caused by the lack of experienced employees/consultants, as BPPM expertise is still rare. This is especially true if you rely on a consulting firm to supply you with consultants but have no opportunity to evaluate their competencies in advance.

Environment related risks can be caused by a dependency on another project - for instance, automated BPPM ticketing depends on the completion of a Remedy incident management project.

2. Risk analysis

In risk analysis, you evaluate each risk and decide on its likelihood/probability of occurrence and the magnitude of its impact. You can use a 5-level scale (very low, low, medium, high, very high) or a 3-level scale (low, medium, high) for likelihood and impact. The result of likelihood x impact is the risk exposure. Risk exposure helps you prioritize each risk.

For example, if the risk of not being able to configure a PATROL KM in CMA without contacting BMC support has a very high likelihood of occurrence, and the magnitude of its impact is also very high (delaying project completion for 2 days), the risk exposure will be very high and you need to assign a very high priority to this risk.
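
The likelihood x impact calculation can be sketched in a few lines of Python. This is only an illustration: mapping the 5-level scale to the numbers 1 to 5 is my assumption (the post does not prescribe numeric values), and the likelihood and impact ratings for the second and third risks are hypothetical.

```python
# Assumed numeric mapping for the 5-level scale; not prescribed by any standard.
SCALE = {"very low": 1, "low": 2, "medium": 3, "high": 4, "very high": 5}


def risk_exposure(likelihood: str, impact: str) -> int:
    """Risk exposure = likelihood x impact, using the assumed 1-5 mapping."""
    return SCALE[likelihood] * SCALE[impact]


# (description, likelihood, impact); ratings for the last two are hypothetical.
risks = [
    ("Remedy incident management project finishes late", "low", "medium"),
    ("PATROL KM cannot be configured in CMA without BMC support", "very high", "very high"),
    ("Assigned consultants lack BPPM experience", "medium", "high"),
]

# Highest exposure first, so the top of the list gets the highest priority.
ranked = sorted(risks, key=lambda r: risk_exposure(r[1], r[2]), reverse=True)

for description, likelihood, impact in ranked:
    print(risk_exposure(likelihood, impact), description)
```

With this mapping the KM configuration risk scores 5 x 5 = 25 and sorts to the top, matching the "very high priority" conclusion in the example above.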

3. Risk response

In risk response, you determine options and select actions to reduce the threats to project objectives. These include mitigation plans to reduce the likelihood of risk occurrence, contingency plans to reduce the impact, fallback plans, and workarounds.

For example, for the risk of lacking experienced consultants when you rely on a consulting firm to supply you with the resources, you may want to interview the assigned consultants before the project starts and also ensure they are dedicated to your project.

4. Risk monitoring and control

Risk management is an on-going process. During a BPPM implementation project, you need to update the risks that have occurred and those that will no longer occur, update risk analysis and response as needed, evaluate the effectiveness of risk response actions, and identify new risks.

Monday, March 2, 2015

Project Management in BPPM Implementation - Part 5: A Work Breakdown Structure (WBS) example

In a previous post, we discussed two common methods to determine the time and cost of a BPPM implementation: WBS (Work Breakdown Structure) and historical average. Regardless of which method you used when you requested your budget, a WBS is required when you start the initial implementation phase of a BPPM project.

A WBS decomposes a project into smaller work packages, and each work package has its own verifiable deliverables. A WBS allows you to plan resources, track progress, and make adjustments to the time and cost estimates.

Assuming that the planning phase is completed successfully with an architecture diagram and all the required hardware systems are in place, here is an example of a WBS for the initial implementation phase of a BPPM project. Please keep in mind that this is still a high-level WBS and you can decompose each task into several smaller tasks.

1. Install BPPM components with required high availability

2. Configure BPPM components and inter-component communication

3. Create PATROL installation packages

4. Configure staging integration service and create staging policies

5. Deploy PATROL installation packages on a few test servers

6. Install PATROL classic console or PATROL central console

7. Create monitoring policies in CMA for operating systems, log files, and a few selected databases/applications

8. Create tagging and grouping policies in CMA

9. Test monitoring solutions on deployed test servers

10. Deploy PATROL installation packages to a group of selected production servers

11. Test all configured monitoring solutions on all selected production servers

12. Customize PATROL agent thresholds

13. Create server thresholds (absolute and signature) for each monitoring solution

14. Create cell rules/policies to process PATROL agent events at each integration service cell

15. Create cell rules/policies to process intelligent events at server cell

16. Create cell rules/policies to process self-monitoring events

17. Propagate events to server cell

18. Set up event (or data) blackout

19. Integrate BPPM with LDAP (optional)

20. Create user interface for administrators, operators, and developers

21. Set up alert email or other notification method to end users in selected business units

22. Deploy BPPM to production servers

23. Perform user acceptance test on production BPPM

24. Complete documentation

25. Perform knowledge transfer
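
To show how a WBS like the one above can be used to track progress and adjust estimates, here is a minimal Python sketch. The `WorkPackage` class, the `progress` and `schedule_variance` helpers, and all day estimates are hypothetical and only illustrate the idea; they are not part of BPPM or any BMC tool.

```python
from dataclasses import dataclass


@dataclass
class WorkPackage:
    number: int
    name: str
    estimated_days: float      # hypothetical estimate made during planning
    actual_days: float = 0.0   # effort spent so far
    verified: bool = False     # True once the deliverable has been inspected


def progress(packages):
    """Fraction of total estimated effort that belongs to verified packages."""
    total = sum(p.estimated_days for p in packages)
    done = sum(p.estimated_days for p in packages if p.verified)
    return done / total if total else 0.0


def schedule_variance(packages):
    """Actual minus estimated days over verified packages (positive = behind)."""
    return sum(p.actual_days - p.estimated_days for p in packages if p.verified)


# First three tasks from the example WBS, with made-up numbers.
wbs = [
    WorkPackage(1, "Install BPPM components with required high availability", 5, 6, True),
    WorkPackage(2, "Configure BPPM components and inter-component communication", 3, 3, True),
    WorkPackage(3, "Create PATROL installation packages", 2),
]

print(f"progress: {progress(wbs):.0%}")
print(f"schedule variance: {schedule_variance(wbs):+.1f} days")
```

Counting a work package as done only after its deliverable is verified keeps the progress figure honest, which is the point of giving each work package verifiable deliverables.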
