Tuesday, April 28, 2015

Understand BPPM As A Decision Maker - Part 5: Implementation - thresholds

Once you have decided what to do with installation, console, and data, your next decision is about thresholds.  In my personal opinion, thresholds are the heart of enterprise system management because they determine what alerts and how many alerts you are going to receive.

A threshold is associated with a direction (above or below) and a severity.  If a threshold's direction is above and the severity is critical, when the parameter/attribute value is above the threshold, the threshold is violated and a critical alert is generated.  If a threshold's direction is below and the severity is warning, when the parameter/attribute value is below the threshold, the threshold is violated and a warning alert is generated.
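The rule above can be sketched in a few lines of Python.  This is only an illustration of the direction/severity logic, not BPPM or PATROL code; the names `Threshold` and `evaluate` and the sample values (90 and 10) are my own assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    value: float     # the threshold value
    direction: str   # "above" or "below"
    severity: str    # e.g. "warning" or "critical"


def evaluate(threshold: Threshold, reading: float) -> Optional[str]:
    """Return the alert severity if the reading violates the threshold, else None."""
    if threshold.direction == "above":
        violated = reading > threshold.value
    elif threshold.direction == "below":
        violated = reading < threshold.value
    else:
        raise ValueError(f"unknown direction: {threshold.direction!r}")
    return threshold.severity if violated else None


# Hypothetical thresholds, for illustration only.
cpu_critical = Threshold(value=90.0, direction="above", severity="critical")
disk_warning = Threshold(value=10.0, direction="below", severity="warning")
```

With these definitions, `evaluate(cpu_critical, 95.0)` yields a critical alert, while `evaluate(disk_warning, 12.0)` yields no alert because 12 is not below 10.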

If you are familiar with thresholds in general, keep in mind that thresholds in BPPM are more complicated than in most other enterprise system management software, for historical reasons.

The data collection agent PATROL was initially architected in 1995 as a completely self-contained system with its own local storage, thresholds, and alert system.  In other words, thresholds can be set in each PATROL agent.

When BMC merged PATROL, BMC Event Manager, and ProactiveNet into BPPM as one product, it became possible to store all or part of the data collected by PATROL agents in BPPM server database as a duplicate.  Therefore, thresholds can also be set at BPPM server in addition to each PATROL agent.

There are two kinds of thresholds: static thresholds and dynamic thresholds.  Static thresholds can be set in either PATROL agents or in BPPM server or both.  Static thresholds set in PATROL agents and in BPPM server work independently.  If you set static thresholds in both places, it is a manual effort to make sure there is no gap and no overlap between them.  Dynamic thresholds can only be set in BPPM server.

A static threshold has an absolute value.  For example, you can set your free disk percentage threshold at 10% so that you will receive an alert when you have less than 10% free disk space left.  There are many variations of static thresholds, which we will discuss in detail in the next post.

A dynamic threshold doesn't have an absolute value by itself.  The threshold value is calculated on the fly based on historical data values from a specified time period (for example, hourly, daily, or weekly).  A dynamic threshold is also called a baseline.  For example, you can set your CPU utilization threshold at 10% above the hourly baseline so that you will receive a high CPU alert when your current CPU utilization is more than 10% above the historical average value for the same period of the day.  There are many variations of dynamic thresholds, and a dynamic threshold can also be combined with a static threshold to make it more flexible.  We will discuss dynamic thresholds in more detail in a later post.
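To make the idea concrete, here is a minimal Python sketch of an hourly baseline.  It is a simplification under my own assumptions: the baseline is a plain average per hour of day, "10% above" is treated as a relative margin, and the function names and sample numbers are hypothetical.  It does not reproduce how BPPM actually computes its baselines.

```python
from collections import defaultdict
from statistics import mean


def hourly_baseline(history):
    """Map hour of day (0-23) to the mean of the historical values seen in that hour.

    history is an iterable of (hour_of_day, value) pairs.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in by_hour.items()}


def violates_dynamic_threshold(baseline, hour, current, margin=0.10):
    """True when current is more than `margin` (relative) above the baseline for that hour."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    return current > baseline[hour] * (1 + margin)


# Hypothetical CPU utilization history: busy at 9 AM, quiet at 2 AM.
history = [(9, 40.0), (9, 50.0), (9, 60.0), (2, 5.0), (2, 15.0)]
baseline = hourly_baseline(history)
```

With this sample history, a reading of 54 at 9 AM does not violate the threshold, but a reading of 12 at 2 AM does, because the 2 AM baseline is much lower.  No single static value could express both cases.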

As a decision maker, the first thing to do is to determine which types of data should get static thresholds and which should get dynamic thresholds.  Examples of data types include: availability (status), number of errors/failures, percentage of errors/failures, percentage of capacity utilization, resource utilization per server, resource utilization per component, response time, and wait time.  Hold a brainstorming session with your implementation team and your end users to list all types of data collected in your environment.  Then decide how you want to set their thresholds.

Monday, April 20, 2015

Understand BPPM As A Decision Maker - Part 4: Implementation - console and data

Once you have decided what to do with installation of BPPM server(s), BPPM integration service(s), BPPM cells, and PATROL local/remote agents, your next decision is about console - the user interface that your operations support staff will use to interact with BPPM on a daily basis.  A related decision is about data - what data you want to save in BPPM server database.

When you install BPPM server, you automatically install BPPM operations console, a web interface that displays data and events.  Historically, BPPM operations console evolved from the native ProactiveNet console, as BPPM is a product integrated from PATROL, BMC Event Manager, and ProactiveNet.  What about the native PATROL console?  Do you need to install it?

It is BMC's intention to replace PATROL console with BPPM operations console.  At this time, BPPM operations console still cannot replace all the features available in PATROL console, though your operations support staff can perform most of their work with BPPM operations console.  However, menu commands that help you configure PATROL KMs interactively or diagnose system issues are only available with PATROL console.

A key differentiator here is where data are stored.  PATROL was initially architected back in 1995 when networks were not as fast and reliable as they are today.  All PATROL data are stored locally.  When PATROL was integrated into BPPM, it became possible to save selected or all PATROL data in BPPM server database.

When you are using PATROL console, you are viewing PATROL data stored on each PATROL agent.  When you are using BPPM operations console, you are viewing PATROL data stored in BPPM server database.  You can use BPPM operations console to view PATROL data stored on each PATROL agent (but not in BPPM server database) as 'non-streamed' data on demand, but in my observation the data are consistently about 10 minutes old.

Along with a decision on PATROL console, you must decide what PATROL data you want to save in BPPM server database.  There are three major reasons to save PATROL data in BPPM server database: 1) To compute dynamic thresholds - Dynamic thresholds will be discussed in a later post; 2) To be included in reports - Data in report database are based on data in BPPM server database; 3) To see real-time (not 10-minute-old) data in BPPM operations console.

It would be nice to save all PATROL data in BPPM server database so that you don't have to decide which data to eliminate.  If you have a small site, this is entirely possible.  But if you have a large site, you must consider the cost and complexity of multiple BPPM servers as each BPPM server can only contain 1,700,000 attributes or 250,000 instances.  It is a trade-off between having more PATROL data in BPPM server database and keeping the cost of BPPM servers under control.
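A rough way to see where you stand against those limits is sketched below in Python.  It assumes that whichever limit is reached first determines the server count and that the load can be split evenly across servers; both are my simplifications, and the function name and sample counts are hypothetical.  Real capacity planning should also leave headroom for growth.

```python
import math

# Per-server limits quoted above.
MAX_ATTRIBUTES_PER_SERVER = 1_700_000
MAX_INSTANCES_PER_SERVER = 250_000


def servers_needed(total_attributes: int, total_instances: int) -> int:
    """Estimate the minimum number of BPPM servers for the given data volume."""
    by_attributes = math.ceil(total_attributes / MAX_ATTRIBUTES_PER_SERVER)
    by_instances = math.ceil(total_instances / MAX_INSTANCES_PER_SERVER)
    # Whichever limit is hit first governs; always need at least one server.
    return max(1, by_attributes, by_instances)
```

For example, `servers_needed(1_500_000, 600_000)` returns 3: the attribute count alone would fit on one server, but the instance count does not.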

In summary, in order to decide whether PATROL console should be implemented, you need to have an in-depth discussion with your implementation team (consultants or employees) and operations staff regarding the need for PATROL console for PATROL KM configuration, their preferred tools for troubleshooting, and guidelines on what data should be stored in BPPM server database.

Monday, April 13, 2015

Understand BPPM As A Decision Maker - Part 3: Implementation - installation

Many people tend to believe that BPPM implementation is simply BPPM installation.  In fact, installation is only a small part of implementation.  We will discuss some key points about installation here and leave other parts of implementation to the next few posts.

Keep in mind that implementation doesn't start with installation.  It starts with planning.

1) Do not start installation before capacity planning.  You need to decide how many BPPM servers, how many BPPM cells, and how many PATROL agents you need to install.  Not only do you need to consider your current business needs, you also need to consider future growth.

2) Do not start installation before high availability planning.  You need to decide if you want to have fail-over capability for BPPM servers, BPPM cells, and BPPM remote monitoring agents.  If high availability is desired, you also need to decide the level of high availability (application, OS, or VM level).  You must understand what kind of protection each level of high availability offers you.  The high availability decision is determined first by your business requirements and then by your implementation and maintenance budgets.

3) Do not start installation before remote monitoring planning.  Although most monitoring happens locally, some monitoring, such as VMware and ping monitoring, can only be done remotely, and some, such as OS monitoring, gives you a choice between local and remote monitoring.  Local and remote monitoring each have pros and cons that you must understand.

4) Do not start installation before you decide how you want to install each PATROL agent.  BPPM uses PATROL agents to collect data.  Although some collection can happen remotely, the majority of data must be collected locally.  Installing hundreds or thousands of PATROL agents is labor-intensive work, and different installation methods have different requirements.  You can choose manual installation, CMA-based installation, server duplication, the old PATROL Distribution Server, BMC Client Management software, or the BMC BladeLogic client, among others.  You need to understand how long each installation method will take and whether it requires an additional software license.

5) Do not start installation before you decide how you want to assign agent tags to each PATROL agent.  An agent tag tells what should be monitored on each server.  For example, an AIX server running an Oracle database should have an agent tag for the UNIX operating system and another agent tag for the Oracle database.  There are different ways to assign agent tags to each server, including as part of the installation package, as a post-installation script, or through CMA.  But you need to make the decision in advance - this is what I refer to as 'framework'.

6) Do not start installation before you decide on a naming convention for each BPPM component in your environment.  Without a consistent way to name each server, cell, integration service, and configuration file in your dev, QA, and production environments, sooner or later you will find yourself in a big mess.  This is another thing I refer to as 'framework'.
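The two 'framework' decisions in points 5) and 6) are easier to enforce once they are written down as rules rather than kept in people's heads.  The Python sketch below shows one way to do that.  Everything in it is hypothetical: the name pattern, the abbreviations, and the tag names are placeholders for whatever your team agrees on, and none of it is a BPPM or PATROL API.

```python
import re

# Hypothetical convention: <env>-<component>-<two-digit index>, e.g. "prod-cell-03".
ENVIRONMENTS = ("dev", "qa", "prod")
COMPONENTS = ("server", "cell", "is")  # "is" is a made-up abbreviation for integration service

NAME_PATTERN = re.compile(
    "(" + "|".join(ENVIRONMENTS) + ")-(" + "|".join(COMPONENTS) + r")-\d{2}"
)

# Hypothetical mapping from what runs on a server to the agent tags it should get.
TAGS_BY_ROLE = {
    "aix": "UNIX_OS",
    "linux": "UNIX_OS",
    "windows": "WINDOWS_OS",
    "oracle": "ORACLE_DB",
}


def make_name(env: str, component: str, index: int) -> str:
    """Build a component name that follows the convention."""
    if env not in ENVIRONMENTS or component not in COMPONENTS:
        raise ValueError(f"not covered by the convention: {env!r}, {component!r}")
    if not 1 <= index <= 99:
        raise ValueError("index must be between 1 and 99")
    return f"{env}-{component}-{index:02d}"


def is_valid_name(name: str) -> bool:
    """Check an existing name against the convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def agent_tags(roles):
    """Return the sorted, de-duplicated agent tags for a server's roles."""
    unknown = [role for role in roles if role not in TAGS_BY_ROLE]
    if unknown:
        raise KeyError(f"no agent tag defined for: {unknown}")
    return sorted({TAGS_BY_ROLE[role] for role in roles})
```

For the AIX server with Oracle mentioned in point 5), `agent_tags(["aix", "oracle"])` returns one tag for the operating system and one for the database.  Whether the tags are then applied by the installation package, a post-installation script, or CMA is the decision you still have to make.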

Your implementation team (consultants or employees) should be able to describe the various options, along with their pros and cons, for each of the above decisions.  You, as a decision maker, should decide together with your implementation team whether and how to go ahead with BPPM from a resource point of view.

Tuesday, April 7, 2015

Understand BPPM As A Decision Maker - Part 2: Required skills

In the previous post, I listed five aspects to be considered for any enterprise monitoring software.  Because each aspect requires a different skill set, each can potentially require separate resources, though some resources can be shared.

1) Implementation - requires designing, capacity planning, installation, and framework creation skills.
2) Development - requires designing, coding, and framework creation skills.
3) Integration - requires coding, 3rd-party software, and framework creation skills.
4) Administration - requires user interaction, configuration, and framework-following skills.
5) Operation - requires user interaction, monitoring, and procedure-following skills.

In general, aspects 1), 2), and 3) are the responsibilities of one (or more) implementation team(s).  Aspects 4) and 5) are the responsibilities of separate (or combined) administration and operations teams.  In this post, I want to spend some time addressing how they relate to each other.  To make it more intuitive, I am using my previous 'home construction' example again.  In home construction, here is what each aspect looks like:

1) Implementation - drawing blueprints and building a house
2) Development - replacing a standard jacuzzi bathtub with a wheelchair-accessible shower
3) Integration - installing solar panels on the roof
4) Administration - installing ceiling fans or repairing a leaking kitchen sink
5) Operation - cleaning the floor when it is dirty or turning porch lights on at dusk and off at dawn every day

For resources, here is what each aspect looks like:

1) Implementation - builder
2) Development - specialized crew from the same builder
3) Integration - specialized crew from the same builder or another company
4) Administration - handyman
5) Operation - homeowner

From the above comparison, here are some general points to help you understand BPPM as a decision maker:

1) Begin with the end in mind: How do you want to use enterprise monitoring software in your IT organization?  For example, do you want your operations team to use this tool to perform root cause analysis, or do you want to leave root cause analysis to the assigned system administrator or DBA?  The more you know about how you will use the monitoring software, the easier your decision will be.

2) Good implementation is crucial: Implementation is like the plumbing of a house.  It would be very expensive to add another bathroom if the plumbing wasn't there in the first place.  If the implementation is not done correctly, you may have no choice but to redo the entire implementation.  For example, one of my previous clients handed me a partially implemented environment where BPPM and Entuity were running on the same server.  I could not find any way to repair it and had to redo the entire implementation on two separate servers.

3) Not all BPPM experiences are the same: BPPM experience in implementation and BPPM experience in administration are two different kinds of experience, though they share some common technical skill sets.  An implementation team's goal is to create a consistent framework that the administration and operations teams can adopt quickly without consulting the implementers, so that the implementers can move on to another implementation project.  An administration team's goal is to use special knowledge in configuration so that they have 'job security' to stay forever.

4) One-time vs repetitive work: Aspects 1), 2), and 3) are part of an implementation project.  They are a one-time cost.  Aspects 4) and 5) are repetitive operations.  They are a recurring cost.  If an implementation project is done right, it can cut recurring operation cost tremendously because the amount of work required of administrators and operators is minimized.  From my recent observation, many organizations did not have their BPPM implementation done right and are now trying to hire the best full-time administrators to make up for it.  Keep in mind, the supply of 'super handymen' is very limited.  At some point, it may cost less to redo the incorrect implementation than to rely on a permanent 'super handyman'.