Tuesday, April 28, 2015

Understand BPPM As A Decision Maker - Part 5: Implementation - thresholds

Once you have decided what to do with installation, console, and data, your next decision is about thresholds.  In my personal opinion, thresholds are the heart of enterprise system management because they determine which alerts, and how many, you are going to receive.

A threshold is associated with a direction (above or below) and a severity.  If a threshold's direction is above and the severity is critical, when the parameter/attribute value is above the threshold, the threshold is violated and a critical alert is generated.  If a threshold's direction is below and the severity is warning, when the parameter/attribute value is below the threshold, the threshold is violated and a warning alert is generated.
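The rule above can be sketched in a few lines of Python.  This only illustrates the logic and is not BPPM or PATROL code; the `Threshold` class, its fields, the `evaluate` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold record; not a BPPM or PATROL data structure."""
    value: float
    direction: str  # "above" or "below"
    severity: str   # for example "warning" or "critical"


def evaluate(threshold: Threshold, reading: float) -> Optional[str]:
    """Return the alert severity if the reading violates the threshold, else None."""
    if threshold.direction == "above":
        violated = reading > threshold.value
    elif threshold.direction == "below":
        violated = reading < threshold.value
    else:
        raise ValueError(f"unknown direction: {threshold.direction!r}")
    return threshold.severity if violated else None


# Assumed sample thresholds, chosen only for illustration.
cpu_critical = Threshold(value=90.0, direction="above", severity="critical")
disk_warning = Threshold(value=10.0, direction="below", severity="warning")

print(evaluate(cpu_critical, 95.0))  # above the threshold: critical alert
print(evaluate(disk_warning, 8.0))   # below the threshold: warning alert
print(evaluate(disk_warning, 25.0))  # no violation: None
```

The direction decides which comparison is used, and the severity only labels the alert that results.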

If you are familiar with thresholds in general, keep in mind that thresholds in BPPM are more complicated than in most other enterprise system management software, for historical reasons.

The data collection agent PATROL was initially architected in 1995 as a completely self-contained system with its own local storage, thresholds, and alert system.  In other words, thresholds can be set in each PATROL agent.

When BMC merged PATROL, BMC Event Manager, and ProactiveNet into one product, BPPM, it became possible to store all or part of the data collected by PATROL agents in the BPPM server database as a duplicate.  Therefore, thresholds can also be set at the BPPM server in addition to each PATROL agent.

There are two kinds of thresholds: static thresholds and dynamic thresholds.  Static thresholds can be set in PATROL agents, in the BPPM server, or in both.  Static thresholds set in PATROL agents and in the BPPM server work independently.  If you set static thresholds in both places, it takes manual effort to make sure there is no gap and no overlap between them.  Dynamic thresholds can only be set in the BPPM server.

A static threshold has an absolute value.  For example, you can set your free disk percentage threshold at 10% so that you will receive an alert when you have less than 10% free disk space left.  There are many variations of static thresholds, which we will discuss in detail in the next post.

A dynamic threshold doesn't have an absolute value by itself.  The threshold value is calculated on the fly based on historical data values from a specified time period (for example, hourly, daily, weekly, etc.).  A dynamic threshold is also called a baseline.  For example, you can set your CPU utilization threshold at 10% above the hourly baseline so that you will receive a high CPU alert when your current CPU utilization is more than 10% above the historical average value for the same period of the day.  There are many variations of dynamic thresholds, and a dynamic threshold can also be combined with a static threshold to make it more flexible.  We will discuss dynamic thresholds in more detail in a later post.
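As a rough illustration of the CPU example, the sketch below averages past readings per hour of the day and flags a reading that exceeds that average by a margin.  It rests on several assumptions: the baseline is taken as a plain average, "10% above" is read as 10 percentage points, and the function names and sample data are hypothetical.  How BPPM actually computes baselines is not described in this post and is likely more involved.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def hourly_baseline(history: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average historical readings separately for each hour of the day (0-23)."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in by_hour.items()}


def violates_dynamic(baseline: Dict[int, float], hour: int,
                     reading: float, margin: float = 10.0) -> bool:
    """True if the reading is more than `margin` points above that hour's baseline."""
    return reading > baseline[hour] + margin


# Assumed sample history of (hour of day, CPU utilization %) pairs.
history = [(9, 40.0), (9, 50.0), (9, 45.0), (14, 70.0), (14, 80.0)]
baseline = hourly_baseline(history)

print(violates_dynamic(baseline, 9, 60.0))   # 60 > 45 + 10, so True
print(violates_dynamic(baseline, 14, 80.0))  # 80 > 75 + 10 is false, so False
```

The same 60% reading that raises an alert at 9:00 would not raise one at 14:00, where the baseline is higher.  That is the difference from a static threshold, which uses one fixed value regardless of the time of day.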

As a decision maker, the first thing to do is to determine which types of data should get static thresholds and which should get dynamic thresholds.  Examples of data types include: availability (status), number of errors/failures, percentage of errors/failures, percentage of capacity utilization, resource utilization per server, resource utilization per component, response time, and wait time.  Hold a brainstorming session with your implementation team and your end users to list all types of data collected in your environment.  Then decide how you want to set their thresholds.
