Monday, July 29, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 5: Remedy integration (IBRSD)

In a previous post (Part 2), I mentioned that all BEM-level emails, tickets, and actions need to take place on remote cells in order to meet the no-downtime requirement for critical events in a hospital environment. This means that IBRSD for BEM-level events should not be located on the BPPM server, which can be down for up to 10 minutes during a failover. We also wanted to offload as many components as possible from the BPPM server to improve performance. In our BEM 7.4 implementation, we had 2 instances of IBRSD installed on 2 of our cell servers to achieve active/active high availability and load balancing, and they handled all of our ticket creation and updates well. We decided to keep the same architecture in BPPM 9.0.
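
To picture how the cells see the two IBRSD instances: each cell registers both gateways as destinations in its mcell.dir, so event propagation can target either instance and the load can be split between them. The entries below are only a sketch - the instance names, hosts, and port are placeholders, not our actual configuration:

    # mcell.dir on a cell server (illustrative entries; use your real instance names and ports)
    gateway.ibrsd   IBRSD_A   mc   cellsrv1.example.org:<port>
    gateway.ibrsd   IBRSD_B   mc   cellsrv2.example.org:<port>

This is the same pattern we used in BEM 7.4 and intended to carry into BPPM 9.0.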

However, in BPPM 9.0 IBRSD is available only as part of the BPPM server installation package, while we needed it as part of the BPPM agent installation package. We contacted BMC support but were told that they were not able to help at all. They did enter an enhancement request, so hopefully it will be packaged with the BPPM agent in a future release. In the meantime, we had to come up with a different way to install IBRSD on our cell servers. We copied the entire IBRSD installation directory from the BPPM server, added a few environment variables, and configured a new IBRSD instance in the copied directory. Fortunately, the instances on both cell servers worked well.
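
For anyone facing the same limitation, here is roughly what the workaround looked like on each cell server. The paths and the environment variable below are placeholders - mirror whatever your own BPPM server installation uses:

    REM copy the IBRSD directory tree from the BPPM server to the cell server
    robocopy \\bppmserver\D$\BMC\ProactiveNet\IBRSD D:\BMC\IBRSD /E

    REM recreate the environment variables the IBRSD binaries expect
    REM (variable name and value are placeholders - copy what the BPPM server defines)
    setx IBRSD_HOME "D:\BMC\IBRSD" /M

After that, the new instance was configured inside the copied directory exactly as it would be on the BPPM server.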

By now I have covered how we architected the BPPM server, BPPM agents, BPPM cells, and IBRSD in our environment for high availability, scalability, and performance. We used Microsoft Windows Clusters for the BPPM server and BPPM agents, and native application clusters for the BPPM cells. We installed the BPPM agent and integration service on the integration service nodes, and the BPPM cells, BMC Event Adapter, BMC Event Log Adapter, and IBRSD on the cell servers. By keeping the BPPM cells completely separate from the BPPM agents, we not only eliminated downtime for BPPM cell failover but also minimized the downtime for BPPM agent failover. In addition, this design offers better BPPM cell data protection because the event repositories are duplicated. As an added bonus, it costs less, since we need fewer Microsoft Windows Cluster licenses.

Our implementation is somewhat different from what BMC recommends. In various documents and best-practice webinars, BMC recommends co-locating BPPM agents and BPPM cells on the same server and using disk-level OS clusters to achieve high availability. Had we gone with that recommendation, we would have experienced not only longer failover downtime but also an increased risk that another cell could fail on the secondary node.

Here are the lessons learned so far: to realize the highest ROI on a BMC Software investment, business requirements should drive technical design. It is important to evaluate all options through due diligence, and performing due diligence requires management support and a systematic approach to testing and verifying the proposed model. Sometimes we need to think outside the box, as the IBRSD example shows.


Monday, July 22, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 4: BPPM agent and integration service

Based on BMC's sizing guideline of 250K attributes per BPPM agent, we installed 2 BPPM agents on 2 Windows 2008 servers - one for Portal data and the other for PATROL data. As we continue adding PATROL agents, we will add another BPPM agent on a 3rd server once the number of PATROL agents and attributes exceeds the current capacity.
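
To make that guideline concrete: if an average PATROL agent in your environment collects around 1,500 attributes (a number used purely for illustration - the real figure depends on which knowledge modules are loaded), a single BPPM agent would hit the 250K ceiling at roughly 160-170 PATROL agents. Counting the attributes your own PATROL agents actually collect is the only reliable way to know when that 3rd BPPM agent will be needed.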

For high availability, we decided on a disk-level OS cluster for the BPPM Portal agent, as we could not find another option. For the BPPM PATROL integration service and its BPPM agent, however, there is another option: active/active failover. We were excited when we first learned that the integration service supports active/active failover, because it looked similar to the BMC Portal architecture with active/active RSM failover. Upon closer examination, however, we realized a major difference: the BPPM integration service and its BPPM agent have no concept of a cluster, while the BMC Portal App server treats the active/active RSM pair as one cluster. Portal data for the same metric and the same instance are stored in the same database table regardless of which RSM was used as the "middleman". In contrast, when PATROL data are sent to the 2nd BPPM integration service because the 1st integration service is unreachable, they are not stored in the same BPPM database table as the data sent through the 1st integration service - and therefore are not displayed in the same graph.
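
For reference, the list of integration services that a PATROL agent can switch between is controlled by a PATROL agent configuration variable, which we push out with PATROL Configuration Manager or pconfig. A minimal sketch of the change file is below; the hostnames are placeholders and the port assumes the default integration service listening port, so adjust both to your environment:

    PATROL_CONFIG
    "/AgentSetup/integration/integrationServices" = { REPLACE = "tcp:isnode1:3183,tcp:isnode2:3183" }

The agent sends to the first entry and switches to the second when the first is unreachable - which is exactly the behavior that lands the data in a different table, as described above.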

We used the same scoring system described in the last post (Part 3) to compare a disk-level OS cluster with active/active failover for the BPPM Integration Service. The disk-level OS cluster scored 27 points (a partial yes to #1, and yes to #2 and #3), while active/active failover scored 22 points (yes to #1 and #3, plus bonus #6). Therefore, we decided to use Microsoft Windows Cluster for all integration service nodes where BPPM agents are installed.
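
For those checking the arithmetic: 27 = 7 (the partial yes on #1) + 10 (#2) + 10 (#3), and 22 = 10 (#1) + 10 (#3) + 2 (the bonus for #6).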

Because all BPPM cells are installed on separate servers, as I mentioned in my last post (Part 3), only the BPPM agent and integration service run on the integration service nodes. We disabled all event-related components such as cells, the event adapter, and the event log adapter on the integration service nodes, because they already run on separate servers (which we refer to as cell servers). In a Microsoft Windows Cluster, all clustered services must be included in the failover group; when one service fails, the entire group must be moved to the secondary node. The more services in the failover group, the longer it takes to move the group and the higher the risk that another service will fail on the secondary node. Disk-level OS clusters such as Microsoft Windows Cluster always involve some downtime during failover, but we managed to limit it to under 5 minutes by including only the BPPM agent and Integration Service in the failover group.
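
Concretely, our failover group ended up containing only the usual cluster plumbing - the shared disk, the cluster IP address, and the network name - plus two Generic Service resources, one for the BPPM agent service and one for the Integration Service, with the services depending on the shared disk. This is a description of our layout rather than a recipe; resource names and dependencies will differ from one environment to the next.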

In the next couple of posts, I will go through the configuration details of those event-related components.


Monday, July 15, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 3: BPPM cell

It is a common misconception that the Integration Service processes both data and events. In fact, the Integration Service processes only PATROL data, and the BPPM cell processes only events. In most of BMC's architecture diagrams, the Integration Service and the BPPM cell are co-located on the same server (called an Integration Service node). In reality, the two are not directly related and are not required to be installed on the same server. In this post, I will focus on the BPPM cell; the BPPM agent and Integration Service will be discussed in the next post.

In our BEM 7.4 environment, we have 9 pairs of cells running in high availability as application clusters, with 7 cells on Windows 2003 servers and 2 cells on Linux RHEL 5.5 64-bit servers. We were happy with this configuration, as we never experienced downtime, even during the BEM upgrade from 7.2 to 7.4. Since BMC suggested a disk-level OS cluster for BPPM cells in BPPM 9.0, we decided to do a side-by-side comparison between an application cluster and an OS cluster. We used a 10-point scoring system for the following 4 criteria: 1) can the (cell) pair fail over with no downtime; 2) is the (cell) pair a cluster (treated as one by its consumers); 3) can the sender automatically switch to the 2nd destination when the 1st destination is unreachable; 4) can the sender buffer the content and resend it when the destination is unreachable. Each yes is worth 10 points and each no is worth 0; a partial yes gets a score between 1 and 9. We also added 3 bonus points for 5) automatically backing up data storage, and 2 bonus points for 6) lower hardware and OS cost.
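
To put numbers to the system: an option that earns a clean yes on all four core criteria plus both bonuses reaches the maximum of 4 x 10 + 3 + 2 = 45 points, and a partial yes on any criterion simply replaces that criterion's 10 with the partial score.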

Our comparison showed that the cell application cluster option scored 45 points (yes to all 6) while the OS cluster option scored 37 points (a partial yes to #1 and yes to #2, #3, and #4). So we kept the same high availability configuration as in BEM 7.4. We installed our Windows cells on Windows 2008 servers with the 'cell only' option, not sharing the servers with the BPPM agent and Integration Service. We decided to defer the Linux cell migration to the next phase to minimize the involvement of another organization. Our testing has shown that BEM 7.4 cells integrate well with BPPM 9.0 cells, since very little has changed in the cell features and architecture.

To take advantage of the BPPM server's analytic features, we added another pair of external BPPM cells for BPPM internal events, since we wanted all events to be processed in external cells first. By default, all BPPM internal events are sent to the embedded cell on the BPPM server. As of BPPM 9.0.20, we were able to make a configuration change in pronet.conf on the BPPM server so that all internal events are sent to an external cell.
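
I will not reproduce the pronet.conf property here (check the BPPM 9.0.20 documentation or BMC support for the exact parameter), but the other half of the change is worth illustrating: the external cell pair must be registered in the BPPM server's mcell.dir so the server can reach it. A sketch with placeholder names and the default cell port of 1828:

    # mcell.dir on the BPPM server (illustrative entry only)
    cell   InternalEventsCell   mc   cellsrv1.example.org:1828   cellsrv2.example.org:1828

Listing two host:port pairs on one line is what makes the destination an application-cluster pair - the sender contacts the primary first and fails over to the secondary when the primary is unreachable.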

Monday, July 8, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 2: BPPM server

We installed BPPM Server 9.0 on a physical Windows 2008 server with 32GB of memory. From what we had learned with BPPM 8.1 and 8.5, we knew that the BPPM server is resource intensive, so we doubled the amount of memory BMC recommended and tried to offload as many components as possible to other servers. We configured the BPPM database on an Oracle RAC instance running on AIX servers.

For high availability, we decided to use Microsoft Windows Cluster, as recommended by BMC. We were fully aware that it could take up to 10 minutes for the secondary server to resume operation after the primary server fails - meaning no data collection, no service impact updates, and no GUI access for up to 10 minutes. But we could not find a better option, because BMC still does not support application-level failover for the BPPM server.

Data in the BPPM server are sampled every 5 minutes, even though some data can be collected more frequently at the data source (e.g., the PATROL agent). Unlike events, data are simply thrown away, not buffered, when the destination is unreachable. With a 5-minute sampling interval, a failover of up to 10 minutes means we would miss 2-3 data points. We decided we could accept this, because trended data are not as critical as availability events in a hospital environment.

Similarly, we can accept the loss of service impact updates for 10 minutes as long as the lower-level critical events can still generate emails and incident tickets. This means we will use the embedded SIM cell only for emails and incident tickets related to service models. All raw events will be sent to remote cells first for BEM-level emails, tickets, and actions, and then forwarded to the embedded SIM cell for service impact analysis and probable cause analysis. In a later post, I will go through the details of how we configured high availability for the remote cells.
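
As a small illustration of the forwarding piece: each remote cell needs the embedded SIM cell defined as a destination in its mcell.dir, and the cell knowledge base then propagates the relevant events to it. The entry below is only a sketch - the embedded cell's name, host, and port are placeholders for whatever was chosen during the BPPM server installation:

    # mcell.dir on each remote cell server (illustrative entry only)
    cell   pncell_bppmserver   mc   bppmserver.example.org:1828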

Unfortunately, the lack of GUI access for 10 minutes is not something we can accept. Ten minutes could mean the difference between life and death in a hospital environment, and a 10-minute enterprise visibility blackout is simply not an option. In a later post, I will talk about how we overcame this limitation.

Since our Atrium version was older than 7.6.04, we could not install the Atrium CMDB Extension before installing BPPM Server 9.0. But we have learned that we can enable this option in the BPPM server later, after the Atrium upgrade, without re-installing the BPPM server.



Monday, July 1, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 1: Background

One of my recent clients is the largest municipal healthcare organization in the country, consisting of hospitals, nursing facilities, treatment centers, and community clinics. I have helped them implement BMC Event Manager (BEM), BMC PATROL, and BMC Portal, and integrate monitoring tools from other vendors with BEM.

We run 9 pairs of BEM cells, 2 instances of IBRSD, 2 instances of bii4P3, 2 instances of IIWS, and 2 instances of BMC Event Adapters to process 12,000 events from 7 different monitoring tools and generate 800 automated Remedy tickets per day. I refer to our architecture as a 'cell cloud' because this robust and flexible event-processing service is hosted on servers located in different data centers, on different operating systems, and, for a while, even on different versions of BEM. Every component in the cloud is configured for seamless high availability at the application level, and all events sent to the cloud are buffered, with no downtime and no transaction loss, to meet the hospitals' highest business requirements. Our event processing is based on a 'cell extension' approach in which I made extensive customizations to the out-of-the-box cell knowledge base. By eliminating policies and standardizing event processing with dynamic data tables, we made our BEM implementation powerful, flexible, and easy to maintain.

As a large BMC customer in the healthcare industry, we have been encouraged by BMC to migrate to BPPM, and we were constantly invited to attend BPPM briefings, roadmap sessions, demos, webinars, and Q&A sessions. Prior to BPPM 9.0, we participated in extensive evaluations of both BPPM 8.1 and BPPM 8.5, and we gave BMC extensive feedback on the limitations that had been holding us back from migrating to BPPM.

When we finally made the decision to migrate from BEM 7.4 to BPPM 9.0, our primary objective was to preserve all of the scalability, performance, flexibility, and high availability we had in BEM 7.4. We are proud that our 'cell cloud' survived an emergency data center failover during Hurricane Sandy with no downtime, and we did not want to compromise any of these capabilities when upgrading to BPPM 9.0.

In the next few posts, I will share my experience and the lessons learned from migrating BEM 7.4 to BPPM 9.0. Your comments are greatly appreciated.