Monday, August 26, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 9: Cell extension and architecture

In BEM 7.4, we customized our MRL rules extensively to standardize event processing regardless of event source. Because our events come from 7 different event sources (BMC PATROL, BMC Portal, and 5 other vendors' monitoring tools), we didn't want to write 7 different sets of rules to process them. We wanted all events to share the same processing rules as much as possible.

In our standardized event processing rules, each event goes through the following stages: mapping, conversion, filtering, host/device look-up, action look-up, blackout look-up, aggregation/correlation, update, email notification, ticketing, action, and forwarding. Only at the mapping and conversion stages do events from different event sources have their own processing rules. From the filtering stage onward, all events share the same processing rules. This has allowed us to integrate events from any event source into BEM/BPPM cells in a matter of days or even hours.
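The staged design above can be sketched outside of MRL. The following is a minimal Python illustration, not our actual rules: the field names, the two mapper functions, and the two shared stages are hypothetical and stand in for the much longer list of real stages. It only shows the shape of the idea, with source-specific stages first and shared stages after.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]
# A stage returns the (possibly modified) event, or None to drop it.
Stage = Callable[[Event], Optional[Event]]


# --- Source-specific stages (hypothetical mappers, one per event source) ---
def map_patrol(raw: Event) -> Event:
    return {"source": "patrol", "host": raw["hostname"], "msg": raw["text"]}


def map_vendor_a(raw: Event) -> Event:
    return {"source": "vendor_a", "host": raw["node"], "msg": raw["summary"]}


SOURCE_STAGES: Dict[str, List[Stage]] = {
    "patrol": [map_patrol],
    "vendor_a": [map_vendor_a],
}


# --- Shared stages (identical for every source once events are normalized) ---
def filter_noise(ev: Event) -> Optional[Event]:
    # Hypothetical filter: drop anything that looks like a test message.
    return None if "test" in ev["msg"].lower() else ev


def lookup_host(ev: Event) -> Optional[Event]:
    # Hypothetical look-up: normalize the host name.
    return {**ev, "host": ev["host"].lower()}


SHARED_STAGES: List[Stage] = [filter_noise, lookup_host]


def process(source: str, raw: Event) -> Optional[Event]:
    """Run source-specific stages first, then the shared stages."""
    ev: Optional[Event] = raw
    for stage in SOURCE_STAGES[source] + SHARED_STAGES:
        ev = stage(ev)
        if ev is None:  # a stage dropped the event
            return None
    return ev
```

In this sketch, adding a new event source means writing one mapper and registering it in SOURCE_STAGES. Nothing in the shared stages changes, which is the property that keeps integration time short.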

The advantage of using cell policies is that you don't have to know MRL programming. But policies slow down cell processing considerably. And, most importantly, there is no policy equivalent for the execute rule. Since we had already customized our MRL rules extensively, policies offered us no advantage, so we disabled all out-of-box policies. We also enforced a strict naming convention to make our rules easy to support and upgrade. We have about 30 custom rule files supporting over 20 advanced features.

Our cells were architected in three levels. The first level is for look-up. Each event source has its own first level cell so that if one event source is having an event storm it won't affect the events from other event sources. The second level is for update and notification. All event sources share the same second level cell so that events can be correlated easily. The third level is for service impact.
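The isolation benefit of the first level can be illustrated with a small Python sketch. This is only an analogy under stated assumptions: real cells are separate processes, and the bounded buffer, the capacity value, and the class names here are hypothetical. The point is that an event storm from one source fills only that source's first-level buffer.

```python
from collections import deque
from typing import Deque, List, Tuple


class FirstLevelCell:
    """Hypothetical per-source buffer standing in for a first-level cell."""

    def __init__(self, source: str, capacity: int) -> None:
        self.source = source
        self.capacity = capacity
        self.buffer: Deque[str] = deque()
        self.dropped = 0

    def receive(self, event: str) -> bool:
        # When this source is flooding, only its own buffer overflows.
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(event)
        return True


class SecondLevelCell:
    """Shared cell where events from all sources meet for correlation."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []

    def drain(self, cell: FirstLevelCell) -> None:
        while cell.buffer:
            self.events.append((cell.source, cell.buffer.popleft()))


def demo() -> Tuple[int, int, List[Tuple[str, str]]]:
    first = {name: FirstLevelCell(name, capacity=3) for name in ("patrol", "vendor_a")}
    second = SecondLevelCell()
    for i in range(10):  # simulated event storm from one source
        first["vendor_a"].receive(f"storm-{i}")
    first["patrol"].receive("disk full")  # unaffected by the storm
    for cell in first.values():
        second.drain(cell)
    return first["vendor_a"].dropped, first["patrol"].dropped, second.events
```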

During our migration from BEM 7.4 to BPPM 9.0, we migrated our first-level and second-level cells as BPPM 9.0 remote cells located on their own servers, as I described in Part 3. All our custom MRL rules were migrated into the new cells with little change since our customization was kept in separate files. The embedded cell on the BPPM server will replace our old third-level service impact cell. Since we did little work on service impact in BEM 7.4, we plan to do a new implementation for service impact once our Atrium upgrade is completed.

The only major change we had to make was to the custom GUI display. In BEM 7.4, we made several display templates for administrators, developers, and service desk operators in its Java GUI (BMC Impact Explorer). Since there is no direct migration path from the Java GUI to the web GUI, we had to re-create all templates in the BPPM 9.0 web GUI.



Monday, August 19, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 8: Dual GUI

Back in Part 2, I mentioned that one of the major limitations in migrating BEM 7.4 to BPPM 9.0 is the lack of GUI access for up to 10 minutes during BPPM server failover. We are a hospital environment, and our enterprise service desk operators monitor the BEM/BPPM GUI 24x7 to escalate trouble ticket acknowledgement and processing.

In BPPM 9.0, a web GUI is used as the operations console. Because the web server is located on the BPPM server and it takes up to 10 minutes for the secondary BPPM server to resume operation during BPPM server failover, our service desk would experience a total enterprise blackout for up to 10 minutes. This limitation does not meet our business requirement in a hospital environment, and it had been holding us back from migrating to BPPM sooner. To overcome it, we had to think outside the box again.

In BEM 7.4, a Java GUI (BMC Impact Explorer) is used as the operations console. All cells and login servers are set up in their native application-level failover with no downtime. During a failover, our service desk operators see a yellow highlight for several seconds before all operations resume. We decided to see if we could mix BPPM 9.0 cells with BEM 7.4 login servers and BMC Impact Explorer.

We made no change to BPPM 9.0 configuration on BPPM server, BPPM agents, and BPPM cells. We kept a pair of BEM 7.4 login servers (also called admin servers) on two separate Windows servers. We simply registered all BPPM 9.0 cells with these two BEM 7.4 login servers. Now our service desk operators can continue using the Java GUI (BMC Impact Explorer) to access BPPM 9.0 cells.

During BPPM server failover, the only cell that our service desk operators cannot see for up to 10 minutes is BPPM main cell - which displays intelligent events generated by BPPM server and service impact only. All alerts raised by monitoring tools, all email notifications, and all automated Remedy ticket generation are displayed and processed by remote cells with application-level failover. Our service desk operators can continue seeing all of them during BPPM server failover. Absolutely no downtime and no enterprise blackout! We were so thrilled to see how great the hybrid configuration worked.

For ESM administrators and operations support, we can pick and choose between the BPPM 9.0 web GUI and the BEM 7.4 Java GUI. The BPPM 9.0 web GUI allows us to associate data with events, while the BEM 7.4 Java GUI gives us fast access to events and dynamic tables. By keeping both, we not only avoided a total enterprise blackout but were also able to convince everyone to finally migrate from BEM 7.4 to BPPM 9.0.

Monday, August 12, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 7: PATROL

We run PATROL on both AIX and Windows. In addition to monitoring operating systems, log files, and VMware, we have developed custom PATROL knowledge modules to monitor CACHE database, media manager, Veritas Cluster Server, and clinic applications. All our PATROL agents were upgraded to or deployed at version 9.0 so we could use the automatic workflow to push PATROL data to the integration service and PATROL events to the BPPM cell.

As I mentioned in my previous posts, all our BPPM components are configured for high availability to meet the highest business requirements in a hospital environment. For PATROL data, the high availability of all integration services and BPPM agents is configured through Microsoft Windows Cluster. We put the clustered server name in pconfig variable "/AgentSetup/integration/integrationServices". For PATROL events, the high availability of all BPPM cells is configured through their native application cluster. We put both primary and secondary server names in pconfig variable "/EventSetup/Configuration/EventCells".

With PATROL agent 9.0 sending events directly to the cell through pconfig variables, bii4p3 (the PATROL event adapter) is no longer needed to deliver PATROL events to a BPPM cell. However, we kept bii4p3 running on all our test systems after the migration. Because pconfig variable "/EventSetup/Configuration/EventCells" can send PATROL events to only one cell, bii4p3 lets us receive PATROL events on both the production BPPM cell and the test BPPM cell at the same time for live troubleshooting when needed.

We had to change MRL rules in the BPPM cell to detect PATROL agent down or PATROL agent connection loss events for the direct PATROL agent to cell connection. These events are very different from the ones generated through bii4p3. We also had to develop a few rules to capture PATROL agent up or PATROL agent connection up events and match them to the corresponding PATROL agent down or PATROL agent connection loss events. In addition, we developed similar rules for the PATROL agent connection with the integration service. These infrastructure connection events, along with all other events reported by event sources, are fully integrated with email notification and the Remedy ticketing system at the back end.
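The up/down matching logic can be sketched in Python. This is an illustration of the idea only, not our MRL rules: the ConnEvent fields and the "down"/"up" labels are hypothetical, and real cell events carry many more slots.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ConnEvent:
    agent: str  # hypothetical identifier for the PATROL agent
    kind: str   # "down" (agent down / connection loss) or "up"
    time: int   # arrival time in seconds


def correlate(
    events: List[ConnEvent],
) -> Tuple[List[Tuple[ConnEvent, ConnEvent]], List[ConnEvent]]:
    """Pair each 'up' event with the open 'down' event for the same agent.

    Returns (closed pairs, down events that are still open).
    """
    open_down: Dict[str, ConnEvent] = {}
    closed: List[Tuple[ConnEvent, ConnEvent]] = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "down":
            # Keep the earliest open 'down' per agent; repeats are ignored.
            open_down.setdefault(ev.agent, ev)
        elif ev.kind == "up" and ev.agent in open_down:
            closed.append((open_down.pop(ev.agent), ev))
    return closed, list(open_down.values())
```

Down events that remain open after matching are the ones that would go on to notification and ticketing, while closed pairs can be resolved automatically.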

We had PATROL Central Console 7.5 (both Windows edition and web edition) as well as PATROL Classic Console 3.5 running before the migration. They still worked well with PATROL agent 9.0 after the migration, and we didn't find any need to upgrade them.


Monday, August 5, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 6: BMC event adapter mcxa

BMC Event Adapter (mcxa) is an adapter BMC provides to integrate SNMP traps into BPPM/BEM cells. It was developed in Perl and can be installed anywhere. Although most vendors nowadays can send SNMP traps when alerts are raised, we prefer to use OS scripts to integrate events from non-BMC monitoring tools into BPPM cells.

While OS scripts can be logged, buffered, and retried with seamless failover, SNMP traps usually cannot, meaning that the slightest network instability could result in trap loss. Because SNMP trap based event integration is less reliable and more difficult to troubleshoot, we only use it when the monitoring tool does not provide a way to execute OS scripts when alerts are raised. In addition, SNMP trap based event integration requires an adapter, while OS script based event integration makes a direct connection to a BPPM cell.
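To show what "logged, buffered, and retried with seamless failover" can look like in a script, here is a minimal Python sketch. It rests on assumptions: the send callable is a hypothetical stand-in for whatever actually delivers an event to a cell, the cell names are placeholders, and the retry count is arbitrary.

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger("event_sender")


def send_with_failover(
    event: str,
    cells: Sequence[str],
    send: Callable[[str, str], None],
    retries: int = 2,
    buffer: Optional[List[str]] = None,
) -> bool:
    """Try each cell in order, retrying the whole list up to `retries` times.

    `send(cell, event)` is a hypothetical delivery function that raises
    ConnectionError on failure. Undelivered events are appended to `buffer`
    so a later run can resend them.
    """
    for attempt in range(1, retries + 1):
        for cell in cells:
            try:
                send(cell, event)
                logger.info("delivered to %s on attempt %d", cell, attempt)
                return True
            except ConnectionError as exc:
                logger.warning("send to %s failed (attempt %d): %s", cell, attempt, exc)
    if buffer is not None:
        buffer.append(event)  # keep the event for a later retry
    return False
```

A plain SNMP trap sender has none of these hooks: a trap is a single UDP datagram, so the sender gets no acknowledgement and has nothing to log or retry on.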

Out of 5 non-BMC monitoring tools we have, only one is integrated into a BPPM cell using BMC Event Adapter (mcxa) because it cannot execute an OS script when an alert is raised.

Very little has changed for BMC Event Adapter (mcxa) from BEM 7.4 to BPPM 9.0. We first converted its MIB file to a map file. Then we configured BMC Event Adapter (mcxa). We had to change the default settings for the PollInterval, ReadsPerEngine, and SnmpRcvbuf parameters to maximize the capacity of mcxa and accommodate the large volume of incoming SNMP traps. We also had to double the default value of the SnmpTrapLength parameter to accommodate the large size of the incoming SNMP traps.

To increase reliability, we installed two instances of BMC Event Adapter (mcxa), one on each cell server. From the non-BMC monitoring tool, we configured the SNMP traps to be sent to both mcxa instances simultaneously. This dual configuration helps to minimize SNMP trap loss in case of network connection failure. It also helps to address the lack of an out-of-box high-availability feature in BMC Event Adapter (mcxa).

For the cell knowledge base, we made a minor change in the auto-generated mcsnmptrapdmib.baroc file so that we could write one rule instead of 50+ rules for all 50+ OIDs. We also added a de-duplication rule to remove the duplicated SNMP traps from the 2nd mcxa instance.
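The de-duplication idea can be sketched in Python. This is a hedged illustration, not the MRL rule itself: the Trap fields, the matching key, and the 5-second window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trap:
    source: str    # device or tool that raised the trap
    oid: str       # trap OID
    payload: str   # trap message content
    time: float    # arrival time in seconds
    adapter: str   # which mcxa instance delivered it (hypothetical field)


def deduplicate(traps: List[Trap], window: float = 5.0) -> List[Trap]:
    """Keep the first copy of each trap; drop copies arriving within `window`.

    Two traps count as duplicates when source, OID, and payload all match,
    regardless of which adapter instance delivered them.
    """
    last_kept: Dict[Tuple[str, str, str], float] = {}
    kept: List[Trap] = []
    for trap in sorted(traps, key=lambda t: t.time):
        key = (trap.source, trap.oid, trap.payload)
        previous = last_kept.get(key)
        if previous is None or trap.time - previous > window:
            kept.append(trap)
            last_kept[key] = trap.time
    return kept
```

The window matters because the same alert can legitimately recur later. Only copies that arrive close together are treated as the second adapter's echo of the first.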

For the rest of the cell knowledge base, we followed our standard procedures to map, convert, filter, correlate, update, define actions, execute actions, send email, and create tickets. In a later post, I will go into more detail about the standard procedures in our cell knowledge base that apply universally to events from all event sources.