Monday, July 8, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 2: BPPM server

We installed BPPM Server 9.0 on a physical Windows 2008 server with 32 GB of memory. From our experience with BPPM 8.1 and 8.5, we knew that the BPPM server is resource-intensive, so we doubled the amount of memory BMC recommended and offloaded as many components as possible to other servers. We configured the BPPM database on an Oracle RAC instance running on AIX servers.

For high availability, we decided to use Microsoft Windows Cluster as recommended by BMC. We were fully aware that it could take up to 10 minutes for the secondary server to resume operation after the primary server fails - meaning no data collection, no service impact updates, and no GUI access for up to 10 minutes. But we could not find a better option for BPPM server high availability, as BMC still does not support application-level failover for BPPM Server.

Data in the BPPM server are sampled every 5 minutes, even though some data can be collected more frequently at the data source (e.g. the PATROL agent). Unlike events, data are not buffered when the destination is unreachable; they are simply thrown away. We would therefore miss 2-3 data points during a BPPM server failover. We decided that this is a compromise we can accept, because trended data are not as critical as availability events in a hospital environment.
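The 2-3 figure follows from simple arithmetic on the 5-minute sampling interval and the 10-minute worst-case failover. Here is a minimal sketch of that calculation. The function name and the choice to count a sample that lands exactly on the edge of the outage as lost are my own assumptions for illustration, not anything taken from BPPM.

```python
def missed_samples(outage_start_s: int, outage_length_s: int, interval_s: int = 300) -> int:
    """Count sample instants (multiples of interval_s) that fall inside the outage.

    Assumption: samples are taken at fixed multiples of the interval, and a
    sample landing exactly on either edge of the outage counts as lost.
    """
    outage_end_s = outage_start_s + outage_length_s
    first = -(-outage_start_s // interval_s)  # ceiling division: first sample at or after the start
    last = outage_end_s // interval_s         # last sample at or before the end
    return max(0, last - first + 1)


if __name__ == "__main__":
    # Try every possible start offset (in seconds) within one sampling interval
    # for a 10-minute (600 s) failover.
    counts = {missed_samples(start, 600) for start in range(300)}
    print(sorted(counts))  # prints [2, 3]
```

A failover that begins exactly on a sampling instant loses 3 points; any other start time loses 2.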

Similarly, we can accept losing service impact updates for 10 minutes as long as the lower-level critical events can still generate emails and incident tickets. This means that we will use the embedded SIM cell only for emails and incident tickets related to service models. All raw events will be sent to remote cells first for BEM-level emails, tickets, and actions. They will then be forwarded to the embedded SIM cell for service impact analysis and probable cause analysis. In a later post, I will go through the details of how we configured high availability for the remote cells.
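To make the event flow above concrete, here is a toy model of it. This is only a sketch of the idea, not how a cell is actually configured: the class, method, and field names are all hypothetical, and it assumes that events are held until the SIM cell is reachable again, as the contrast with unbuffered data suggests.

```python
from collections import deque


class RemoteCell:
    """Toy stand-in for a remote cell: act on events locally, then forward them."""

    def __init__(self):
        self.notifications = []  # emails/tickets raised at the BEM level
        self.pending = deque()   # events waiting to reach the embedded SIM cell

    def receive(self, event, sim_cell_up):
        # Local handling does not depend on the SIM cell being reachable.
        if event["severity"] == "CRITICAL":
            self.notifications.append(event["msg"])
        self.pending.append(event)
        return self.flush(sim_cell_up)

    def flush(self, sim_cell_up):
        # Forward buffered events, oldest first, only while the SIM cell is up.
        forwarded = []
        while sim_cell_up and self.pending:
            forwarded.append(self.pending.popleft())
        return forwarded


if __name__ == "__main__":
    cell = RemoteCell()
    event = {"severity": "CRITICAL", "msg": "database unreachable"}
    print(cell.receive(event, sim_cell_up=False))  # nothing forwarded during failover
    print(cell.notifications)                      # the notification was still raised
    print(cell.flush(sim_cell_up=True))            # event forwarded once the SIM cell is back
```

The point of the design is that the notification path and the service impact path fail independently: a BPPM server failover delays impact analysis but does not delay the email or ticket for the underlying critical event.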

Unfortunately, the lack of GUI access for 10 minutes is not something we can accept. In a hospital environment, 10 minutes could mean the difference between life and death, so a 10-minute enterprise visibility blackout is simply not an option. In a later post, I will talk about how we overcame this limitation.

Since our Atrium version was older than 7.6.04, we could not install the Atrium CMDB Extension before installing BPPM Server 9.0. However, we have learned that this option can be enabled in BPPM Server later, after the Atrium upgrade, without reinstalling BPPM Server.


