Monday, July 22, 2013

Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 4: BPPM agent and integration service

Based on BMC's sizing guideline of 250K attributes per BPPM agent, we installed two BPPM agents on two Windows 2008 servers: one for Portal data and the other for PATROL data. As we continue adding PATROL agents, we will add another BPPM agent on a third server once the number of PATROL agents and attributes exceeds the current capacity.
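As a rough illustration of the sizing arithmetic, here is a minimal Python sketch. Only the 250K-attributes-per-agent figure comes from the BMC guideline mentioned above; the function names and any attribute counts you pass in are hypothetical and are not part of any BPPM tool.

```python
# Sizing sketch. The only number taken from this post is the BMC guideline
# of 250K attributes per BPPM agent; everything else is illustrative.
ATTRIBUTES_PER_AGENT = 250_000


def agents_needed(total_attributes: int, per_agent: int = ATTRIBUTES_PER_AGENT) -> int:
    """Return the minimum number of BPPM agents for a given attribute count."""
    if total_attributes < 0:
        raise ValueError("total_attributes must not be negative")
    if per_agent <= 0:
        raise ValueError("per_agent must be positive")
    # Integer ceiling division: round up to the next whole agent.
    return (total_attributes + per_agent - 1) // per_agent


def remaining_headroom(current_attributes: int, agents: int,
                       per_agent: int = ATTRIBUTES_PER_AGENT) -> int:
    """Return how many more attributes fit before another agent is needed.

    A negative result means the current agents are already over the guideline.
    """
    return agents * per_agent - current_attributes
```

For example, `agents_needed(250_001)` returns 2, which is the point at which we would bring up another BPPM agent for PATROL data, and `remaining_headroom(200_000, 1)` returns 50,000 attributes of spare capacity on a single agent.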

For high availability, we decided on a disk-level OS cluster for the BPPM Portal agent because we could not find another option. For the BPPM PATROL integration service and its BPPM agent, however, there is another option: active/active failover. We were excited when we first learned that the integration service supports active/active failover, because it looked similar to the BMC Portal architecture with active/active RSM failover. On closer examination, however, we realized there is a major difference: the BPPM integration service and its BPPM agent have no concept of a cluster, whereas the BMC Portal app server treats the active/active RSM pair as one cluster. Portal data for the same metric and the same instance is stored in the same database table regardless of which RSM acted as the "middleman". By contrast, when PATROL data is sent to the second BPPM integration service because the first one is unreachable, it is not stored in the same BPPM database table as the data sent through the first integration service, and therefore it is not displayed in the same graph.

We used the same scoring system described in the last post (Part 3) to compare a disk-level OS cluster with active/active failover for the BPPM integration service. The comparison showed that the disk-level OS cluster scored 27 points (partial yes to #1, and yes to #2 and #3), while active/active failover scored 22 points (yes to #1, #3 and #6). Therefore, we decided to use Microsoft Windows Cluster for all integration service nodes where BPPM agents are installed.

Because all BPPM cells are installed on separate servers, as I mentioned in my last post (Part 3), only the BPPM agent and the integration service run on the integration service nodes. We disabled all event-related components, such as cells, the event adapter, and the event log adapter, on the integration service nodes because they are already running on separate servers (we refer to them as cell servers). In Microsoft Windows Cluster, all services within the cluster must be included in the failover group. When one service fails, the entire group must be moved to the secondary node. The more services the failover group includes, the longer it takes to move the entire group and the higher the risk that another service will fail on the secondary node. Disk-level OS clusters such as Microsoft Windows Cluster always involve some downtime during failover. However, we managed to limit the downtime to under 5 minutes by keeping the failover group to just the BPPM agent and the integration service.

In the next couple of posts, I will go through the configuration details of those event-related components.

