BPPM (BMC ProactiveNet Performance Management) or TrueSight Operations Management (the rebranded name) suite is the latest solution from BMC Software for enterprise system management. It combines the data analytic engine from ProactiveNet, the event processing engine from BMC Event Manager (BEM), and the server/application monitor from PATROL into one product. This blog is intended to share information and experience on TrueSight/BPPM implementation, customization, and integration.
Monday, December 30, 2013
New Year's Resolutions
2013 has been a significant year. We created our own version of BPPM architecture by modifying what BMC recommended and proved that it works better in the real enterprise IT environment. We standardized our BPPM/BEM extension software at our client site and now they are able to maintain, upgrade, and extend further on their own.
We started moving toward a more scalable business by developing software and training courses in addition to providing consulting services. We signed an agreement with our partner to jointly market our training courses once they become available.
I became much more involved in the BMC online community. By sharing experience with other BMC users, I not only learned lots of technical details, but also gained tremendous insight into what our training courses should focus on.
I started blogging on BPPM implementation in the summer. I want to take this opportunity to thank my readers for their generous feedback and continuous support. My blog was recently included by www.itCentralStation.com, a new review site of enterprise IT products.
In 2014, I have so much to look forward to. We will continue providing consulting services to our clients with our rare expertise in customization and integration. We will complete the standardization of our BPPM extension software and increase our customer base. We will complete the development of our training courses to provide BMC customers a convenient and affordable option to learn practical skills on BPPM implementation based on our field experience instead of textbook theories.
I am so excited for 2014. I wish all of you a happy and prosperous new year.
Monday, December 23, 2013
Merry Christmas from World Opus Technologies
Here is a beautiful Christmas photo from Austin. Enjoy!
Monday, December 16, 2013
BPPM Implementation Considerations - Part 5: Customize at the right place
BPPM suite has three major components: BMC ProactiveNet, BPPM Cell (BEM), and PATROL. Both BPPM Cell and PATROL are more than 10 years old. One of the primary reasons that they are still going strong today is that they both allow you to add your own solutions to them seamlessly.
Before you start developing your own custom solutions, take a step back to think about what options you have and where you should place your customization. What would be the impact on accessibility and resource consumption on the underlying servers? What would be the impact on deployment of your custom solutions? What would be the impact on future maintenance and upgrades?
In PATROL, you can develop custom knowledge modules and you can also plug in your own PSL code as a recovery action into a parameter. In BPPM Cell, you can develop your own event classes, MRL code, dynamic tables, and action scripts to extend the out-of-box knowledge base.
In general, if you have a choice between customizing PATROL and customizing BPPM Cell to manage events, customizing BPPM Cell would require less effort and result in less impact to the servers that are being monitored. Here are a few reasons:
1) PATROL runs on servers that you don't own, have limited access to, and may not be familiar with. For example, I was recently helping a client debug a custom KM running on AS400. I had to get help from the AS400 sysadmin just to add one line in its PSL code.
2) PATROL is often sharing the server with mission critical applications. Poorly written PSL code could potentially impact the mission critical applications negatively.
3) The same custom knowledge module may need to be running on more than one server, thus requiring more time to deploy and upgrade.
4) BPPM Cell runs on your own infrastructure server. Its peer-to-peer architecture scales out readily. If resources ever become an issue, you can add more cells either on the same server or on a different server (even one with a different operating system). You can split a cell horizontally by processing phases, or you can split a cell vertically by event sources.
Monday, December 9, 2013
BPPM Implementation Considerations - Part 4: Monitor the monitors
A good BPPM implementation not only monitors your IT infrastructure, it also monitors each and every BPPM component, including BPPM server, BPPM agent, BPPM cell, PATROL agent, PATROL adapter service/process, SNMP adapter service/process, IIWS service/process, IBRSD service/process, etc. The self-monitoring metrics include component status and connection status.
The events alerting that a BPPM component or a BPPM connection is down are mostly sent to the connected BPPM cell automatically. Some of the self-monitoring events require quick activation. You need to identify those events as they have different event classes and message formats. And you need to notify the right people about those events.
Some components can be monitored in multiple ways, and you just need to pick the one that works best in your environment. For example, when a PATROL agent loses its connection with PATROL Integration Service, you can see an event sent directly from the PATROL agent, another event from PATROL LOG KM if you configured it to monitor the IS connection down log entry, and yet a third event from PATROL Integration Service if you activated it in BPPM GUI.
You may need to reword the message of a self-monitoring event for better readability as some messages are not clear at all. For example, by default, PATROL agent connection down event contains the following slots:
cell='PatrolAgent@server1@172.118.2.12:3181';
msg='Monitored Cell is no longer responding';
You may want to reword the message to look like this:
msg='PatrolAgent@server1@172.118.2.12:3181 is no longer responding';
because it is the PATROL agent that is no longer responding, not the cell.
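The rewording itself is simple string work. In a cell you would do this in your own MRL rule; the Python below is only a sketch of the logic, and the function name and the dictionary representation of the event slots are my own assumptions for illustration.

```python
# Illustrative only: the real rewording belongs in the cell's rule base.
# The dict stands in for the event's slots; the function name is hypothetical.
DEFAULT_MSG = "Monitored Cell is no longer responding"


def reword_heartbeat_msg(slots: dict) -> dict:
    """Return a copy of the slots with the component name placed in msg."""
    reworded = dict(slots)
    # Only touch the out-of-box message, and only when the cell slot is set.
    if slots.get("msg") == DEFAULT_MSG and slots.get("cell"):
        reworded["msg"] = f"{slots['cell']} is no longer responding"
    return reworded


event = {
    "cell": "PatrolAgent@server1@172.118.2.12:3181",
    "msg": "Monitored Cell is no longer responding",
}
print(reword_heartbeat_msg(event)["msg"])
```

Events whose message has already been customized pass through unchanged, so the rewording is safe to apply to every self-monitoring event.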
For the notification method, the most reliable way is local email fired from the cell that receives the self-monitoring events. Since your path to the ticketing system may be down when your BPPM components are experiencing problems, your back-end ticketing system should not be the only way to send notification for your self-monitoring alerts. It should be used in addition to your local email notification.
Monday, December 2, 2013
BPPM Implementation Considerations - Part 3: Achieve the highest ROI through integration
BPPM/BEM cell provides extremely flexible and robust API and adapters to integrate with just about any monitoring software out there. Whether you are running monitoring tools from other commercial vendors such as IBM and Microsoft, or you use open source tools like Nagios, it is fairly straightforward to integrate alerts from these tools into BPPM/BEM cell using either its OS API or SNMP adapter. If you use home-grown scripts, all you need to do is add an API call at the end.
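For a home-grown script, "adding an API call at the end" usually means calling the msend command line. Here is a minimal sketch in Python that only builds the command; the cell name, host name, and message are made-up values, and you should confirm the options against the msend documentation for your version before relying on them.

```python
# Sketch only: builds an msend command line without running it.
# Cell name, slot values, and message are hypothetical examples.
def build_msend_command(cell, event_class, severity, message, slots=None):
    """Return the argument list for sending one event to a cell."""
    cmd = ["msend", "-n", cell, "-a", event_class, "-r", severity, "-m", message]
    if slots:
        # Extra slots are passed as one "name=value;name=value" string.
        # Values containing spaces or semicolons would need quoting.
        slot_text = ";".join(f"{name}={value}" for name, value in slots.items())
        cmd += ["-b", slot_text]
    return cmd


command = build_msend_command(
    cell="cell_prod",
    event_class="EVENT",
    severity="CRITICAL",
    message="Nightly backup script failed",
    slots={"mc_host": "web01"},
)
print(" ".join(command))
```

At the end of your own script you would hand this list to subprocess.run(). Keeping the command construction in one function makes it easier to enforce the same slots across all integrations.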
If your back-end ticketing system is Remedy, the out-of-box 2-way integration (IBRSD) between BPPM/BEM cell and Remedy is more efficient than Remedy gateways for other monitoring tools. It is fairly straightforward to configure two instances of IBRSD as active/active failover, so your chance of waking up at 3am to fight fires is very slim. Since the license of IBRSD is included in the price of BPPM/BEM, you instantly cut down the cost when you stop paying for the Remedy gateway license for other monitoring tools.
Other added benefits include reduced maintenance effort for other monitoring software, less customization in Remedy, consistent ticket information for all monitoring tools, and possible event correlation between events from different monitoring tools. You will also make your NOC team's job easier.
I understand that it is not always easy to convince people who work on other monitoring software to integrate into BPPM/BEM due to organizational silos and technical complexity. It is important to pick the right candidate for the first BPPM/BEM integration. Once the ROI is obvious, people will become more supportive of BPPM/BEM integration. In addition, it is also important to set up a consistent framework for all integration since BMC does not provide a standard for integration. Once you have set up a consistent framework for one-way and two-way integration, your next integration will become much easier.
At one of my past clients, it took our BPPM/BEM team three months to work with the other team to finish our first integration because the integration project had the lowest priority with the other team. Once everyone saw how well the integration worked and how much license fee it saved, our second integration took only four weeks to finish. Subsequently our third integration took only three days to finish.
Monday, November 25, 2013
BPPM Implementation Considerations - Part 2: Keep the total cost of ownership in mind
After talking to several BPPM customers, I noticed that they all have an operations team at least twice the size of the team at my clients just to keep BPPM operations going. What is worse is that their operations team also needs the implementation skill set to constantly patch up the implementation.
Before you even start implementation, consider the following aspects:
1) Scalability: When your environment grows with more servers, more applications, or more integration, will your architecture still work? How easy would it be to split horizontally (based on processing steps) and vertically (based on incoming traffic)?
2) Upgrade: What can you do right now to make future upgrades easier? You may want to consider having a naming convention, saving configuration in a separate repository, and documenting everything consistently.
3) High Availability: High availability not only helps with business continuity, it also keeps your team from constantly fighting fires. You have several options in high availability: application level failover, OS based failover, active/active load balance, or duplication. Which option would best fit your needs for each BPPM component and how much would it cost? For example, a native application level failover might be your best choice for BPPM cells if your business cannot afford to miss a server down alert. But a simple duplication of PATROL 7 console is probably sufficient for you compared to OS based failover, which would cost nearly twice as much.
4) Implementation Repeatability: Do you keep an accurate implementation document so that installation and configuration of each BPPM component is repeatable? You need to implement everything on a test system first and carefully document everything as you go. Production deployment should be a straightforward 'follow the doc' process. It also gives you a perfect opportunity to update the implementation document for anything you have missed.
A common mistake I have seen is to start the implementation directly on a production system. After several months of figuring things out, it finally goes live with many junk files sitting under the implementation directory. Then you realize that you actually need a test system because you won't be able to make and test changes otherwise. Now you don't know how to configure your test system to make it identical to your production system since you have lost track of what made the production system work and what did not.
5) Operations Standardization: Do you have a standard operations procedure document? For example, if a new server is added into your PeopleSoft Payroll application, do you have a document containing the steps for the operations team to add that server to PATROL, BPPM integration service, BPPM cell, BPPM server, BPPM GUI, and automated Remedy ticketing?
Monday, November 18, 2013
BPPM Implementation Considerations - Part 1: Meet your business requirements
"BPPM Implementation Consideration" blog series will try to address several important considerations at requirement level and architecture level. Implementing BPPM is a lot like building a house. Many considerations at requirement level and architecture level are like the foundation of the house. They need to be determined at the very beginning.
The most important consideration in BPPM implementation is your business requirements. The management of your organization, your entire implementation team, and other stakeholders should have a clear understanding on a list of business requirements that your BPPM implementation is expected to meet. Then you will need to translate this list of business requirements into a list of technical requirements with a category assignment such as mandatory, strategic, cost-saver, and nice-to-have.
Only now can you map each technical requirement into a list of detailed BPPM features and prioritize the implementation of each feature. This will become your project scope. Based on your project scope, you can plan your project timeline and budget. If you outsource your BPPM implementation to a consulting company, it is critical that you do your homework on your business requirements and technical requirements first. Then work closely with the architect (not just the project manager) of the consulting company to determine the project scope.
However, many new BPPM customers I have talked to seem to do it backwards. They came up with a budget first without knowing exactly what BPPM features to implement and how long the implementation would take. Then they picked a list of BPPM features to implement from the product datasheet without knowing how each feature relates to their business bottom line.
As an example, here is the process taken at one of my past clients. One of the top business requirements was to cut down the cost on Remedy Gateway licenses from multiple monitoring software vendors. This was translated into a technical requirement like this: Alerts from multiple monitoring software must be integrated into one alert management tool to communicate with Remedy for ticket creation. This requirement was categorized as cost-saver. This technical requirement was mapped into these BPPM features: Event to BPPM cell integration through API and SNMP traps, msend API installation, SNMP trap adapter high-availability implementation, custom BPPM cell MRL rules to process events from multiple vendors, IBRSD high-availability implementation, and event to ticket categorization in BPPM cell. The return was a 6-figure annual license saving year after year with an investment of 5-figure consulting fee. This ROI went straight to help business bottom line.
Monday, November 11, 2013
PATROL LOG KM Examples - Part 5: Parsing script output instead of log file
Normally when you write your own script to collect data, you would need to write a custom KM to parse the result and send out alerts. Although LOG KM doesn't provide the flexibility offered by a custom KM, it saves a tremendous amount of development and maintenance effort compared to writing a custom KM. All features available to parse a log file work the same way when parsing the output of a script.
For example, if you want to check the availability of a website, you would want to write a script to ping the website periodically and get an alert when the website is unreachable. If we use www.bmc.com in our example, your script would look like:
ping www.bmc.com
First save this script in a file C:\scripts\ping_bmc.bat.
In your LOG KM configuration screen, put C:\scripts\ping_bmc.bat as your log file name and 'PING_BMC' as the logical name for the instance. Then select 'Script' as your file type. The default file type is 'Text File'. Please see the screen shot included in 'PATROL LOG KM Examples - Part 2' post for the locations of these selections.
In the 'Default Settings for Search Criteria' section, you have two ways to send alerts to BPPM/BEM cell: 1) Use recovery action to send parsing result as discussed in 'PATROL LOG KM Examples - Part 1' post; or 2) Use 'Custom Event Message' and 'Custom Event Origin' as discussed in 'PATROL LOG KM Examples - Part 2' post.
For this particular example, I found that using option 2) would work better because I can simply put "Unable to reach www.bmc.com." in my 'Custom Event Message' instead of the raw output from the script. I can also put '%APPCLASS%.%FILENAME%.%LOGICALNAME%' as my 'Custom Event Origin'.
In your search criteria configuration screen, use '0% loss' as your search string and check the 'NOT' box next to it because we only want to be alerted when there is a packet loss.
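One thing worth testing in your own environment is how the search string behaves on a total loss. The sketch below assumes plain substring matching against Windows-style ping output, which may not be exactly how LOG KM evaluates the pattern, so treat it as a way to reason about the search string rather than a description of the KM. Under that assumption, '0% loss' is also found inside '100% loss', and anchoring the string with the opening parenthesis avoids that. (If your pattern is treated as a regular expression, the parenthesis would need escaping.)

```python
# Assumption: the search string is matched as a plain substring and the
# 'NOT' box inverts the result. This mimics the idea, not LOG KM itself.
def should_alert(ping_output: str, search_string: str = "0% loss") -> bool:
    """Alert when the search string is NOT found in the script output."""
    return search_string not in ping_output


healthy = "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),"
partial = "Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),"
total = "Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),"
no_host = "Ping request could not find host www.bmc.com."

for label, output in [("healthy", healthy), ("partial", partial),
                      ("total", total), ("no_host", no_host)]:
    print(label, should_alert(output), should_alert(output, "(0% loss"))
```

With the plain '0% loss' string the total-loss case does not alert in this sketch, while '(0% loss' alerts on every failure case and stays quiet on the healthy one.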
When there is a packet loss, or when the script output states "Ping request could not find host www.bmc.com.", you will receive an event in BPPM/BEM cell as follows:
Monday, November 4, 2013
PATROL LOG KM Examples - Part 4: A not so simple case of multiple-line search
For example, suppose you want to capture the following two lines in your log file and send out an alert message like "User: root password will expire in 3 days."
root 21292 c Mon Oct 28 08:00:00 2013
! Your password will expire in 3 days.
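To make the goal concrete, here is the logic in plain Python. This is not the LOG KM 'Multiline Search' configuration; it is only a sketch of what we want the end result to be: match on the second line, then pull the user name from the line before it. The function name and the regular expression are my own.

```python
import re


# Sketch of the desired outcome only, not of how LOG KM implements it.
def password_expiry_alerts(lines):
    """Build one alert per 'password will expire' line, using the line before it."""
    alerts = []
    previous = ""
    for line in lines:
        match = re.search(r"password will expire in (\d+) days", line)
        if match and previous.split():
            user = previous.split()[0]  # first field of the preceding line
            alerts.append(
                f"User: {user} password will expire in {match.group(1)} days."
            )
        previous = line
    return alerts


log = [
    "root 21292 c Mon Oct 28 08:00:00 2013",
    "! Your password will expire in 3 days.",
]
print(password_expiry_alerts(log))
```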
Monday, October 28, 2013
PATROL LOG KM Examples - Part 3: A simple case of multiple-line search
If the additional lines you want to include are after the line that matches the search string pattern, the solution is simple. For example, if you would like to have the following two lines included in your matched log entry:
031605: Error: Disc Full
/hd001 mounted as /opt
You can use 'Disc Full' as your search string pattern. To make the matched log entry contain one additional line after the line that matched the search string pattern, you simply put '2' in 'Number of Lines in Log Entry' field in LOG KM instance configuration screen. (Please see the location of this field from the LOG KM instance configuration screen displayed in 'PATROL LOG KM Examples - Part 2' post.) And you can configure the rest of LOG KM as usual. You can send one alert per polling cycle as described in 'PATROL LOG KM Examples - Part 1' post or send one alert per matched log entry as described in 'PATROL LOG KM Examples - Part 2' post.
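If it helps to picture what the 'Number of Lines in Log Entry' setting does, here is a small Python sketch of the idea, assuming plain substring matching. The function and parameter names are hypothetical and the third log line is made up for the demonstration.

```python
# Sketch of the idea behind 'Number of Lines in Log Entry' (names are mine).
def matched_entries(lines, pattern, lines_per_entry=1):
    """Return each matching line together with the lines that follow it."""
    entries = []
    for index, line in enumerate(lines):
        if pattern in line:
            entry = lines[index:index + lines_per_entry]
            entries.append("\n".join(entry))
    return entries


log = [
    "031605: Error: Disc Full",
    "/hd001 mounted as /opt",
    "031606: Info: backup started",
]
print(matched_entries(log, "Disc Full", lines_per_entry=2))
```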
However, if the additional lines you want to include are before the line that matches the search string pattern, the solution is not so simple. For example, if you would like to have the following two lines included in your matched log entry:
root 21292 c Mon Oct 28 08:00:00 2013
! Your password will expire in 3 days.
Here you would need to use some strings from the second line as your search string pattern because nothing from the first line is unique enough as a search pattern. Then how can we include information from the line before the line that matches the search string pattern? In the next post, I will discuss a solution to this example by using an advanced feature of PATROL LOG KM called 'Multiline Search'. Stay tuned.
Monday, October 21, 2013
PATROL LOG KM Examples - Part 2: Sending one alert per matched log entry
There is another way to configure PATROL LOG KM to send one alert per matched log entry. This option is lesser known, but it is more flexible than sending one alert per polling cycle because you can specify alert severity separately for each string pattern. For example, you can specify severity ALARM for each log entry that matches string pattern "fatal", and specify severity WARNING for each log entry that matches string pattern "retry".
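The per-pattern severity idea looks like this when written as plain Python. It is only a sketch of the behavior, with substring matching and sample log lines that I made up; the actual configuration is done in the LOG KM screens.

```python
# Sketch: one alert per matched log entry, severity chosen by pattern.
# The pattern table mirrors the example in the text; log lines are made up.
SEVERITY_BY_PATTERN = [("fatal", "ALARM"), ("retry", "WARNING")]


def alerts_for(lines):
    """Return (severity, log entry) for every line matching a pattern."""
    alerts = []
    for line in lines:
        for pattern, severity in SEVERITY_BY_PATTERN:
            if pattern in line:
                alerts.append((severity, line))
                break  # first matching pattern decides the severity
    return alerts


sample = [
    "10:01 connection lost, retry in 5s",
    "10:02 heartbeat ok",
    "10:03 fatal: database unreachable",
]
for severity, entry in alerts_for(sample):
    print(severity, entry)
```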
To send one alert per matched log entry, you need to configure "Default Settings for Search Criteria" section as shown in the following example:
Monday, October 14, 2013
PATROL LOG KM Examples - Part 1: Sending one alert per polling cycle
PATROL LOG KM is one of the most commonly used KMs. By design, each LOG KM instance monitors one log file. Two important parameters in LOG KMs are LOGErrorLvl and LOGMatchString. When a string pattern is found in the log file, LOGErrorLvl will go to alarm and the matched log entry will be saved in LOGMatchString. Since you can configure LOG KM to search for multiple string patterns in each log file, all matched log entries are saved together in one LOGMatchString parameter.
If you would like to send all matched log entries as one alert, you can use a recovery action to generate an event and send it to the BPPM/BEM cell. I have seen many examples that use variable '__udefvar__' in pconfig rules. But '__udefvar__' only works with PATROL Notification Server. PATROL Notification Server is optional prior to PATROL agent 9.x and is not needed with PATROL agent 9.x, so if you don't use it, you can use the event_trigger2() PSL call instead. Here is an example pconfig rule set and PSL code.
"/AS/EVENTSPRING/LOGMON/__ANYINST__/LOGErrorLvl/arsCmdType" = { REPLACE = "PSL"},
"/AS/EVENTSPRING/LOGMON/__ANYINST__/LOGErrorLvl/arsCommand" = { REPLACE = "C:\\BMC\\Patrol3\\lib\\psl\\LOGKM_RecoveryAction.psl" }
Monday, September 23, 2013
Monitoring PATROL Agent 9.x Status: Do I need AS_AVAILABILITY KM?
So the answer here is no. You don't need to use AS_AVAILABILITY KM to monitor PATROL agent 9.x status. As you are about to see, BPPM cell heartbeat events are totally automatic with built-in high availability and no PATROL agent configuration. AS_AVAILABILITY KM was developed before BMC acquired BPPM cell, and it is still a great option if you have a 'PATROL only' environment without BPPM/BEM.
To use AS_AVAILABILITY KM, you need to configure the KM by selecting one PATROL agent as the 'pinger' and adding other PATROL agents as 'pingees'. Every time you deploy a new PATROL agent or decommission an existing PATROL agent, you would need to change AS_AVAILABILITY KM configuration. On the other hand, to use BPPM cell heartbeat events, you don't need to go through extra steps to register each PATROL agent with BPPM cell. As long as you set pconfig variable "/EventSetup/Configuration/EventCells" in your PATROL agent 9.x to send PATROL events to a BPPM cell, that BPPM cell will automatically monitor the status of the PATROL agent.
If the 'pinger' in your AS_AVAILABILITY KM goes down, you won't be able to monitor the status of other PATROL agents. To make AS_AVAILABILITY KM more robust, you would have to set up a second 'pinger' and complicated logic to coordinate between these two 'pingers' to avoid duplicated alerts. On the other hand, as long as your BPPM cell is set up for high availability, you don't need to go through extra steps to make BPPM cell heartbeat events highly available. Your PATROL agent status will always be monitored by the active H/A BPPM cell.
To best use BPPM cell heartbeat events, I recommend rewording the event message because the out-of-box message doesn't contain enough information. When a PATROL agent goes down, you would receive an event with out-of-box slots like this:
MC_CELL_HEARTBEAT_FAILURE;
cell='PatrolAgent@server1@172.118.2.12:3181';
msg='Monitored Cell is no longer responding';
...
END
You may want to reword the msg to 'PatrolAgent@server1@172.118.2.12:3181 is no longer responding'. For its reciprocal MC_CELL_HEARTBEAT_ON event, you may want to reword its message in a similar way.
Monday, September 16, 2013
Parameter (Metrics) Thresholds: Do I still need to set them in PATROL?
In theory, if you send all PATROL data to BPPM server, it seems to be a good idea to have all parameter thresholds set on BPPM server only. Imagine how much time you can save for not having to set parameter thresholds in each PATROL agent and how much network bandwidth you can save for not having to send PATROL events to BPPM cells when those thresholds are violated.
In reality, the answer is yes. You still need to set parameter thresholds in each PATROL agent and let PATROL agent (not BPPM server) generate events when absolute thresholds are violated. You still need to send those PATROL events to BPPM cells. Let BPPM server generate intelligent events only and don't set absolute thresholds in BPPM server. The reason here is that not all PATROL data are in BPPM server.
First of all, PATROL agent does not buffer and resend data if it fails to send data to BPPM server the first time. This situation can happen during a brief network outage, such as when a router is being rebooted. If it fails on the first try, the data are lost forever. On the other hand, you may already know that the connection between PATROL agent and BPPM cell is more robust, as PATROL agent does buffer and resend events to BPPM cell with guaranteed delivery. BPPM server and BPPM cell were separately acquired by BMC Software from two different vendors. They use different communication protocols with different levels of robustness.
Second, PATROL agent only sends numerical data to BPPM server, not text data such as text parameter values and annotated data point values. Often those text data are needed as additional information for the events when numerical parameter thresholds are violated. For example, when using PATROL LOG KM, you may need to include information from a text parameter in the event to show the matched string. The only way to include information from a text parameter in events is to let PATROL agent (not BPPM server) generate the events. In addition, some PATROL KMs (e.g. LOG KM with custom events option, older version of Control-M KM, etc.) call event_trigger() to generate events without using parameter thresholds.
Last but not least, PATROL agent sends data to BPPM server every 5 minutes though PATROL agent may collect data more frequently. For example, CPU utilization is collected by PATROL agent every minute. This means that only every 5th value of CPU utilization is sent to BPPM server. Solely relying on absolute thresholds in BPPM server could result in delaying the alerts for 5 minutes or even missing the alerts altogether.
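A quick back-of-the-envelope illustration shows why this matters. The CPU values and the 90% threshold below are made up; the only thing taken from the discussion above is that the agent collects every minute while only every 5th value reaches the server.

```python
# Made-up CPU readings, one per minute; threshold is a hypothetical 90%.
THRESHOLD = 90
cpu_by_minute = [20, 25, 97, 98, 30, 22, 21, 24, 23, 26]


def forwarded_samples(samples, every=5):
    """Keep only every Nth value, the way a 5-minute forwarding cycle would."""
    return samples[every - 1::every]


# What the agent sees with its own one-minute threshold check.
violations_at_agent = [v for v in cpu_by_minute if v > THRESHOLD]

# What a server-side absolute threshold sees after forwarding.
violations_at_server = [v for v in forwarded_samples(cpu_by_minute) if v > THRESHOLD]

print("agent sees:", violations_at_agent)
print("server sees:", violations_at_server)
```

The two-minute spike is caught by the agent but never reaches the server, because neither of the spike values happens to be a 5th sample.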
Monday, September 9, 2013
BMC Impact Integration for PATROL (bii4P): Is it no longer needed?
The simple answer is yes - it is true that bii4P has been eliminated. But bii4P elimination is related to PATROL agent only regardless of the version of BPPM server, BPPM agent, and BPPM cell. Starting from PATROL agent version 9.0, bii4P is no longer required for a PATROL agent to send its events to a cell. The cell can be a BEM 7.x cell, BPPM 8.x cell, or BPPM 9.x cell. If your PATROL agent version is older than 9.0, you still need bii4P even if you are running BPPM cell 9.0. That is why you may still see bii4P in some BPPM 9.0 architecture diagrams.
To send PATROL events from a PATROL agent version 9.x to a BPPM/BEM cell configured for high availability, you need to have the following pconfig variables set: "/EventSetup/Configuration/EventCells", "/EventSetup/Configuration/Format", and "/EventSetup/Configuration/Key". For example:
"/EventSetup/Configuration/EventCells" = { REPLACE = "server1/1828,server2/1828" },
"/EventSetup/Configuration/Format" = { REPLACE = "BiiP3" },
"/EventSetup/Configuration/Key" = { REPLACE = "mc" }
*** where server1 is your primary cell server and server2 is your secondary cell server. If you have a standalone cell, you only need to specify server1/1828. ***
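If you deploy many agents, you may prefer to generate the EventCells rule rather than type it by hand. The helper below is a hypothetical convenience of my own, not a BMC tool; it only formats the same rule text shown above from a list of cell servers.

```python
# Hypothetical helper: formats the EventCells pconfig rule shown above.
def event_cells_rule(servers, port=1828):
    """Return the EventCells rule for a standalone cell or an H/A pair."""
    value = ",".join(f"{server}/{port}" for server in servers)
    return f'"/EventSetup/Configuration/EventCells" = {{ REPLACE = "{value}" }}'


print(event_cells_rule(["server1", "server2"]))  # H/A pair
print(event_cells_rule(["server1"]))             # standalone cell
```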
However, bii4P is still supported in PATROL agent 9.x. If you would like to send PATROL events to multiple cells (e.g. a production cell and a testing cell for troubleshooting purpose), bii4P is still the only option. In addition, both bii4P and PATROLAgent-to-cell direct connection can co-exist for the same PATROL agent.
In the new PATROLAgent-to-cell direct connection, PATROL agent initiates the connection with a cell and pushes events to the cell. PATROL agent does not have the capability to push events to two different cells at the same time.
bii4P is a standalone adapter. There are two versions of bii4P: bii4P3 and bii4P7. bii4P3 connects to PATROL agents directly while bii4P7 connects to PATROL agents through PATROL console server. bii4P3 is more commonly used nowadays due to its more stable connection with PATROL agents. bii4P initiates the connection with PATROL agents at one end to receive events and pushes the received events to a cell at the other end.
To send PATROL events to two cells, you can configure two instances of bii4P, or you can configure PATROLAgent-to-cell direct connection for production cell and configure bii4P for testing cell.
Monday, September 2, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 10: Summary
This architecture varies slightly from BMC's standard recommendation as we keep BPPM cells and BPPM Agents totally separated on different servers. In a real enterprise IT environment where data flow is steady but event flow is unpredictable, our architecture offers better resource utilization, more flexibility, and more robust high availability.
<This architecture diagram has been deleted>
Monday, August 26, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 9: Cell extension and architecture
In our standardized event processing rules, each event goes through the following stages: mapping, conversion, filtering, host/device look-up, action look-up, blackout look-up, aggregation/correlation, update, email notification, ticketing, action, and forwarding. Only at mapping and conversion stages, events from different event sources have their own processing rules. All events share the same processing rules starting from filtering stage. This has allowed us to quickly integrate events from any event source into BEM/BPPM cells in a matter of days or even hours.
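The shape of this design can be sketched in a few lines of Python. Our real rules are written in MRL; the code below only illustrates the structure, and the two event sources, their field names, and the function names are invented for the example. Each source gets its own mapping and conversion step, and everything after that is shared.

```python
# Structure sketch only; the real rules are MRL. Sources and field names
# below are invented for illustration.
SHARED_STAGES = [
    "filtering", "host/device look-up", "action look-up", "blackout look-up",
    "aggregation/correlation", "update", "email notification", "ticketing",
    "action", "forwarding",
]


def normalize_nagios(raw):
    """Source-specific mapping and conversion for a Nagios-style alert."""
    return {"source": "nagios", "host": raw["host_name"], "msg": raw["plugin_output"]}


def normalize_script(raw):
    """Source-specific mapping and conversion for a home-grown script."""
    return {"source": "script", "host": raw["server"], "msg": raw["text"]}


NORMALIZERS = {"nagios": normalize_nagios, "script": normalize_script}


def process(source, raw):
    """Run the source-specific stages, then the stages shared by all sources."""
    event = NORMALIZERS[source](raw)
    event["stages"] = ["mapping", "conversion"]
    for stage in SHARED_STAGES:
        event["stages"].append(stage)  # placeholder for the real stage logic
    return event


first = process("nagios", {"host_name": "web01", "plugin_output": "PING CRITICAL"})
second = process("script", {"server": "db01", "text": "backup failed"})
print(first["stages"] == second["stages"])
```

Adding a new event source means writing one more normalizer and registering it. Nothing in the shared stages has to change, which is why later integrations go so much faster than the first one.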
The advantage of using cell policies is that you don't have to know MRL programming. But policies slow down the cell processing speed a lot. And, most importantly, there is no policy equivalent for execute rule. Since we have already made extensive customization in MRL rules, there is no advantage for us to use policies. We disabled all out-of-box policies. We also enforced a strict naming convention to make our rules easy to support and upgrade. We have about 30 custom rule files supporting over 20 advanced features.
Our cells were architected in three levels. The first level is for look-up. Each event source has its own first level cell so that if one event source is having an event storm it won't affect the events from other event sources. The second level is for update and notification. All event sources share the same second level cell so that events can be correlated easily. The third level is for service impact.
During our migration from BEM 7.4 to BPPM 9.0, we migrated our first-level and second-level cells as BPPM 9.0 remote cells located on their own servers, as I discussed in Part 3. All our custom MRL rules were migrated into the new cells with little change since our customization was kept in separate files. The embedded cell on BPPM server will replace our old third-level service impact cell. Since we did little work for service impact in BEM 7.4, we plan to do a new implementation for service impact once our Atrium upgrade is completed.
The only major change we had to make is the custom GUI display. In BEM 7.4, we made several display templates for administrators, developers, and service desk operators in its Java GUI (BMC Impact Explorer). Since there is no direct migration path from Java GUI to web GUI, we had to re-create all templates in BPPM 9.0 web GUI.
Monday, August 19, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 8: Dual GUI
In BPPM 9.0, a web GUI is used as operations console. Because the web server is located on BPPM server and it takes up to 10 minutes for the secondary BPPM server to resume operation during BPPM server failover, our service desk would experience a total enterprise blackout for up to 10 minutes. This limitation does not meet our business requirement in a hospital environment. It had been holding us back from migrating to BPPM sooner. To overcome this limitation, we had to think outside the box again.
In BEM 7.4, a Java GUI (BMC Impact Explorer) is used as operations console. All cells and login servers are set up in their native application-level failover with no downtime. During the failover, our service desk operators would see the yellow highlight for several seconds before all operations are resumed. We decided to see if we could mix BPPM 9.0 cells with BEM 7.4 login servers and BMC Impact Explorer.
We made no change to BPPM 9.0 configuration on BPPM server, BPPM agents, and BPPM cells. We kept a pair of BEM 7.4 login servers (also called admin servers) on two separate Windows servers. We simply registered all BPPM 9.0 cells with these two BEM 7.4 login servers. Now our service desk operators can continue using the Java GUI (BMC Impact Explorer) to access BPPM 9.0 cells.
During BPPM server failover, the only cell that our service desk operators cannot see for up to 10 minutes is BPPM main cell - which displays intelligent events generated by BPPM server and service impact only. All alerts raised by monitoring tools, all email notifications, and all automated Remedy ticket generation are displayed and processed by remote cells with application-level failover. Our service desk operators can continue seeing all of them during BPPM server failover. Absolutely no downtime and no enterprise blackout! We were so thrilled to see how great the hybrid configuration worked.
For ESM administrators and operations support, we can pick and choose between the BPPM 9.0 web GUI and the BEM 7.4 Java GUI. The BPPM 9.0 web GUI allows us to associate data with events while the BEM 7.4 Java GUI gives us fast access to events and dynamic tables. By keeping both GUIs, not only did we avoid a total enterprise blackout, we were also able to convince everyone to finally migrate from BEM 7.4 to BPPM 9.0.
Monday, August 12, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 7: PATROL
As I mentioned in my previous posts, all our BPPM components are configured for high availability to meet the highest business requirements in a hospital environment. For PATROL data, the high availability of all integration services and BPPM agents is configured through Microsoft Windows Cluster. We put the clustered server name in pconfig variable "/AgentSetup/integration/integrationServices". For PATROL events, the high availability of all BPPM cells is configured through their native application cluster. We put both primary and secondary server names in pconfig variable "/EventSetup/Configuration/EventCells".
After we replaced bii4P3 (PATROL event adapter) with a direct PATROL agent to cell connection using pconfig variables, bii4P3 was no longer needed for PATROL agent 9.0 to send PATROL events to a BPPM cell. However, we still kept bii4P3 running on all our test systems after migration. This allows us to receive PATROL events on both the production BPPM cell and the test BPPM cell at the same time for live troubleshooting when needed, since pconfig variable "/EventSetup/Configuration/EventCells" can only send PATROL events to one cell.
We had to change MRL rules in the BPPM cell to detect PATROL agent down or PATROL agent connection loss events for the direct PATROL agent to cell connection. These events are very different from the events sent through bii4P3. We also had to develop a few rules to capture PATROL agent up or PATROL agent connection up events and match them to PATROL agent down or PATROL agent connection loss events. In addition, we developed similar rules for the PATROL agent connection with the integration service. These infrastructure connection events, along with all other events reported by event sources, are fully integrated with email notification and the Remedy ticketing system at the back end.
We had PATROL Central Console 7.5 (both Windows edition and web edition) as well as PATROL Classic Console 3.5 running before the migration. They still worked well with PATROL agent 9.0 after the migration. We didn't find any need to upgrade them.
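The up/down matching logic is easier to see outside MRL. Below is a minimal Python sketch of the idea, not our actual rules: the class name, the "cell" and "integration_service" link labels, and the event IDs are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConnectionTracker:
    """Pair each 'up' event with the open 'down' event for the same agent and link."""

    # (agent, link) -> id of the open down event; link is a hypothetical label
    # such as "cell" or "integration_service".
    open_down: dict[tuple[str, str], str] = field(default_factory=dict)
    # (down_event_id, up_event_id) pairs that have been matched and closed.
    closed: list[tuple[str, str]] = field(default_factory=list)

    def on_event(self, agent: str, link: str, state: str, event_id: str) -> str | None:
        """Process one event; return the matched down event id for an 'up' event."""
        key = (agent, link)
        if state == "down":
            # Keep the first down event; repeated downs do not open a second one.
            self.open_down.setdefault(key, event_id)
            return None
        if state == "up":
            down_id = self.open_down.pop(key, None)
            if down_id is not None:
                self.closed.append((down_id, event_id))
            return down_id
        raise ValueError(f"unknown state: {state}")


tracker = ConnectionTracker()
tracker.on_event("host1", "cell", "down", "e1")
tracker.on_event("host1", "integration_service", "down", "e2")
print(tracker.on_event("host1", "cell", "up", "e3"))  # matched down event id
print(tracker.open_down)  # the integration service link is still down
```

In the real cell, the matched down event would then be closed and the notification and ticket updated; the sketch only tracks which events belong together.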
Monday, August 5, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 6: BMC event adapter mcxa
While events sent by OS scripts can be logged, buffered, and retried with seamless failover, SNMP traps usually cannot - meaning that the slightest network instability could result in trap loss. Because SNMP trap based event integration is less reliable and more difficult to troubleshoot, we only use it when the monitoring tool does not provide a way to execute OS scripts when alerts are raised. In addition, SNMP trap based event integration requires an adapter while OS script based event integration makes a direct connection to a BPPM cell.
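To show what "buffered and retried with seamless failover" means in practice, here is a minimal Python sketch of a sender that tries a primary destination, falls back to a secondary one, and keeps events in a buffer when neither is reachable. It is only an illustration under my own assumptions: the class, the destination names, and the fake transport are hypothetical and do not represent any BMC tool.

```python
from __future__ import annotations

from collections import deque
from typing import Callable

# A transport takes (destination, event) and reports whether delivery succeeded.
Send = Callable[[str, str], bool]


class BufferedSender:
    """Send events in order, failing over between destinations and buffering on failure."""

    def __init__(self, destinations: list[str], send: Send) -> None:
        self.destinations = list(destinations)  # tried in order: primary first
        self.send = send
        self.buffer: deque[str] = deque()

    def submit(self, event: str) -> int:
        """Queue a new event, then try to drain the buffer."""
        self.buffer.append(event)
        return self.flush()

    def flush(self) -> int:
        """Deliver buffered events oldest first; return how many were delivered."""
        delivered = 0
        while self.buffer:
            event = self.buffer[0]
            # any() stops at the first destination that accepts the event.
            if any(self.send(dest, event) for dest in self.destinations):
                self.buffer.popleft()
                delivered += 1
            else:
                break  # every destination is down; keep events for a later retry
        return delivered


# Fake transport for demonstration: only hosts in `up` accept events.
up: set[str] = set()
log: list[tuple[str, str]] = []


def fake_send(dest: str, event: str) -> bool:
    if dest in up:
        log.append((dest, event))
        return True
    return False


sender = BufferedSender(["cell_primary", "cell_secondary"], fake_send)
sender.submit("disk full")        # both destinations down: event stays buffered
up.add("cell_secondary")
sender.submit("disk ok")          # secondary is up: both events are delivered in order
print(log)
```

A plain SNMP trap has none of this: it is sent once over UDP, and if it does not arrive, the sender never knows.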
Out of 5 non-BMC monitoring tools we have, only one is integrated into a BPPM cell using BMC Event Adapter (mcxa) because it cannot execute an OS script when an alert is raised.
Very little changed in BMC Event Adapter (mcxa) from BEM 7.4 to BPPM 9.0. We first converted the monitoring tool's MIB file to a map file. Then we configured BMC Event Adapter (mcxa). We had to change the default settings for the PollInterval, ReadsPerEngine, and SnmpRcvbuf parameters to maximize the capacity of mcxa in order to accommodate the large volume of incoming SNMP traps. We also had to double the default value of the SnmpTrapLength parameter in order to accommodate the large size of the incoming SNMP traps.
To increase the reliability, we installed two instances of BMC Event Adapter (mcxa) with one instance on each cell server. From the non-BMC monitoring tool, we configured the SNMP traps to be sent to those two mcxa instances simultaneously. This dual-configuration helps to minimize the SNMP trap loss in case of network connection failure. It also helps to address the lack of out-of-box high-availability feature in BMC Event Adapter (mcxa).
For the cell knowledge base, we made a minor change in the auto-generated mcsnmptrapdmib.baroc file so that we could write one rule instead of 50+ rules for all 50+ OIDs. We also added a de-duplication rule to remove the duplicated SNMP traps from the 2nd mcxa instance.
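The de-duplication idea can be sketched in a few lines of Python. This is not our MRL rule, only an illustration: I assume here that two traps are duplicates when they share the same source host, OID, and payload and arrive within a short time window, and the 60-second window is an arbitrary value chosen for the example.

```python
from __future__ import annotations


class TrapDeduplicator:
    """Drop the second copy of a trap that arrives through the other adapter instance."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        # (source_host, oid, payload) -> time the first copy was received
        self.seen: dict[tuple[str, str, str], float] = {}

    def accept(self, source_host: str, oid: str, payload: str, received_at: float) -> bool:
        """Return True for the first copy of a trap, False for a duplicate."""
        # Forget traps that are older than the window.
        self.seen = {
            key: ts for key, ts in self.seen.items() if received_at - ts <= self.window
        }
        key = (source_host, oid, payload)
        if key in self.seen:
            return False
        self.seen[key] = received_at
        return True


dedup = TrapDeduplicator()
print(dedup.accept("tool1", "1.3.6.1.4.1.99999.1", "link down", 0.0))  # first copy
print(dedup.accept("tool1", "1.3.6.1.4.1.99999.1", "link down", 0.5))  # copy via 2nd mcxa
```

One trade-off of this simple approach is that two genuinely separate but identical traps inside the window are also treated as one event, so the window should stay short.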
For the rest of cell knowledge base, we followed our standard procedures to map, convert, filter, correlate, update, define actions, execute actions, send email, and create tickets. In a later post, I will go into more details of our standard procedures in our cell knowledge base that universally apply to events from all event sources.
Monday, July 29, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 5: Remedy integration IBRSD
However, IBRSD is available only as part of the BPPM server installation package in BPPM 9.0, while we needed it as part of the BPPM agent installation package. We contacted BMC support but were told that they were not able to help. They did enter an enhancement request, so hopefully we will see it packaged with the BPPM agent in future releases. Meanwhile we had to come up with a different way to install IBRSD on our cell servers. We copied the entire installation directory of IBRSD from the BPPM server, added a few environment variables, and configured a new IBRSD instance in the copied directory. Fortunately the instances on both cell servers worked well.
So far I have talked about how we architected the BPPM server, BPPM agents, BPPM cells, and IBRSD in our environment for high availability, scalability, and performance. We used Microsoft Windows Clusters for the BPPM server and BPPM agents. We used native application clusters for BPPM cells. We installed the BPPM agent and integration service on the integration service nodes. We installed BPPM cells, BMC Event Adapter, BMC Event Log Adapter, and IBRSD on the cell servers. By keeping BPPM cells completely separated from BPPM agents, not only did we eliminate down time for BPPM cell failover, we also minimized the down time for BPPM agent failover. In addition, this design offers better BPPM cell data protection by having duplicated event repositories. As an added bonus, it costs less since we needed fewer Microsoft Windows Cluster licenses.
Our implementation is somewhat different from what BMC recommends. In various documents and best practice webinars, BMC recommends co-locating BPPM agents and BPPM cells on the same server and using disk-level OS clusters to achieve high availability. Had we gone with that recommendation, we would have experienced not only longer down time for failover, but also an increased risk that another cell may fail on the secondary node.
Here are the lessons learned so far: To realize the highest ROI on a BMC Software investment, business requirements should drive technical design. It is important to evaluate all options through due diligence. Performing due diligence does require support from the management of the organization and a systematic approach to test and verify the proposed model. Sometimes we need to think outside the box, as shown in the IBRSD example.
Monday, July 22, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 4: BPPM agent and integration service
For high availability, we decided on a disk-level OS cluster for the BPPM Portal agent as we could not find another option. But for the BPPM PATROL integration service and its BPPM agent, there is another option with active/active failover. We were excited when we first learned that the integration service supports active/active failover because it looked similar to the architecture of BMC Portal with active/active RSM failover. However, upon close examination we realized that the major difference is that the BPPM integration service and its BPPM agent have no concept of a cluster, while the BMC Portal App server treats the active/active RSM pair as one cluster. Portal data from the same metric and the same instance will be stored in the same database table regardless of which RSM was used as the "middleman". However, when PATROL data are sent to the 2nd BPPM integration service because the 1st integration service is unreachable, the data will not be stored in the same table in the BPPM database as the data sent through the 1st integration service - and thus will not be displayed in the same graph.
We used the same scoring system as described in the last post (Part 3) to compare disk-level OS cluster and active/active failover for BPPM Integration Service. Our comparison result showed that disk-level OS cluster scored 27 points (partial yes to #1, and yes to #2 and #3) while active/active failover scored 22 points (yes to #1, #3 and #6). Therefore, we decided to use Microsoft Windows Cluster for all integration service nodes where BPPM agents are installed.
Because all BPPM cells are installed on separate servers as I mentioned in my last post (Part 3), only the BPPM agent and integration service are running on integration service nodes. We disabled all event related components such as cells, event adapter, and event log adapter on integration service nodes because they are already running on separate servers (we refer to them as cell servers). In Microsoft Windows Cluster, all services within the cluster must be included in the failover group. When one service fails, the entire group must be moved to the secondary node. The more services are included in the failover group, the longer it takes to move the entire group and the higher the risk that another service may fail on the secondary node. Disk-level OS clusters such as Microsoft Windows Cluster always involve some down time during failover. However, we managed to limit the down time to under 5 minutes by minimizing the services in the failover group to include only the BPPM agent and Integration Service.
In the next couple of posts, I will go through the configuration details on those event-related components.
Monday, July 15, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 3: BPPM cell
In our BEM 7.4 environment, we have 9 pairs of cells running in high availability as application clusters with 7 cells on Windows 2003 servers and 2 cells on Linux RHEL 5.5 64-Bit servers. We were happy with the configuration as we never experienced down time even during the BEM upgrade from 7.2 to 7.4. In BPPM 9.0, since BMC suggested a disk-level OS cluster for BPPM cells, we decided to do a side-by-side comparison between application cluster and OS cluster. We used a 10-point scoring system for the following 4 criteria: 1) Can the (cell) pair fail over with no down time; 2) Is the (cell) pair a cluster (treated as one by its consumers); 3) Can the sender automatically switch to the 2nd destination when the 1st destination is unreachable; 4) Can the sender buffer the content and resend if the destination is unreachable. Each yes is 10 points and each no is 0 points. A partial yes gets a score between 1 and 9. We also added 3 bonus points for 5) automatically backing up data storage; and 2 bonus points for 6) lower hardware and OS cost.
Our comparison result showed that cell application cluster option scored 45 points (yes to all 6) while OS cluster option scored 37 points (partial yes to #1 and yes to #2, #3, and #4). So we kept the same high availability configuration as in BEM 7.4. We installed our Windows cells on Windows 2008 servers with 'cell only' option, not sharing the servers with BPPM Agent and Integration Service. We decided to delay Linux cell migration to the next phase to minimize the involvement of another organization. Our test has shown that BEM 7.4 cells can integrate well with BPPM 9.0 cells since very little has been changed in BEM cell features and architecture.
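For readers who want to check the arithmetic, here is the scoring written out as a small Python sketch. The function name is my own, and the 7 points for the partial yes on criterion #1 is inferred from the 37-point total (37 minus three full yeses of 10 points each), not a separately recorded figure.

```python
from __future__ import annotations

# Maximum points per criterion: #1-#4 are the 10-point criteria,
# #5 (automatic data backup) and #6 (lower cost) are the bonus criteria.
MAX_POINTS = {1: 10, 2: 10, 3: 10, 4: 10, 5: 3, 6: 2}


def total_score(points: dict[int, int]) -> int:
    """Add up the points awarded per criterion; criteria not listed score 0."""
    for criterion, value in points.items():
        if not 0 <= value <= MAX_POINTS[criterion]:
            raise ValueError(f"criterion {criterion}: {value} is out of range")
    return sum(points.values())


# Cell application cluster: yes to all 6 criteria.
application_cluster = dict(MAX_POINTS)

# OS cluster: partial yes to #1 (7 points, inferred), yes to #2, #3, and #4.
os_cluster = {1: 7, 2: 10, 3: 10, 4: 10}

print(total_score(application_cluster))  # 45
print(total_score(os_cluster))           # 37
```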
To take advantage of BPPM Server's analytic features, we added another pair of external BPPM cells for BPPM internal events since we wanted all events to be processed in external cells first. By default, all BPPM internal events are sent to the embedded cell on BPPM server. After BPPM 9.0.20, we were able to make a configuration change in pronet.conf on BPPM server so that all internal events are sent to an external cell.
Monday, July 8, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 2: BPPM server
For high availability, we decided to use Microsoft Windows Cluster as recommended by BMC. We were fully aware that it could take up to 10 minutes for the secondary server to resume operation after the primary server failed - meaning no data collection, no service impact updating, and no GUI access for up to 10 minutes. But we could not find a better option for BPPM server high availability as BMC still does not support application-level failover for BPPM Server.
Data in the BPPM server are sampled every 5 minutes even though some data could be collected more frequently at the data source (e.g. PATROL agent). Unlike events, when the destination is unreachable, data are simply thrown away, not buffered. We would miss 2-3 data points during BPPM server failover. We realized that this is something we can compromise on because trended data are not as critical as availability events in a hospital environment.
Similarly, we can accept having no service impact updating for 10 minutes as long as the lower-level critical events can still generate emails and incident tickets. This means that we will only use the embedded SIM cell for emails and incident tickets related to service models. All raw events will be sent to remote cells first for BEM level emails, tickets, and actions. They will then be forwarded to this embedded SIM cell for service impact analysis and probable cause analysis. In a later post, I will go through the details on how we configured high availability for remote cells.
Unfortunately the lack of GUI access for 10 minutes is not something we can compromise on. 10 minutes could mean the difference between life and death in a hospital environment. A 10-minute enterprise visibility blackout is simply not an option. In a later post, I will talk about how we have overcome this limitation.
Since our Atrium version was older than 7.6.04, we could not install Atrium CMDB Extension before installing BPPM Server 9.0. But we have learned that we can enable this option in BPPM Server later after Atrium upgrade without re-installing BPPM Server.
Monday, July 1, 2013
Lessons Learned from Migrating BEM 7.4 to BPPM 9.0 - Part 1: Background
We run 9 pairs of BEM cells, 2 instances of IBRSD, 2 instances of bii4P3, 2 instances of IIWS, and 2 instances of BMC Event Adapters to process 12,000 events from 7 different monitoring tools and generate 800 automated Remedy tickets per day. I refer to our architecture as a 'cell cloud' because this robust and flexible event processing service is hosted by servers located in different data centers, on different operating systems, and for a while even based on different versions of BEM releases. Every component in the cloud is configured for seamless high availability at the application level, and all events sent to the cloud are buffered, with no down time and no transaction loss, to meet the highest business requirements of hospitals. Our event processing is based on 'cell extension' technology, in which I made extensive customizations to the out-of-box cell knowledge base. By eliminating policies and standardizing event processing with dynamic data tables, our BEM implementation is powerful, flexible, and easy to maintain.
Being a large BMC customer in the healthcare industry, we have been encouraged by BMC to migrate to BPPM. We were constantly invited by BMC to attend BPPM briefings, roadmaps, demos, webinars, and Q&A sessions. Prior to BPPM 9.0, we participated in extensive evaluations of both BPPM 8.1 and BPPM 8.5. We gave BMC extensive feedback on the limitations in BPPM that had been holding us back from migrating to BPPM.
When we finally made the decision to migrate our BEM 7.4 to BPPM 9.0, our primary objective was to preserve all the scalability, performance, flexibility, and high availability in BEM 7.4. We are so proud that our 'cell cloud' technology survived an emergency data center failover during Hurricane Sandy with no down time. We didn't want to compromise any of these capabilities when upgrading to BPPM 9.0.
In the next few posts, I will share my experience and the lessons learned from migrating BEM 7.4 to BPPM 9.0. Your comments are greatly appreciated.