Wednesday, May 20, 2009

Monitoring the Monitor

Occasionally we hear about government officials who are supposed to be watching out for our good and instead look out for their own. When this happens, people will sometimes ask, "Who was watching the watchers?" I was reminded of this last week when our Oracle Enterprise Manager Grid Control server lost network connectivity and we were not notified. A very large thunderstorm passed through, and we found out the next morning that we had switched to generator power at least three separate times during the night. Sometime during one of those switchovers, the server's network interface lost its connection.

Of course, we did not realize it was down for about 14 hours. The company uses HP's Network Node Manager, which raises alerts to the on-call help desk person, but for some reason no one called the system administrator to say that the server was unavailable. Not only do we use Grid Control to monitor all of our Oracle and PeopleSoft systems, we also use it as a centralized job scheduler. Once the server was available again, I figured out that we had missed about 150 jobs. No backups, no disaster recovery log maintenance, no nightly application maintenance jobs, no nothing for fourteen hours. The better part of the next day was spent running backups and getting our Data Guard instances caught up, since the archivelog destinations had filled up.

So now what? We considered writing a shell script and using cron on one of the production servers to ping the Grid Control server and send an e-mail if it was not available. Obviously, this is a bit "patchwork," but it should at least tell us if the server is unreachable. Another option would be to set up Network Node Manager to page the system administrator directly if a server goes down. This would be my choice, and I plan to work to get it implemented, but at this time I don't have access to that system. In the meantime, we decided to use our test Grid Control system (yes, we have one) to monitor the production Grid Control server. Of course, this means a separate Agent installed on the production server, but at least we will have visibility into the system should something happen again.
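For what it's worth, the ping-and-e-mail idea could look something like the sketch below. I have written it in Python rather than shell, and it is only an illustration, not what we actually deployed. The host name, mail addresses, and SMTP relay are made-up placeholders, and it assumes a Unix-style `ping` that accepts `-c 1` plus a mail relay listening on the machine that runs the check.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Hypothetical placeholders; none of these values come from our environment.
GRID_HOST = "gridcontrol.example.com"
SMTP_HOST = "localhost"
ALERT_FROM = "monitor@example.com"
ALERT_TO = "dba-oncall@example.com"


def host_is_up(host, attempts=3, run=subprocess.run):
    """Return True if any one of `attempts` single pings gets a reply."""
    for _ in range(attempts):
        # -c 1 sends one echo request (Unix/Linux ping syntax, an assumption).
        result = run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
    return False


def build_alert(host):
    """Build the e-mail that reports the host as unreachable."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {host} is not answering ping"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(
        f"{host} did not answer ping. Monitoring and scheduled jobs "
        "may not be running."
    )
    return msg


def check_and_alert(host, is_up=host_is_up, send=None):
    """Send an alert if the host is down; return True if one was sent."""
    if is_up(host):
        return False
    msg = build_alert(host)
    if send is None:
        # Default delivery: hand the message to the (assumed) local relay.
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)
    else:
        send(msg)
    return True

# To use it from cron, call check_and_alert(GRID_HOST) at the bottom of the file.
```

A crontab entry such as `*/5 * * * * /usr/bin/python3 /path/to/check_grid.py` (the path is hypothetical) would run the check every five minutes. Retrying the ping a few times keeps one dropped packet from paging anyone. Keep in mind that the check only proves the server answers on the network, not that Grid Control itself is healthy, and it has to run on a different machine than the one it watches.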

The question through all of this is, how much is enough? I guess you could ask the same question about watching our government representatives... although I think they need to be watched even more closely!
