Thursday, September 9, 2010

Error Stack

Most people think of an error stack as the list of errors that is generated when there is an issue.  For example, an ORA-00600 causes a few other errors and gives you a stack of errors to unravel.  In our case, earlier this week we had a stack of errors that caused a minor database outage.  We went live a few weeks ago with Oracle Real Application Clusters (RAC) to support PeopleSoft Financials.  This was a major upgrade from our standalone HP-UX server.  Of course there were several things that we learned before going live, but a couple caught us by surprise.

With the previous architecture, we were using Data Guard and replicating to a hot disaster recovery site.  Shortly after go-live with RAC, we were working on getting this set up again.  I only go into this part because it was one of the things that contributed to our error stack.  In short, the "error stack" is below, and I will explain each piece shortly.

"Error Stack"

1.  Production database lost contact with DR
2.  No monitoring to verify that the DR site was up to date
3.  RMAN backups successful, but not deleting logs because they were needed for DR
4.  Archive log destination filling up
5.  Monitoring of the archive destination was set, but no e-mail or page notifications

Error number one, which would have avoided our production outage had we caught it, was the archive logs not reaching the Data Guard server.  We are still unsure why the production site lost contact with DR; the current theory is an incomplete setup of Data Guard Broker.  We were receiving TNS errors in the alert log, but these were not reported in OEM.
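For what it's worth, the kind of check that would have surfaced this is pretty simple.  The sketch below is only illustrative, not our actual script: it assumes dest_id 2 is the Data Guard destination and that the DBA account can use OS authentication.

#!/bin/sh
# Illustrative check of the Data Guard destination, run on the primary.
# Assumes dest_id 2 is the standby destination and OS authentication works.
sqlplus -s "/ as sysdba" <<EOF
set linesize 200
col destination format a30
col error format a60
select dest_id, status, destination, error
from   v\$archive_dest
where  dest_id = 2;
exit
EOF

# With the broker configured, this shows transport problems as well:
dgmgrl / "show configuration"

Anything other than a VALID status, or a non-empty error column, means the logs are not making it to the standby.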

Before we went live with the RAC environment, we had Data Guard running for our single-instance 10g database and had a script that checked the sequence numbers between production and DR to make sure we were up to date, paging us if we got too far behind.  With the implementation of RAC, that script no longer worked, and we did not get a chance to fix it when we turned on the Data Guard database.  That was error number two; if we had caught it, we would not have had an outage.
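The replacement needs to be RAC-aware, because each instance archives its own redo thread.  Something along the lines of the sketch below, run on the primary, is the idea; the dest_id and the paging threshold are placeholders rather than our production code.

#!/bin/sh
# Sketch of a RAC-aware gap check run on the primary.
# Each RAC instance has its own redo thread, so compare per thread#.
sqlplus -s "/ as sysdba" <<EOF
set linesize 120
select thread#,
       max(sequence#)                                    as last_archived,
       max(case when applied = 'YES' then sequence# end) as last_applied
from   v\$archived_log
where  dest_id = 2
group  by thread#
order  by thread#;
exit
EOF
# A real version would compare last_archived and last_applied for each
# thread and page the on-call DBA once the gap passes a threshold.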

Our RMAN backups are scripted, and we use OEM pretty much only as a job scheduler.  This was set up originally because the company did not have OEM up and running, and when we did get it implemented, we didn't want to rely on it for backup and recovery because we did not have a DR plan that included OEM.  That way we could just run the scripts for backup and recovery in the event of a DR or an OEM outage.  The logs from the scripts are e-mailed to the DBAs, but unless there is a real problem, they come up as successful.  In this case, the backups of the archive logs were successful, but our scripts are also supposed to delete the logs, and that part was failing.  We did not know the delete was failing because the overall script and the backup itself reported success.  RMAN was not deleting the logs because they could not be transmitted to the DR site, and RMAN was trying to protect the standby.
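The scripting fix here is mostly about not trusting the success of the backup alone.  Something like the sketch below, tacked onto the end of the job, would have flagged the failed deletes; the log path and mail alias are made up for the example.

#!/bin/sh
# Scan the RMAN job log for any RMAN- or ORA- errors, even if the
# backup itself completed.  Path and mail alias are placeholders.
LOG=/backup/logs/rman_arch_backup.log

if grep -E "RMAN-|ORA-" "$LOG" > /dev/null; then
    mailx -s "RMAN archive log job reported errors" dba-team@example.com < "$LOG"
fi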

Since the archive logs were not being deleted by RMAN as we expected, our archive log destination filled up.  We still have to figure out exactly what is different between 10g and 11g here; for now I'm assuming this is new behavior, since we didn't really change anything else.  We have since changed our scripts to force delete the archive logs, because we can always restore them from backup if they are needed for DR.
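For reference, the moving parts look roughly like this.  The deletion policy and the FORCE override are standard RMAN syntax, but the settings shown are illustrative rather than a copy of our scripts.

#!/bin/sh
# Illustrative RMAN settings only -- not our exact backup script.
rman target / <<EOF
# In 11g the deletion policy applies to explicit DELETE commands as well,
# so logs that have not shipped to the standby are protected.
CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON STANDBY;

BACKUP ARCHIVELOG ALL;

# FORCE overrides the policy; we can live with that because the logs can
# be restored from backup if the standby still needs them.
DELETE FORCE NOPROMPT ARCHIVELOG ALL BACKED UP 1 TIMES TO DEVICE TYPE DISK;
EOF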

The monitoring piece of the error stack is the one that bugs me the most.  I had set the monitoring up via OEM, and the alert was actually triggered.  The kicker was that I forgot to set the e-mail and pager notifications.  If we had happened to be looking at the alert screen in OEM, we would have seen it.  Obviously, this has been remedied, and we are now receiving e-mail messages and pages.

So, there you go: our version of an error stack.  When unraveling all of this, I kept thinking to myself that catching any one of these would have saved us from the outage.  Obviously a learning experience, but still frustrating to know we missed that many little things that added up to a big one.  Fortunately, we did get notified that the database was unavailable, and the total outage was only about 15 minutes.  I guess it can always be worse!

Hope to see many people at OpenWorld.  One of our PeopleSoft administrators and I are presenting on our PeopleSoft upgrade and RAC implementation, so if you want to chat, look me up.  I will be posting more soon about other plans for OpenWorld.
