Wednesday, December 30, 2009

Three Recoveries, Two Refreshes, One Delete

I feel like it's been a month since my last entry. Oh yeah, that's because it has been. I continue to be amazed with how fast time flies the older I get. Kids are supposed to make you feel younger and as they grow up, I just feel older!

The title of this post describes a full day I had a few weeks ago with a vendor application. Another title could have been "Support That Cannot Be Reached is No Support At All." The application in question has a relatively small database, but a couple hundred users. We actually converted to this centralized solution for the application about six months ago and the business owner has been converting locations two or three per week since.

I came into work to a my voice mail light... never a good sign. Apparently, they attempted to convert a large site the night before and had out grown the tablespace. "No biggie," I said and proceeded to increase the amount of allocated space. "We still can't get into the application and we cannot access any of the sites." This was very peculiar because this application has a different schema for each site and the only common information is stored in a "root" schema that manages some information on which databases are available. The user attempted to remove the affected schema from the application and I dropped it from the database, but this still did not allow the users into the application.

I'm not the most experienced person in the world, but I have dealt with quite a few applications during my career. I have never seen an application that just stopped working because of running out of space and especially not when more space was allocated. We contacted the vendor's support only to find out that their most experienced support analyst was out of town at a conference. The person we did get on the phone helped us check several things in the root schema, but couldn't seem to find anything that looked out of sync. The vendor suggested that we recover the database to prior to the load of the new schema. Knowing that the recovery wouldn't be a problem, I did have to warn them that since they had loaded the schema over twelve hours prior. This of course brings up other issues about notifications, etc, but that is a discussion for another blog posting.

Once we made the decision to go ahead and do the recovery, I looked at the dba_users view to find out when the affected schema had been created. Being the good DBA that I am, I copied the current data files to a backup location and started my RMAN recovery. Everything went perfectly and, as expected, the database came back up without the new schema. I called the application owner and told him to check it out. "It still doesn't work..." CRAP!!!!

This is the point when I start to wonder what is really going on here. It is also the point when the owner mentions that he actually loaded two locations the night before, not just the one. So, that would be recovery number two. I again figured out when the original schema was loaded and recovered to that earlier time. This time, when the owner tested the application, it worked. Okay, so at least we know which location caused the issue.

Users are allowed back into the application at this point and we plan for moving forward. The application owner says that he had tested that site in the test environment, but that he would do it again. When he tested it the second time, he found an error with the source data. Once the data was fixed, the load into test was successful. Somehow, I was talked into trying the load into production again since it worked in the test environment. I should have known better because as soon as he reloaded the site production was broken again. And that would be the explanation for recovery number three...

After the final recovery, we went ahead and copied the most current production data into the test environment. That would be refresh number one and when the data was loaded into the refreshed test environment, it broke it. At least we now had somewhere to try to figure out the issues. One more refresh and we waited to hear from the support person that new the most about the application. This is the part of the whole story that frustrated me the most. Once we had the right support person available, he was able to look and see that there was a data issue with one of the tables in the root schema. Deleted one row after loading the bad data and everything was fine. They deleted the row from the source data, reloaded into test, and everything worked. Being gun shy, we waited until the weekend to load the data into production just in case, but everything worked as expected, and he was able to load both sites without issue.

I learned a couple lessons throughout this ordeal. First, I had a gut feel that there was a data issue and not the tablespace issue that caused the application to not function. Unfortunately, I couldn't find any way to prove that and therefore had to go with the recovery that was being suggested by the vendor. This also leads to making sure the vendor has people available when needed to support issues. Of course, I didn't have anything to do about this because the vendor was chosen by the business. The other "stick to your guns" moment I had was when they wanted to load the data into production again after we thought it had been fixed. I should have insisted on a test refresh at that time to make sure we wouldn't break production again.

On the other hand, there were several good things that came out of this. My recoveries worked pretty well and I proved that I could go back to a point in time if necessary. The even better part of this was that the original data load was before the backup so we had to go back two nights and roll forward. I used to be a scripted backup and recovery person that could take or leave RMAN. After all these recoveries, to multiple points in time, I am an RMAN fan.

Well, I hope everyone had a great Christmas season, or whatever holiday you celebrated. Have a safe, happy, and prosperous new year. More posts soon. I really hope to get to this more often than I have recently. Also, make sure to look for information on Collaborate 10! See you in Vegas!