Vignette v6 Down!

This week was more exciting than I want any week to be for a while.  The v6 servers started running extremely slowly on Monday around noon: they were maxing out the Oracle server at 100% CPU.  We tried restarting them several times, and finally we just had to take them down.  One of the webservers had a corrupt hard drive, so we used the downtime to run chkdsk on it.  Meanwhile, we put in a support ticket with Vignette to figure out what was causing the Oracle database to max out the CPU.

Using the Oracle logs, we tracked the issue down to the cstld processes coming from the CDS.  The CDS was spawning several of them at startup, many of which were inactive, and they were spiking the CPU.  Vignette’s first step was to increase the logging on the CDS and all the webservers configured on it.  Their initial response to the Oracle cstld processes was to increase the cstld timeout so that the CDS wouldn’t leave the inactive processes hanging.  We restarted the CDS, but it didn’t help.

In the log files, Vignette noticed two strange occurrences: 1) a web server that wasn’t responding and 2) an IP address the CDS was trying to connect to that was timing out.  The first one turned out to be a phantom web server, one that had been removed from IIS but never removed from Vignette.  At startup, the CDS was trying to connect to it and configure it with information from the CMS.  This ended up being the cause of the 100% CPU usage on the Oracle box.  The second one was an IP that used to belong to a Vignette website but was now in use by a non-Vignette website on another server, which didn’t speak Vignette.  I removed both from the system using Vignette’s knowledge base article about phantom websites.  After a restart, the websites all came back up on that CDS and Oracle was happy.
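
For what it’s worth, spotting phantoms boils down to comparing two lists.  Here is a minimal sketch in Python, assuming you can export the web servers registered in Vignette and the sites actually defined in IIS as plain lists of names.  The function name and the sample hostnames are made up for illustration; this is not a Vignette tool or API, and it is not the procedure from the knowledge base article.

```python
def find_phantoms(registered_in_vignette, defined_in_iis):
    """Return names Vignette still knows about that no longer exist in IIS."""
    # Compare case-insensitively, since host names are not case-sensitive.
    live = {name.strip().lower() for name in defined_in_iis}
    return sorted(
        name for name in registered_in_vignette
        if name.strip().lower() not in live
    )


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real system.
    vignette = ["news.example.com", "old-site.example.com", "shop.example.com"]
    iis = ["News.example.com", "shop.example.com"]
    for name in find_phantoms(vignette, iis):
        print("phantom:", name)
```

Anything this turns up would still need to be removed by following the vendor’s own instructions; the sketch only tells you where to look.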

Now to repeat the process on the other CDS…

The second CDS had 4 phantom websites, so I removed them following the same process.  After rebooting the server, the websites came back up and Oracle was still happy.  Now the question is why this happened on Monday at noon.  We had restarted the entire complex the Thursday before with no ill effects.  What had caused the system to suddenly have issues with Oracle?

After letting the websites purr for half a day, I gave my team permission to start using the development environment again to fulfil website changes.  I told them they could save changes but not launch until I gave the go-ahead.  Meanwhile, I was testing launches on my own and there seemed to be no ill effects on Oracle.  I gave the go-ahead and everything seemed fine for a few hours.  I went to a 2 pm meeting, but during it an emergency ticket came in saying that all the news items on a particular site were not working.

I went to investigate and suddenly two other websites were completely down, giving 404 errors.  We had been seeing this earlier in the week but couldn’t track it down.  Well, it turns out people on the web team were making changes and not previewing the templates before launch.  They would make a template live with an error on it, and they also didn’t check the site to make sure it launched correctly.  Further laziness abounded in that no versions were saved, so there was nothing to fall back on.  I ended up having to extract missing templates from the database and painstakingly debug the sites to get them back up.  As it stands now, all the websites are working except one site’s archive, which needs further debugging.

After running through many of the sites trying to get them back up, I noticed that no one had been saving versions.  This prompted me to remove everyone’s access to development so we can go over proper procedures on Monday morning.  If versions had been saved, I could have done a quick restore of the faulty templates and brought the sites back up within a few minutes.  Instead, two sites were down for over a day, with lost revenue from their advertising, plus my lost personal time on nights and evenings.  This is a professional publishing company, and we need the web designers to act professionally.  Quality over quantity.

I do want to give proper credit to the Vignette tech who helped me through the Oracle issues.  Sunny was a pleasure to work with and took steps to escalate the problem when I needed her to.  She walked me through commands and made sure I knew what I had to do.  I’m not a Vignette expert and don’t want to be, but she treated me with respect and understanding.  Thanks, Sunny.