I’ll be traveling to San Francisco next week with some work colleagues to attend the Web 2.0 Conference. I attended last year (see the ET Archive for coverage) and had a great time. In the midst of our hectic Vignette v7 migration, this will be a welcome break for me and our lead programmer. And, more importantly, a chance for us to brainstorm and gather more ideas about how to push the company further into the present and on into the future. I’ll be covering the conference on this blog. More to come….
Argh! Special characters in our v6 content have caused major headaches for our v7 migration scripts. The v6 content entry system does no validation of what gets entered into the database (other than perhaps requiring a successful insert), so content entry people have put all sorts of strange characters into the system. Our migration scripts can often insert that content into the v7 system without complaint, but v7 can’t publish those fields: the Vignette XML parser throws errors when it tries to turn the content into XML. We’ve had to make our migration scripts much more robust at error-checking the content fields before the data is entered into the v7 system.
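The kind of pre-insert check we ended up needing can be sketched roughly like this – a minimal, hypothetical Python example (our actual migration scripts aren’t shown here) that strips characters the XML 1.0 spec doesn’t allow, which is exactly the class of character that makes an XML parser choke at publish time:

```python
import re

# Characters legal in XML 1.0: tab, newline, carriage return, and the
# printable ranges below. Control characters like NUL or vertical tab
# are NOT legal, and are the sort of thing that blows up publishing.
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def scrub_for_xml(text):
    """Drop characters that are not legal in an XML 1.0 document."""
    return _XML_ILLEGAL.sub("", text)
```

Running a field through something like `scrub_for_xml` before the v7 insert means the parser never sees the garbage – `scrub_for_xml("ad\x00copy")` comes back as `"adcopy"`, while tabs and newlines survive untouched.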
Another problem we’ve run across is all sorts of data entered into fields where it shouldn’t have been, plus the same data entered in multiple ways. For instance, we have a company profile content type with address fields, including city and state. There is no standardization of how users have entered the state… sometimes you get the full name: North Carolina; sometimes the postal abbreviation: NC; sometimes a partial abbreviation: N. Carolina; and sometimes whatever they felt like typing: N. Carol. Granted, a lot of the problem is that these freelancers are copying what is in a print magazine issue, and there is no quality control process (or quality checking in the v6 content entry system).
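Normalizing that mess mostly comes down to a lookup table of the variants we’ve actually seen. Here’s a hypothetical Python sketch (the alias table below is illustrative, not our real one, which would cover all fifty states and the territories):

```python
# Illustrative alias table mapping free-typed variants to postal codes.
# Keys are lowercased with whitespace collapsed.
STATE_ALIASES = {
    "north carolina": "NC",
    "nc": "NC",
    "n. carolina": "NC",
    "n carolina": "NC",
    "n. carol": "NC",
}

def normalize_state(raw):
    """Map a free-typed state value to its postal abbreviation, or None
    if it's something we've never seen – those get flagged for a human."""
    key = " ".join(raw.lower().split())  # collapse whitespace, lowercase
    return STATE_ALIASES.get(key)
```

Anything `normalize_state` returns `None` for goes on a report for manual review rather than straight into v7.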
The migration scripts have had to handle a lot more data cleansing than we had hoped. In our original migration planning, we were going to clean the v6 data fields of any HTML or strange characters and do data cleanup of the states and so on beforehand. However, we didn’t really have time for that extra step given the schedule we were handed. So… we’ll do the best we can and spend a little extra time watching over the migration of each publication.
This week was more exciting than I ever want to have for a while. The v6 servers started acting extremely slow on Monday around noon – they were maxing out the Oracle server at 100% CPU. We tried restarting them several times and finally just had to take them down. One of the web servers had a corrupt hard drive, so we used the downtime to run chkdsk on it. Meanwhile, we put in a support ticket with Vignette to figure out what was causing the Oracle database to max out at 100% CPU.
Using the Oracle logs, we tracked the issue down to the cstld process from the CDS. The CDS was spawning multiple copies of it at startup, many of them inactive, and they were spiking the CPU. Vignette’s first step was to increase the logging on the CDS and all the web servers configured on it. Their initial fix for the cstld process was to increase its timeout so that the CDS wouldn’t leave the inactive processes hanging. We restarted the CDS, but it didn’t help any.
In the log files, Vignette noticed two strange occurrences: 1) a web server that wasn’t responding and 2) an IP address it was trying to connect to that was timing out. The first turned out to be a phantom web server – one that had been removed from IIS but never properly removed from Vignette. At startup, the CDS was trying to connect to it and configure it with information from the CMS, and that turned out to be the cause of the 100% CPU usage on the Oracle box. The second was an IP that used to belong to a Vignette website but was now in use by a non-Vignette site on another server – one that didn’t speak Vignette. I removed both from the system using Vignette’s knowledge base article on phantom websites. After a restart, the websites all came back up on that CDS and Oracle was happy.
Now to repeat the process on the other CDS…
The second CDS had four phantom websites, so I removed them following the same process. After rebooting the server, the websites came back up and Oracle was still happy. Now the question is why this happened on Monday at noon. We had restarted the entire complex the Thursday before with no ill effects. What had caused the system to suddenly have issues with Oracle?
After letting the websites purr for half a day, I gave my team permission to start using the development environment again to fulfill website changes. I told them they could save changes but not to launch until I gave the go-ahead. Meanwhile, I tested launching on my own, and there seemed to be no ill effects on Oracle. I gave the go-ahead and everything seemed fine for a few hours. Then, during a 2 pm meeting, an emergency ticket came in about all the news items on a particular site not working.
I went to investigate, and suddenly two other websites were completely down… giving 404 errors. We had been seeing this earlier in the week but couldn’t track it down. Well, it turns out people on the web team were making changes and not previewing the templates before launch. There would be an error on a template they made live, and they also didn’t check the site to make sure it launched correctly. Further laziness abounded: no versions had been saved, so there was nothing to fall back on. I ended up having to extract the missing templates from the database and painstakingly debug the sites to get them back up. As it stands now, all the websites are working except one site’s archive, which I need to do further debugging on.
After running through many of the sites trying to get them back up, I noticed that no one had been saving versions. That prompted me to remove everyone’s access to development so we can go over proper procedures on Monday morning. If versions had been saved, I could have done a quick restore of the faulty templates and brought the sites back up within a few minutes. Instead, two sites were down for over a day, with lost advertising revenue – plus my lost personal time on nights and evenings. This is a professional publishing company; we need the web designers to act professionally. Quality over quantity.
I do want to give proper credit to the Vignette tech who helped me through the Oracle issues. Sunny was a pleasure to work with and escalated the problem when I needed her to. She walked me through commands and made sure I knew what I had to do. I’m not a Vignette expert and don’t want to be, but she treated me with respect and understanding. Thanks, Sunny.