Argh! Special characters in our v6 content have caused major headaches for our v7 migration scripts. Our v6 content entry system does no checking of the content entered into the database (other than perhaps confirming a successful insert), and content entry people have entered all sorts of strange characters into the system. Our migration scripts are often able to enter the content into the v7 system successfully, but the v7 system can’t handle publishing those fields as XML: the Vignette XML parser throws errors when it goes to publish the content. We have had to make our migration scripts much more robust at checking the content fields before the data is entered into the v7 system.
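To give a flavor of the kind of check involved, here is a minimal Python sketch. It is not our actual migration code. It assumes the publishing failures come from characters that XML 1.0 does not allow (mostly stray control characters), which may not be the only thing the Vignette parser objects to. The function names and the sample field value are made up for illustration.

```python
import re
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production (mostly control characters).
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_chars(text):
    """Return (position, character) pairs that XML 1.0 does not allow."""
    return [(m.start(), m.group()) for m in _INVALID_XML_CHARS.finditer(text)]

def clean_for_xml(text):
    """Drop XML-illegal characters and return the cleaned text."""
    return _INVALID_XML_CHARS.sub("", text)

def publishes_as_xml(text):
    """Check that the field parses once wrapped in an element (stock parser, not Vignette)."""
    try:
        # escape() handles &, < and >, so only illegal characters can fail here.
        ET.fromstring("<field>" + escape(text) + "</field>")
        return True
    except ET.ParseError:
        return False

# Hypothetical v6 field value containing two stray control characters.
raw_field = "Acme\x0b Widgets & Co.\x1f"
problems = find_invalid_xml_chars(raw_field)
cleaned = clean_for_xml(raw_field)
```

Logging the output of `find_invalid_xml_chars` before cleaning keeps a record of what was removed from each field, which is handy when someone later asks why a profile looks different in v7.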
Another problem we’ve run across is all sorts of data entered into fields where it does not belong, plus the same data being entered in multiple ways. For instance, we have a company profile content type, which has address fields, including city and state. There is no standardization of how users have entered the state… sometimes you get the full name: North Carolina, sometimes the postal abbreviation: NC, sometimes a partial abbreviation: N. Carolina, and sometimes whatever they felt like typing: N. Carol. Granted, a lot of the problem is that these freelancers are copying what is in a print magazine issue, and there is no quality control process (or quality checking in the v6 content entry system).
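The state cleanup could look something like the Python sketch below. Again, this is an illustration rather than our real script: the lookup tables are a tiny hypothetical subset, and the variants table would have to be built by hand from whatever actually turns up in the data. Anything it cannot match comes back as `None` so a person can review it instead of the script guessing.

```python
import re

# Small illustrative subset; a real table would cover every state and territory.
STATE_CODES = {
    "north carolina": "NC",
    "south carolina": "SC",
    "new york": "NY",
}

# Hand-maintained variants seen in the data (hypothetical examples).
KNOWN_VARIANTS = {
    "n carolina": "NC",
    "n carol": "NC",
}

def normalize_state(raw):
    """Return a two-letter postal code, or None if the value needs human review."""
    key = re.sub(r"[^a-z ]", "", raw.strip().lower())  # drop periods, digits, etc.
    key = re.sub(r"\s+", " ", key).strip()             # collapse repeated spaces
    if len(key) == 2 and key.upper() in STATE_CODES.values():
        return key.upper()                             # already a postal code
    if key in STATE_CODES:
        return STATE_CODES[key]                        # full state name
    return KNOWN_VARIANTS.get(key)                     # known oddity, else None

for value in ["North Carolina", "NC", "N. Carolina", "N. Carol", "Narnia"]:
    print(repr(value), "->", normalize_state(value))
```

Keeping the variants in a plain dictionary means a new oddity found during a publication's migration can be added in one line without touching the matching logic.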
The migration scripts have had to handle a lot more data cleansing than we had hoped. In our original migration project planning, we were going to clean the v6 data fields of any HTML or strange characters and do data cleanup of the states, etc. However, we didn’t really have time for that extra step given the schedule we have been handed. So… we’ll do the best we can, and spend a little extra time watching over the migration of each publication.