in reply to Dirtiest Data

I have a number of Perl scripts which parse insurance claims data.

Some underwriters manage to change their format EVERY time they send us data. The worst seem to be those who send us Excel files, as it is far too easy for them to change things: swapping columns around, changing the column titles, adding rows with sub-totals, ... The variations are endless, and no sooner have I changed a script to handle a "new" format than they change it again or revert to a previous format. The CVS system has saved my sanity more than once!
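
For what it's worth, one way to soften the blow of swapped columns, renamed titles and sub-total rows is to look columns up by title rather than by position. The following is only a rough sketch, not what my scripts actually do: the field names, the alias lists and the sub-total test are all made up for illustration, and it assumes the sheet has already been read into an array of rows.

```perl
use strict;
use warnings;

# Hypothetical canonical fields and the header spellings seen so far.
my %ALIASES = (
    claim_id => [ 'claim id', 'claim no', 'claim number' ],
    insured  => [ 'insured', 'insured name', 'policyholder' ],
    amount   => [ 'amount', 'claim amount', 'paid' ],
);

# Normalise a header cell: lower-case, turn punctuation into spaces, trim.
sub norm {
    my ($s) = @_;
    $s = '' unless defined $s;
    $s = lc $s;
    $s =~ s/[^a-z0-9]+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Build field => column index from a header row; die if a field is missing.
sub map_columns {
    my (@header) = @_;
    my %pos;
    for my $i (0 .. $#header) {
        my $h = norm($header[$i]);
        $pos{$h} = $i unless exists $pos{$h};
    }
    my %col;
    for my $field (keys %ALIASES) {
        my ($hit) = grep { exists $pos{$_} } @{ $ALIASES{$field} };
        die "No column found for '$field'\n" unless defined $hit;
        $col{$field} = $pos{$hit};
    }
    return %col;
}

# Return records as hash refs, skipping blank and sub-total rows.
# The sub-total test is a crude heuristic: any cell containing "total".
sub extract {
    my ($header, @rows) = @_;
    my %col = map_columns(@$header);
    my @out;
    for my $row (@rows) {
        my %rec = map { $_ => $row->[ $col{$_} ] } keys %col;
        next unless defined $rec{claim_id} && $rec{claim_id} =~ /\S/;
        next if grep { defined($_) && /\b(?:sub-?)?total\b/i } @$row;
        push @out, \%rec;
    }
    return @out;
}
```

Called as `my @claims = extract(\@header, @rows);`, this dies loudly when a title it has never seen turns up, so a format change means adding an alias rather than rewriting the parser. It would wrongly skip a genuine row whose text contains the word "total", so the heuristic needs tuning per sender.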


"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Replies are listed 'Best First'.
Re^2: Dirtiest Data
by davidrw (Prior) on Jun 12, 2006 at 16:43 UTC
    Or worse, in Excel: if they try to sort and don't select all the columns, they mix and match data, e.g. everyone's name or address or something gets jumbled with respect to their ID.

    I was given a pile of shi^H^H^Hdata once where the name could be any of the following (sal==Salutation):
        First Last
        Last First
        Last, First
        F Last
        Sal Last
        Sal First Last
        Last, Sal
        Last Sal
    We just printed it and had someone visually match up with the ids.
Re^2: Dirtiest Data
by bassplayer (Monsignor) on Jun 12, 2006 at 20:47 UTC
    Heh heh. That reminds me of a client that my last company had. They would download their data into Excel, make any number of changes, then send us back the file to import. Inevitably the date fields had been mangled by Excel. Their web interface allowed them to make the changes they were making, but somehow they always insisted on doing it the hard way (for us, anyway.)


Re^2: Dirtiest Data
by dorward (Curate) on Jun 13, 2006 at 14:49 UTC

    Sounds a lot like my time working at a .com startup processing inventory files for electronic component traders.

    One company managed to, in one week, provide three versions of their inventory. One in CSV, one in tab-separated form, and one in Excel format. Each had the columns in a different order. Each had different columns. Each had a different number of rows of contact information and notes before the data started. Utterly insane.
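
    For the two text formats, the kind of defensive parsing this calls for might look roughly like the sketch below. It is not what we ran at the time; the sub names and column titles are invented for illustration. It guesses the delimiter by counting tabs against commas, then skips the contact and notes rows by hunting for the first line that contains every expected title. It uses a plain split, so quoted CSV fields with embedded commas are not handled, and the Excel files would need a spreadsheet-reading module first.

```perl
use strict;
use warnings;

# Guess the delimiter by counting tabs versus commas across all lines.
sub guess_delimiter {
    my (@lines) = @_;
    my ($tabs, $commas) = (0, 0);
    for my $line (@lines) {
        $tabs   += ($line =~ tr/\t//);
        $commas += ($line =~ tr/,//);
    }
    return $tabs > $commas ? "\t" : ',';
}

# Split one line on the delimiter and trim each cell.
sub split_line {
    my ($delim, $line) = @_;
    my @cells = split /\Q$delim\E/, $line, -1;
    s/^\s+|\s+$//g for @cells;
    return @cells;
}

# Index of the first line containing every wanted title, or -1.
# Matching is case-insensitive.
sub find_header {
    my ($delim, $wanted, @lines) = @_;
    for my $i (0 .. $#lines) {
        my %seen = map { ( lc($_) => 1 ) } split_line($delim, $lines[$i]);
        return $i unless grep { !$seen{ lc $_ } } @$wanted;
    }
    return -1;
}

# Return one hash ref per data row, keyed by lower-cased column title.
sub parse_inventory {
    my ($wanted, @lines) = @_;
    my $delim = guess_delimiter(@lines);
    my $h     = find_header($delim, $wanted, @lines);
    die "Header row not found\n" if $h < 0;
    my @titles = map { lc $_ } split_line($delim, $lines[$h]);
    my @records;
    for my $line (@lines[ $h + 1 .. $#lines ]) {
        my @cells = split_line($delim, $line);
        next unless grep { length $_ } @cells;    # skip blank lines
        my %rec;
        @rec{@titles} = @cells;
        push @records, \%rec;
    }
    return @records;
}
```

    Because the records are keyed by title, a call such as `parse_inventory([ 'part', 'qty' ], @lines)` does not care what order the columns arrive in or how many preamble rows come first.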

    That .com eventually folded (mostly because it depended on hitting critical mass with people searching and people uploading inventories). In retrospect, a better business plan might have been to write a FOSS inventory management system with a means to share inventories and search other people's inventories using a central site, and then charge for providing the central site and for support for the app. Hmm. I'm drifting, I'll stop now.