
Dirtiest Data

by bassplayer (Monsignor)
on Jun 12, 2006 at 15:46 UTC

Greetings Monks,

I was given a horrible data file last week: one 'delimited' by variable numbers of spaces, and full of strange characters that choked the API I was sending the data to. There were plenty of little exceptions as well; some rows had fields delimited by a single space, and I had to guess how to split full names, many with more than two parts, into first and last. Anyway, I was sure happy to have Perl along with me. This got me thinking about what other data horror stories might be out there. Surely other monks have tales that will put mine to shame (and make me feel better).

So, what's your dirtiest data experience? Please describe any cool Perl hacks you used to rise to the challenge.
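For what it's worth, a minimal sketch of the kind of splitting involved. The surname-last heuristic is only a guess, as the post says, and the field layout here is invented:

```perl
use strict;
use warnings;

# Split a line on runs of whitespace. Leading whitespace would
# otherwise produce an empty first field, so trim both ends first.
sub parse_line {
    my ($line) = @_;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    return split /\s+/, $line;
}

# Crude full-name heuristic (an assumption, not a rule): treat the
# final word as the surname and everything before it as given names.
sub split_name {
    my (@parts) = @_;
    my $last  = pop @parts;
    my $first = join ' ', @parts;
    return ($first, $last);
}
```

For example, parse_line("  Smith   John Q   42  ") yields ('Smith', 'John', 'Q', '42'), and split_name('John', 'Quincy', 'Smith') yields ('John Quincy', 'Smith').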


Replies are listed 'Best First'.
Re: Dirtiest Data
by ptum (Priest) on Jun 12, 2006 at 21:22 UTC

    I think that nearly every time I have agreed to parse a variably-formatted data file, I have regretted it. My current practice for handling real-world problems like this is as follows:

    Customer: I have this data file from system X that I need to provision into system Y.

    Me: Does it have a standardized format that every record follows, either fixed-length columns, XML, or some kind of delimited columns?

    Customer: (lying through teeth) Yes, of course it does!

    Some time passes while I discover his web of lies ...

    Me: You weaselly liar! Your data is just a mish-mash of arbitrary values in a variety of formats! You expect me to make sense of this?

    Customer: Well, it actually can come in these N variations ... except when it doesn't.

    Me: OK. Tell you what. I'll write a script that will provision your data into system Y assuming it passes my validation routine. I'll write a validation routine separately, and any records that fail the validation will be displayed on a handy web report here (I specify some internal website).

    Customer: OK, sounds great.

    Some time passes and the customer (who never checked the web report) eventually discovers that only 5% of their data passes the validation and that they have no control over system X's output.

    Customer: Hey, none of my data is being provisioned! The developer for system X says it will be 6-8 years before he can change his output. Can't you loosen up your validation routine so my data gets provisioned?

    Me: No, because that would cause system Y to fail in a variety of ways, and would simply pass the buck.

    Customer: Waaaah!

    Having pity, I quietly find out where system X stores its data and build a script to acquire it in a standardized format. I provision the data with minimal validation problems and everyone is happy, except the programmer for system X, who is ultimately laid off when it is discovered that his system is non-essential and he is non-responsive. The end.

    Seriously, it often simply exacerbates the problem when we use Perl's power to cover up and pander to sloppy upstream programming or improper data input. Getting your data clean is often a 'pay me now or pay me later' situation ... and the cost when you pay later can be astronomical.

    Just an opinion, take it or leave it. :)

      That is a classic, classic post.

      I am particularly fond of the approach of "find out the DB the data is coming from and go get it from there."

      Computers are so much easier to talk to than business people.

      Nothing is too wonderful to be true
      -- Michael Faraday

Re: Dirtiest Data
by Joost (Canon) on Jun 12, 2006 at 16:13 UTC
    Sounds pretty awful :-)

    I did a project some years ago that involved migrating a crusty old perl system to a new, well-structured, database-driven CMS. The old system's source data (tens of thousands of cross-referenced entries over multiple category systems) was supposed to be stored in XML. The new system could import XML data, but of course not in the original format.

    The source data was delivered as about 1000 "XML" files in about 20 different "schemas" (of course, no schema information was available at all). Hardly any of the files were valid XML, so the first order of business was to fix <these kind="of < constructs > &">. That took about two days, since we had to preserve as much data as possible, so certain things just had to be done by hand.
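A hedged sketch of the mechanical part of that cleanup: escaping bare ampersands, then angle brackets inside double-quoted attribute values. Anything worse than this really did need hand repair:

```perl
use strict;
use warnings;

# Escape '&' characters that aren't already the start of an entity.
sub fix_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;
    return $xml;
}

# Escape '<' and '>' inside double-quoted attribute values.
# Assumes the quotes themselves are balanced.
sub fix_attr_angles {
    my ($xml) = @_;
    $xml =~ s{="([^"]*)"}{
        my $v = $1;
        $v =~ s/</&lt;/g;
        $v =~ s/>/&gt;/g;
        qq{="$v"};
    }ge;
    return $xml;
}

my $dirty = '<these kind="of < constructs > &">';
my $clean = fix_attr_angles(fix_ampersands($dirty));
# $clean is now: <these kind="of &lt; constructs &gt; &amp;">
```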

    Then I spent the following two weeks trying to figure out the meaning of things like

    <entry someref="::1::4"/>
    <entry someref="namedref"/>
    <entry someref="4::name"/>
    The references could be to more or less anywhere else in any other file. Oh, and then I had to think up a way to convert them into our nice new model, and/or change the new model. Fun times.

    Oh and don't get me started on people who mix UTF-8 and Latin-1 in the same file. :-)
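For the UTF-8/Latin-1 mix, one common trick is to decode line by line: try strict UTF-8 first, and fall back to Latin-1 on failure. A sketch, assuming each line is at least internally consistent (which it may well not be):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Try strict UTF-8 first; if that croaks, assume Latin-1
# (every byte string is valid Latin-1, so the fallback can't fail).
sub sniff_decode {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK flag may alter its input
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}
```

Both sniff_decode("caf\xc3\xa9") (UTF-8) and sniff_decode("caf\xe9") (Latin-1) come out as the same five-character string.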

Re: Dirtiest Data
by wfsp (Abbot) on Jun 12, 2006 at 19:03 UTC
    Too late to the party.

    Someone that "knows about computers", who we'll call "hacker", speaking to a graphic designer.

    hacker: I've been asked to create a database for the bookshop. You did the book list didn't you?

    designer: Yes, that's right.

    hacker: How did you get the data?

    designer: I typed it in.

    hacker: Into what?

    designer: Quark.

    hacker: Can you export from Quark into a spreadsheet, a csv file?

    designer: Ummm. I can make text files, but each book has its own text box and it might not work very well. And the columns come out weird.

    He was damned right about that. It took three days to type it all out again.

    Some of my best friends are graphic designers. :-)

      Some of my best friends are graphic designers. :-)

      Right, some of my ex-friends are graphics designers, too... oh, that's not what you said...

Re: Dirtiest Data
by graff (Chancellor) on Jun 13, 2006 at 06:08 UTC
    Coping with faulty data is one of the larger components of my job description -- all kinds of stuff:
    • detecting when an audio file has gone through a "text-mode" unix-to-dos filter;
    • fixing nasty anomalies in a database of 10,000+ names/addresses/phone numbers (and then looking for "duplicates", which of course would have been entered in slightly different ways each time);
    • spotting and removing all the forms of corruption conceivable in an archive of newswire data that was delivered via modem over a 10-year period (line noise like you couldn't imagine);
    • "normalizing" the contents of harvests from diverse web sites in various languages (and multiple encodings per language): get them all into unicode and convert them all to a simple, uniform xml text format;
    • ... others too numerous to mention

    Throughout it all, the most important thing for me has been to have a good set of diagnostic tools. The one tool I tend to use most often, as a first resort in the widest range of tasks, simply prints out a byte-value histogram, either as a 256-line list or as a nice 8-column x 32-row table, with an optional summary that counts up character categories like "printable ascii", "non-printable ascii", "8-bit", "iso-printable 8-bit", "digits", "whitespace", etc.

    With practice, you can figure out quite a lot about any sort of data just by viewing the distribution of byte values this way. If you have a specific expectation of what the data is supposed to be (ulaw audio? pcm? ascii text? text in some given language and encoding?), the byte histogram can tell you right away whether there's anything "out of band" (e.g. text shouldn't contain \x7f or null bytes, among other things), and whether any particular byte values have an unexpectedly high or low frequency of occurrence ("hmmm, too many \x0d bytes in this audio file..." or "this xml file has different quantities for '<' and '>'...")
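The tool itself is in C (as its author notes below), but the idea fits in a short Perl sketch; the category names here are guesses at the summary described above:

```perl
use strict;
use warnings;

# Count every byte value in a file (read in raw mode so line-ending
# translation can't skew exactly the numbers we are looking for).
sub byte_histogram {
    my ($path) = @_;
    my @count = (0) x 256;
    open my $fh, '<:raw', $path or die "$path: $!";
    while (read $fh, my $buf, 65536) {
        $count[$_]++ for unpack 'C*', $buf;
    }
    close $fh;
    return \@count;
}

# Roll the histogram up into rough character categories.
sub summarize {
    my ($count) = @_;
    my %cat = (printable_ascii => 0, control_ascii => 0, eight_bit => 0);
    for my $byte (0 .. 255) {
        my $n = $count->[$byte] or next;
        if    ($byte >= 0x20 && $byte <= 0x7e) { $cat{printable_ascii} += $n }
        elsif ($byte < 0x80)                   { $cat{control_ascii}   += $n }
        else                                   { $cat{eight_bit}       += $n }
    }
    return \%cat;
}
```

For a file that is supposed to be plain ASCII text, any nonzero eight_bit count (or unexpected control_ascii bytes beyond tabs and newlines) is exactly the "out of band" signal described above.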

    After that, additional diagnostic tools tend to get more "specialized" (ad hoc). But among these, the next most generic one produces a code-point histogram for wide-character text data in any of several distinct multi-byte encodings, and also reports any encoding errors that it finds -- good for knowing when a utf8 file alleged to be Russian happens to contain Arabic characters, and so on...
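That second, code-point-level tool could be sketched along these lines; the particular scripts singled out here are just for illustration, and a real tool would report decoding errors rather than dying on the first one as FB_CROAK does:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Tally decoded code points by Unicode script, so a run of Arabic
# characters in an allegedly Russian file stands out immediately.
sub script_histogram {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes, FB_CROAK);
    my %hist;
    for my $ch (split //, $text) {
        if    ($ch =~ /\p{Cyrillic}/) { $hist{Cyrillic}++ }
        elsif ($ch =~ /\p{Arabic}/)   { $hist{Arabic}++   }
        elsif ($ch =~ /\p{Latin}/)    { $hist{Latin}++    }
        else                          { $hist{Other}++    }
    }
    return \%hist;
}
```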

      The one tool I tend to use most often, as a first resort in the widest range of tasks, simply prints out a byte-value histogram, either as a 256-line list or as a nice 8-column x 32-row table, with an optional summary that counts up character categories like "printable ascii", "non-printable ascii", "8-bit", "iso-printable 8-bit", "digits", "whitespace", etc.

      Wow. Too bad I can only ++ once.

      Any chance there's a CPAN module to do that? If not, you should definitely write it! Moreover, this sounds like a great talk to give at a seminar or conference. Or a great article for The Perl Review.


      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        Any chance there's a CPAN module to do that? If not, you should definitely write it!

        um... well, <confession> the tool I referred to there is one that I actually wrote in C (so long ago, it was before I learned Perl) </confession>. "It ain't broke", so I've had no need to rewrite it. I sincerely apologize if it was inappropriate to discuss it here.

        Obviously a good Perl version to do the same thing would be a lot fewer lines of code than my C version, and most likely would not be significantly slower. But for the time being, I'm sorry that I must "leave it as an exercise for the reader..."

        (Update: I'm happy to share the C code with anyone who might want to try it out -- you can download it here: -- again, please forgive me for straying off-topic to non-Perl tools, and accept it in the spirit of PerlMonks, as an opportunity to adapt and enhance it in Perl.)

Re: Dirtiest Data
by CountZero (Bishop) on Jun 12, 2006 at 16:09 UTC
    I have a number of Perl scripts which parse insurance claims data.

    Some underwriters manage to change their format EVERY time they send us some data. Worst are those who send us Excel files, as it is by far too easy for them to change things: swapping some columns around; changing the column titles; adding rows with sub-totals; ... The variations are without limit, and no sooner have I changed a script to take a "new" format into account than they revert to a previous one. The CVS system has saved my sanity more than once!
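One common defence against column-shuffling is to look fields up by header name instead of position. A sketch, with a naive comma split standing in for a real CSV parser such as Text::CSV:

```perl
use strict;
use warnings;

# Build a case-insensitive map from header name to column index,
# so swapped columns stop mattering (renamed ones still hurt).
sub header_map {
    my ($header_line) = @_;
    chomp $header_line;
    my @names = split /,/, $header_line;
    my %pos;
    $pos{ lc $names[$_] } = $_ for 0 .. $#names;
    return \%pos;
}

# Fetch a named field from a data row using the header map.
sub field {
    my ($pos, $row, $name) = @_;
    chomp $row;
    my @cells = split /,/, $row;
    my $i = $pos->{ lc $name };
    die "no such column: $name" unless defined $i;
    return $cells[$i];
}
```

With headers "Claim,Insured,Amount" or "Amount,Claim,Insured", field(..., 'Amount') finds the right cell either way.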


    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Or worse, in Excel: if they try to sort and don't select all the columns, they mix-n-match the data, e.g. everyone's name or address gets jumbled with respect to their id.

      I was given a pile of shi^H^H^Hdata once where the name could be any of the following (sal==Salutation):
      First Last
      Last First
      Last, First
      F Last
      Sal Last
      Sal First Last
      Last, Sal
      Last Sal
      We just printed it and had someone visually match up with the ids.
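A heuristic sketch covering a few of those layouts. The salutation list is invented for illustration, and the genuinely ambiguous rows ("Last Sal" vs. "First Last") are exactly why the eyeballs won:

```perl
use strict;
use warnings;

# Hypothetical salutation list -- a real one would come from the data.
my %SAL;
$SAL{$_} = 1 for qw(Mr Mrs Ms Dr Rev);

sub parse_name {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;
    # "Last, First" and "Last, Sal"
    if ($name =~ /^(\S+),\s*(\S+)$/) {
        return $SAL{$2} ? { last => $1, sal   => $2 }
                        : { last => $1, first => $2 };
    }
    my @w = split /\s+/, $name;
    # "Sal Last" and "Sal First Last"
    if (@w > 1 && $SAL{ $w[0] }) {
        my $sal = shift @w;
        return @w == 1 ? { sal => $sal, last => $w[0] }
                       : { sal => $sal, first => $w[0], last => $w[-1] };
    }
    # "First Last" and "F Last" (a single word is both first and last)
    return { first => $w[0], last => $w[-1] };
}
```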
      Heh heh. That reminds me of a client that my last company had. They would download their data into Excel, make any number of changes, then send us back the file to import. Inevitably the date fields had been mangled by Excel. Their web interface allowed them to make the changes they were making, but somehow they always insisted on doing it the hard way (for us, anyway.)


      Sounds a lot like my time working at a .com startup processing inventory files for electronic component traders.

      One company managed to, in one week, provide three versions of their inventory: one in CSV, one in tab-separated form, and one in Excel format. Each had the columns in a different order. Each had different columns. Each had a different number of rows of contact information and notes before the data started. Utterly insane.

      That .com eventually folded (mostly because it depended on hitting critical mass with people searching and people uploading inventories). In retrospect, a better business plan might have been to write a FOSS inventory management system with a means to share inventories and search other people's inventories using a central site - and then charge for providing the central site and for support for the app. Hmm. I'm drifting, I'll stop now.

Re: Dirtiest Data
by mojotoad (Monsignor) on Jun 12, 2006 at 16:21 UTC
Re: Dirtiest Data
by perrin (Chancellor) on Jun 13, 2006 at 04:58 UTC
    I have to say, this is one area where Perl is really amazing. I can recall several times when some regexes and the sheer ease of creating code in Perl allowed me to parse formats that other people looked at and gave up on. Excel files with columns that kept moving around, multiple values that mean the same thing, and even mis-spelled names that were supposed to match (String::Approx!). The Java programmers I was working with would never have managed it.
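String::Approx wraps agrep-style approximate matching; the underlying idea is plain edit distance, which itself fits in a few lines. A sketch of that idea (the CPAN module is far faster):

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming Levenshtein distance between two strings.
sub edit_distance {
    my ($a, $b) = @_;
    my @prev = (0 .. length $b);
    for my $i (1 .. length $a) {
        my @cur = ($i);
        for my $j (1 .. length $b) {
            my $cost = substr($a, $i - 1, 1) eq substr($b, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return the known names within $max edits of a (possibly mis-spelled) name.
sub fuzzy_match {
    my ($name, $max, @known) = @_;
    return grep { edit_distance(lc $name, lc $_) <= $max } @known;
}
```

For instance, fuzzy_match('Jonhson', 2, 'Johnson', 'Smith') returns just ('Johnson').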
Re: Dirtiest Data
by wazoox (Prior) on Jun 12, 2006 at 17:09 UTC
    I'm currently writing a set of modules to drive Rimage CD burners. The API is horrid; it's driven through binary files with inconsistent field formatting:
    • some numeric fields are bytes, while others are strings of ASCII digits;
    • all text fields are space-padded, including filenames (what the heck happens if the last character of a filename is a space?);
    • there is an incredible number of fields with meaningless names and even more meaningless values:
    • for instance, the field CONTROL 1 can have the values 0, 1, 2, M or P (there is of course a CONTROL 2 field; apparently these carry all the functions that no other field was made for).
    This file format is some sort of digital salad...
Re: Dirtiest Data
by Anonymous Monk on Jun 13, 2006 at 18:30 UTC
    So, what's your dirtiest data experience?

    I once tried to parse perl source code. It's hard! I see where "only perl can parse Perl" comes from! :-(

Re: Dirtiest Data
by hesco (Deacon) on Jun 19, 2006 at 02:38 UTC
    What a sweet little thread this has turned out to be. I had thought I was some sort of hot showy hacker when I learned my first few things about sed. And even now I only turn to perl at the point when sed and grep fail me.

    But it had not occurred to me how many sorts and levels of data these tools would be useful for accessing. Never mind being crippled by coming from a country that thinks its people are literate if the only language we know is the colonial tongue imported some five hundred years ago. But that too shall pass. Parsing music with perl? Finding Arabic among Russian in unknown data? That sounds cool.

    -- Hugh

    if( $lal && $lol ) { $life++; }
Re: Dirtiest Data
by jesuashok (Curate) on Jun 13, 2006 at 13:27 UTC
    Hi monks,

    I too have had a very terrible experience with dirty data. The issues I faced:
    1) Records were split across multiple lines; sometimes 3 lines, sometimes more. I had to add some intelligence to my script to handle that, by finding a unique marker in the data file that showed where each record began.

    2) The date field in the data file was horrible as well: sometimes it would be mmddyy, sometimes yymmdd, and so on. This drove us mad and made the data very difficult to load into Oracle, because Oracle will not accept the malformed dates. Perl helped a lot in solving all those kinds of issues.
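A guessing sketch for those six-digit dates. The century pivot is an arbitrary assumption, and genuinely ambiguous values (010203 reads both ways) need a per-file decision, not a per-record guess:

```perl
use strict;
use warnings;

# Arbitrary two-digit-year pivot -- an assumption for illustration.
sub expand_year {
    my ($yy) = @_;
    return $yy > 50 ? "19$yy" : "20$yy";
}

# Guess whether a six-digit date is mmddyy or yymmdd and return
# yyyy-mm-dd, or undef when neither reading is plausible.
sub normalize_date {
    my ($d) = @_;
    return undef unless $d =~ /^\d{6}$/;
    my ($a, $b, $c) = unpack 'A2 A2 A2', $d;
    if ($a >= 1 && $a <= 12 && $b >= 1 && $b <= 31) {    # mmddyy
        return expand_year($c) . "-$a-$b";
    }
    if ($b >= 1 && $b <= 12 && $c >= 1 && $c <= 31) {    # yymmdd
        return expand_year($a) . "-$b-$c";
    }
    return undef;
}
```

So normalize_date('061306') gives '2006-06-13', normalize_date('990612') gives '1999-06-12', and nonsense like '999999' comes back undef for a human to look at.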

    "Keep pouring your ideas"

Node Type: perlmeditation [id://554841]
Approved by Mutant
Front-paged by planetscape