CSV Diff Utility

by Limbic~Region (Chancellor)
on Jun 22, 2004 at 20:57 UTC ( #368857=perlquestion )
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
This is going to be a rather long post so I have done my best to use readmore tags and organize the information.
Preface:
A problem that is circulating around at work is how to reproduce the functionality of a product that is going away. Due to the way things are structured, I am more of a "you go do X" guy than someone involved in the actual solving of the problem. Since I like problem solving, I am going to work on this on my own time and see how it compares to the solution(s) proposed by the team. I have not consulted google beyond a cursory glance, so there may very well be freeware/shareware/commercial wheels out there.

I have come up with the following assumptions and requirements:

Assumptions:

  • The first row of all files will be a header row
  • There will be a key field in each record that will never change
  • The solution will need to be very "user friendly" - perhaps even a GUI

Requirements:

  • Ability to sort records prior to comparison by user chosen field (key)
  • Comparison must support at least two levels
    • Record: Add/Modify/Delete
    • Field(s): Add/Modify/Delete
  • By default, all fields will be compared, but this can be overridden by a user-defined subset of fields
  • Ability to toggle case-sensitive comparison at any level (one field may be case sensitive while another is not)
  • Output must offer at least two user-selectable options
    • Format (plain text, HTML, csv, etc)
    • Field order
  • Ability to select and ignore added/removed columns when comparing at record level

Here is a very rough outline of the program logic I was thinking of:

  • Sort the files by the key field prior to comparison, rewriting each record so that the key field comes first, followed by the remaining fields in ASCIIbetical order (potentially a very expensive upfront operation)
  • Use a weave method so that only two records (one from each file) need be loaded into memory at any one time (a rough sketch follows this outline)
  • Step 1: If the keys match, proceed to individual field comparison if the flag is set (Step 2)
  • If the "new" key is less than the "old" key, this is a newly added record -> get another record from the new file and go back to step 1
  • If the "new" key is greater than the "old" key, the old record has been deleted -> get another record from the old file and go back to step 1
  • Step 2: If the field name matches, proceed to field value matching if the flag is set (Step 3) - Note: If the user has only selected testing at the record level with no other options, it may be worthwhile to compare the raw lines from the files rather than the individual fields
  • Follow same logic as Step 1 except paying attention to field ignore flags
  • Step 3: Determine if the data is new (old is blank and new is not), delete (new is blank and old is not), or modify (both are not empty but not the same) and then get the next record from both the new and old file
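
To make the weave concrete, here is a minimal sketch of Steps 1-3 in Perl. It assumes both files have already been sorted on the key, the header rows stripped, and the key unique; the file names and key column are placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    # Sketch only: inputs pre-sorted on the key column, headers removed,
    # key assumed unique (per the assumptions above).
    my $key_col = 0;    # placeholder for the user-chosen key field

    my $csv = Text::CSV_XS->new({ binary => 1 });
    open my $old_fh, '<', 'old_sorted.csv' or die $!;
    open my $new_fh, '<', 'new_sorted.csv' or die $!;

    my $old = $csv->getline($old_fh);
    my $new = $csv->getline($new_fh);

    while ( defined $old or defined $new ) {
        if ( !defined $new or ( defined $old and $old->[$key_col] lt $new->[$key_col] ) ) {
            print "DELETED: $old->[$key_col]\n";    # record only in the old file
            $old = $csv->getline($old_fh);
        }
        elsif ( !defined $old or $new->[$key_col] lt $old->[$key_col] ) {
            print "ADDED: $new->[$key_col]\n";      # record only in the new file
            $new = $csv->getline($new_fh);
        }
        else {
            compare_fields( $old, $new );           # Steps 2 and 3 happen here
            $old = $csv->getline($old_fh);
            $new = $csv->getline($new_fh);
        }
    }

    sub compare_fields {
        my ( $o, $n ) = @_;
        for my $i ( 0 .. $#$n ) {
            next if ( $o->[$i] // '' ) eq ( $n->[$i] // '' );
            print "MODIFIED: key $n->[$key_col], column $i\n";
        }
    }

The field-ignore flags, case-sensitivity toggles, and output formatting would hang off compare_fields and the print statements.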

So what advice can you offer? Any code snippets, current available products, implementation strategies, etc, etc will all be very appreciated.

Cheers - L~R

Re: CSV Diff Utility
by pbeckingham (Parson) on Jun 22, 2004 at 21:39 UTC

    I think you need to expand your assumption list. Given that I can export unstructured junk in CSV format from many applications, I don't think you can rely on a key field being unique. Certainly the various applications don't enforce it.

      pbeckingham,
      Try to remember that I am not trying to develop a generic utility that can diff any two CSV files. I am developing a solution to replace an existing product that is going away. The developers can work off that assumption because that is how it currently works.

      You did give me food for thought. Since it is unlikely that my solution will be used, I should come up with something that could easily be converted to be generic since, so far, no one has mentioned an existing product.

      Cheers - L~R

Re: CSV Diff Utility
by jZed (Prior) on Jun 22, 2004 at 21:48 UTC
    This may be the old case of the hammer maintainer seeing all things as nails, but if you use DBD::CSV you can leverage Text::CSV_XS's parsing speed and SQL's sorting and selecting options and DBIx::*'s rich set of output formats. If speed is an issue, use DBI::SQL::Nano as DBD::CSV's parser instead of SQL::Statement - it's much more limited but also much faster for simple key-field queries.
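
    A minimal sketch of that route, with an invented file (old_dump.csv) and invented column names, might look like:

    use strict;
    use warnings;
    use DBI;

    # DBD::CSV hands the parsing to Text::CSV_XS and lets SQL do the
    # sorting and column selection.  File and column names are placeholders.
    my $dbh = DBI->connect( 'dbi:CSV:f_dir=.;f_ext=.csv', undef, undef,
                            { RaiseError => 1 } );

    my $sth = $dbh->prepare('SELECT id, name, price FROM old_dump ORDER BY id');
    $sth->execute;

    while ( my $row = $sth->fetchrow_arrayref ) {
        # each row arrives sorted by the key, ready for an interleaved compare
        print join( ',', @$row ), "\n";
    }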

    Or then again, maybe you're asking about search logic rather than software, in which case the answer could depend on the best combination of Perl and SQL processing.

      jZed,
      I absolutely hate it when I see people reply to proposed solutions with "That won't work". Sometimes it is followed by "because <lame excuse>". My favorite is "because I don't have permission to install modules" where the module in question is pure Perl.

      With that said, I do not know if this will work, but I can't go into specifics as to why because of, yes here it comes - the lame excuse you have all been waiting for, my government employer.

      Now, since I already doubt any solution I come up with will be used and it looks like there may not be an existing solution - I intend to fully explore this just because it sounds interesting.

      Cheers - L~R

Re: CSV Diff Utility
by graff (Chancellor) on Jun 23, 2004 at 02:10 UTC
    Sort the files by the key field prior to comparison where the key field will be first followed by the remaining fields in ASCIIbetical order (potentially very expensive upfront operation)

    If you have the unix-style "head", "tail" and "sort" utilities, you may be able to reduce the cost by opening the inputs this way:

    $old_header = `head -1 $old_csv`;   # split that later
    # get $new_header the same way, if it's different from $old_header
    open( OLD, "tail -n +2 $old_csv | sort |" ) or die $!;
    open( NEW, "tail -n +2 $new_csv | sort |" ) or die $!;
    # proceed with interleaved reading as planned...
    Of course, this assumes that each row of csv data does not contain newlines, which may be inappropriate. If there may be newlines within some data fields, you'll need to parse the csv first, then sort; in which case you might consider storing key values and byte offsets/lengths in a hash, so you can sort the hash keys, and rewrite the data records to a sorted file by seeking and reading each record in turn.
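
    A rough sketch of that index-and-seek idea, with placeholder file names and key column (and assuming a unique key), could look like this:

    use strict;
    use warnings;
    use Text::CSV_XS;

    # One parsing pass records where each record starts and how long it is;
    # a second pass rewrites the records in key order by seeking.
    my $key_col = 0;
    my $csv = Text::CSV_XS->new({ binary => 1 });    # copes with embedded newlines

    open my $in, '<', 'old.csv' or die $!;
    my $header = $csv->getline($in);                 # keep the header row aside

    my %index;                                       # key value => [ offset, length ]
    my $pos = tell $in;
    while ( my $row = $csv->getline($in) ) {
        my $end = tell $in;
        $index{ $row->[$key_col] } = [ $pos, $end - $pos ];
        $pos = $end;
    }

    open my $out, '>', 'old_sorted.csv' or die $!;
    $csv->print( $out, $header );
    print {$out} "\n";
    for my $key ( sort keys %index ) {
        my ( $offset, $length ) = @{ $index{$key} };
        seek $in, $offset, 0 or die $!;
        read $in, my $record, $length;
        print {$out} $record;                        # raw record, embedded newlines intact
    }
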
      graff,
      I have considered *nix utilities, but they require more assumptions than just not having embedded newlines. For instance, if the key field is the 3rd column, you need to sort by the 3rd column and not the first. CSV can get quite messy, and even with the power of awk, the script will likely have to be changed for each new type of CSV encountered.

      Originally, I figured if I needed to pre-process the file anyway, I might as well just go ahead and do the sort in Perl. That could potentially be a "bad" idea given the size of these files is currently unknown (at least to me). I really like the idea of indexing information and sorting that rather than the whole file. I will give this some more thought and perhaps will come up with a hybrid.

      Cheers - L~R

Re: CSV Diff Utility
by shotgunefx (Parson) on Jun 23, 2004 at 13:40 UTC
    For record level, perhaps it would be helpful to not only sort the files by the key, but also rearrange the files so that the unignored fields come first (sorted by name) and the fields that are to be ignored (assuming you want to keep them) follow the required fields (sorted by name). Then you could do a cmp of the first fields concatenated for the record-level test.
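
    A small sketch of that idea - building a record-level signature from just the compared fields rather than physically rewriting the files; the field names and ignore list are invented:

    use strict;
    use warnings;

    # Concatenate only the compared fields, in a fixed (sorted) order,
    # so the record-level test is one string comparison.
    sub record_signature {
        my ( $record, $ignore ) = @_;
        my @compared = sort grep { !$ignore->{$_} } keys %$record;
        return join "\0", map { $record->{$_} } @compared;   # "\0" is unlikely to occur in the data
    }

    my %ignore = ( last_modified => 1 );
    my %old = ( id => 42, name => 'foo', last_modified => '2004-06-01' );
    my %new = ( id => 42, name => 'foo', last_modified => '2004-06-22' );

    print record_signature( \%old, \%ignore ) eq record_signature( \%new, \%ignore )
        ? "records match\n"
        : "records differ\n";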


    -Lee

    "To be civilized is to deny one's nature."
      shotgunefx,
      I was rather surprised that I didn't get a half dozen "why don't you just use X" responses. I fully intend to do more research before rolling my own, but if it looks like there is a hole to be filled I will be giving my approach a lot of thought. My rough outline of problem solving was about 15 minutes of thought while I was sitting in a meeting. I will take this, along with others' advice, under careful consideration if I proceed.

      Cheers - L~R

        Good luck. This post reminds me of a catcsv utility I once wrote for joining csv files. I think I'll find it and post it today in snippets.


        -Lee

        "To be civilized is to deny one's nature."
Re: CSV Diff Utility
by swngnmonk (Pilgrim) on Jun 23, 2004 at 16:35 UTC

    Others have said this as well, but the UNIX utilities (sort, diff, sed) will make your life MUCH easier here.

    I dealt with this problem a few years ago when I worked for a now-dead price-comparison site. We were getting CSV/TSV data dumps from online vendors daily; some of these files were 300+MB in size (e.g. 500,000 books), and we only wanted what had changed from the previous dump.

    Our system was a pretty complex perl app, with config files for each vendor that described what the format of the file was, how to clean it up (none of them delivered 100% clean CSV files), what column to sort on, etc.

    The perl app didn't do any actual file processing itself - it was simply an easy way to handle config files and pass arguments to the various UNIX utils. It worked something like this:

    • Remove the header line (if needed) using head.
    • Use sed to clean up any potential issues, reformatting if necessary to make things easier for sort.
    • Use sort to re-order the file based on the unique-key column.
    • If necessary, use uniq to strip out duplicate rows (some of the vendors had multiple entries they didn't even know about).
    • diff the newly-generated file against the last one we processed, to see what changed.
    • Parse the results of the diff to determine which rows were adds, which were deletes, and which were changes. Those became the basis for SQL insert/delete/update statements against the main product DB.

    This saved our bacon. We were drowning in data (about 5GB/day, when our average server was a 400MHz Pentium w/ 256MB of RAM and 10GB of storage), and only about 3-5% of the rows in any given file changed from the previous dump.

    If your data is of any appreciable size, don't do the actual file processing in perl; use the unix utils - they'll be much faster and more memory-efficient than anything you'll do in perl.
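
    A rough sketch of that kind of glue driver, with invented file names, key column, and cleanup rule, and assuming no embedded delimiters or newlines in the data:

    use strict;
    use warnings;

    # Perl only assembles the unix pipeline from a per-vendor config.
    my %config = (
        file      => 'vendor_dump.csv',
        previous  => 'vendor_dump.prev.csv',
        key_col   => 3,                        # 1-based column for sort -k
        delimiter => ',',
        cleanup   => 's/\r$//',                # whatever per-vendor fix-ups are needed
    );

    my $sorted   = "$config{file}.sorted";
    my $pipeline = join ' | ',
        "tail -n +2 $config{file}",            # drop the header line
        "sed -e '$config{cleanup}'",           # per-vendor cleanup
        "sort -t'$config{delimiter}' -k$config{key_col},$config{key_col}",
        'uniq';                                # strip exact duplicate rows
    system("$pipeline > $sorted") == 0 or die "pipeline failed: $?";

    # diff against the previous sorted dump; '>' lines are adds, '<' are
    # deletes, and a change shows up as a paired '<'/'>' block.
    open my $diff, '-|', "diff $config{previous} $sorted" or die $!;
    while (<$diff>) {
        print "ADD: $_"    if /^>/;
        print "DELETE: $_" if /^</;
    }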

      swngnmonk,
      I have to vehemently disagree with you. The one person who suggested utils was only trying to offload the sorting and left the rest of the "actual file-processing" to Perl. graff realized one limitation - embedded newlines. I went on to list others.

      Use sed to clean up any potential issues
      What are you talking about?
      CSV is like HTML - it should be parsed, not have a single regex applied to it. While it is possible to build a sed parsing script, it is a nightmare to maintain and does not come close to having the logic functionality of Perl.

      Use sort to re-order the file based on the unique-key column
      How exactly do you propose to do that? CSV is not fixed width, and with embedded quote characters being escaped it doesn't sound too easy. While I think awk might be up to the challenge, it seems that each set of CSV files would require a unique solution.

      diff the newly-generated file against the last one we processed
      I guess you didn't read the requirements section of my post: being able to have the user specify columns to be ignored in the comparison, being able to select which columns are case sensitive, and being able to get which individual fields are different and not just the entire record.

      Finally, I indicated that this had to be "user friendly" and never mentioned speed as a bonus let alone a requirement. I am quite familiar with *nix utilities and don't hesitate to use them when they are the "right tool" for the job. They can't make my "life MUCH easier here" if they fail to meet the requirements.

      L~R

        I'll admit, I skimmed your requirements, and focused on the problem of diffing data files rather than specifics.

        That being said, you're combining two different problems:

        • Being able to compare two sets of data based on arbitrary criteria
        • User-friendliness

        The first is a complex problem, and if you're trying to do it in a simple fashion, you should probably use some kind of relational database to solve it. Others have already addressed that in previous postings.

        You never made mention of the size of the data or performance requirements. Comparing data files against each other becomes a very expensive process as the files get large. You mentioned that you only want to process one line at a time for each file. How do you know you're on the right line? How do you know the file is sorted properly?

        The solution we used worked for us, because we needed to know if *ANYTHING* changed. That, and it had to be lightweight and extremely fast. User-friendliness wasn't an issue, because all it had to do was generate a file of SQL statements.

        The issue of user-friendliness is a totally separate problem. It sounds to me like your web-based UI should be a very simple application that presents sort/diff criteria to the user, and passes those criteria to your data-processing package. In your place, I'd use CGI::Application to create that lightweight front-end for the user.

        While your other comments aren't relevant anymore, I feel like I should answer them in some part. sed was not being used to parse the CSV, only to re-format it into something that made life easier for sort.

        Newlines were never an issue, because newlines aren't legal CSV. When newlines appeared, we complained to the vendor to get them fixed. If they didn't fix it, we made a note of the processing error and chucked the row.

        sort takes delimiter arguments - that is how we were able to sort on an arbitrary field.
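
        For example, assuming the key is the third column and there are no quoted/embedded commas in the data, the pre-sorted stream can be read straight into perl:

        use strict;
        use warnings;

        # sort on the third comma-delimited column; assumes no quoted commas
        open my $sorted, '-|', 'sort', '-t,', '-k3,3', 'new.csv' or die $!;
        print while <$sorted>;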

        Whatever you end up doing, good luck - it's definitely a challenging problem.

Re: CSV Diff Utility
by nubbel (Novice) on Oct 01, 2007 at 15:12 UTC
    Maybe http://csvdiff.sourceforge.net/ is the answer to this (old) question: "csvdiff is a perl script to compare/diff two (comma) separated files with each other. The part that is different from standard diff is that you'll get the number of the record where the difference occurs and the field/column which is different. The separator can be set to the value you want it to be, not just comma. Also you can provide a third file which contains the column names in one(!) line, separated by your separator. If you do so, column names are shown if a difference is found. I wrote csvdiff to compare two database unload files, but you can use it for any kind of file which has separators."
