PerlMonks  

Compare large files

by boardryder (Novice)
on Jul 09, 2009 at 19:35 UTC ( [id://778679] )

boardryder has asked for the wisdom of the Perl Monks concerning the following question:

I've been doing a bit of research here and have found a number of nodes with helpful information, but none, I believe, quite represents what I'm trying to accomplish.

I have two similar large files, each 1GB+ in size. I need to compare file1 to file2 on their common keys to find differences in the values between them. My initial thought was to load both into separate hashes and iterate hash1 over hash2; however, files of this size cannot be stored in memory. I've read of a few modules that may be of help, and there are many suggestions to use an RDBMS. Looking for some help and direction. Thanks!
sub process {
    open (TODAY_IN, $out) or die "Cannot open $out: $!";
    # while(<FH>) reads one line at a time; foreach(<FH>) would slurp
    # the whole file into a list first.
    while (my $line_t = <TODAY_IN>) {
        ($path, $type, $size, $link, $nfiles) = split(/\s+/, $line_t);
        $today{$path}{'Type'}   = $type;
        $today{$path}{'Size'}   = $size;
        $today{$path}{'Link'}   = $link;
        $today{$path}{'NFILES'} = $nfiles;
    }
    close TODAY_IN;

    open (YESTERDAY_IN, $comp) or die "Cannot open $comp: $!";
    while (my $line_y = <YESTERDAY_IN>) {
        ($path, $type, $size, $link, $nfiles) = split(/\s+/, $line_y);
        $yesterday{$path}{'Type'}   = $type;
        $yesterday{$path}{'Size'}   = $size;
        $yesterday{$path}{'Link'}   = $link;
        $yesterday{$path}{'NFILES'} = $nfiles;
    }
    close YESTERDAY_IN;

    # Both subs work on the package-global hashes; the argument list
    # here is flattened and never used by diff().
    diff(%today, %yesterday);
}

sub diff {
    open (COMP, ">$final") or die "Cannot open $final: $!";
    foreach $key (keys %today) {
        if (exists $yesterday{$key}) {
            $size_t   = $today{$key}{'Size'};
            $size_y   = $yesterday{$key}{'Size'};
            $nfiles_t = $today{$key}{'NFILES'};
            $nfiles_y = $yesterday{$key}{'NFILES'};
            if ($size_y > 0 && $size_t > 0) {
                if ($size_t > $size_y) {
                    my $diff_t = (1 - ($size_y / $size_t)) * 100;
                    if ($diff_t >= $max_size_diff) {
                        $diffOut{$key}{'SizeYest'}  = $size_y;
                        $diffOut{$key}{'SizeToday'} = $size_t;
                        $diffOut{$key}{'SizeDiff'}  = $diff_t;
                        print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
                    }
                }
                elsif ($size_y > $size_t) {
                    my $diff_y = (1 - ($size_t / $size_y)) * 100;
                    if ($diff_y >= $max_size_diff) {
                        $diffOut{$key}{'SizeToday'} = $size_t;
                        $diffOut{$key}{'SizeYest'}  = $size_y;
                        $diffOut{$key}{'SizeDiff'}  = $diff_y;
                        print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
                    }
                }
                if (-d $key) {
                    if ($nfiles_y > 0 && $nfiles_t > 0) {
                        $diffFiles = $nfiles_t - $nfiles_y;
                        if ($diffFiles > $max_file_diff) {
                            $diffOut{$key}{'FileDiff'} = $diffFiles;
                            print COMP "$key\tFDIFF:$diffOut{$key}{'FileDiff'}\n";
                        }
                    }
                }
            }
        }
        else {
            # New path: use this key's size, not whatever $size_t was
            # left over from the previous iteration.
            $diffOut{$key}{'SizeToday'} = $today{$key}{'Size'};
            $diffOut{$key}{'SizeYest'}  = 0;
            $diffOut{$key}{'SizeDiff'}  = "New";
            print COMP "$key\tYEST:$diffOut{$key}{'SizeYest'}\tTOD:$diffOut{$key}{'SizeToday'}\tDIFF:$diffOut{$key}{'SizeDiff'}\n";
        }
    }
    close COMP;
}
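
For the "modules that may be of help" angle, one option that comes up often is a disk-backed hash via the core DB_File module: the tied hash lives on disk, so the table never has to fit in RAM. Below is a minimal sketch of the loading side, assuming the same whitespace-separated format; the file names and the packed-value layout are illustrative, not part of the script above.

use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # ties a hash to an on-disk Berkeley DB file

# Tie %today to a disk file so the key/value pairs never sit in RAM.
tie my %today, 'DB_File', 'today.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie today.db: $!";

open my $in, '<', 'today.txt' or die "Cannot open today.txt: $!";
while (my $line = <$in>) {
    my ($path, $type, $size, $link, $nfiles) = split /\s+/, $line;
    # A plain DB_File tie stores flat strings, so pack the four
    # fields into one value instead of using a nested hash.
    $today{$path} = join "\t", $type, $size, $link, $nfiles;
}
close $in;
untie %today;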

Replies are listed 'Best First'.
Re: Compare large files
by graff (Chancellor) on Jul 09, 2009 at 23:06 UTC
    I recall making the suggestion in your previous thread that got you started down this path, and part of that suggestion was to make sure these output files are created in sorted order, so that you would not have to sort them later and comparing the two files would be much easier.

    But based on the data sample you showed in one of your replies here, it looks like the files are not sorted. So the problem you need to fix is in the program that produces these files -- they should be written in sorted order.

    Then you can use the standard "diff" utility, which will correctly show:

    • lines in file1 absent from file2
    • lines in file2 not present in file1
    • lines where some portion of file1 content differs from file2 content

    And "diff" already knows how to manage big files -- it might take a while, but I'm pretty sure it will finish.

    Also, it might help if you consider breaking your outputs into smaller pieces. How hard/bad would it be to have your directory scan process create 10 files of 100 MB each on average (or 100 files of 10 MB each on average)? I think the directory structure should provide a sensible way to do that...

    (Update: In fact, it might be worthwhile to simply create one tabulation file per directory -- I believe you start with a list of the directories being scanned, so the task becomes: create and compare table files for each directory in the list. That should be pretty simple to maintain, and will run as quickly as any other approach.)

    One last point, again based on the data sample you posted above. Are you sure that all differences are equally important and relevant? If yes, then using diff is fine. If not, either adjust the script that creates these files to avoid cases where unimportant differences are present in the data, or else you'll have to write your own customized Perl variant of diff (or, better yet, a filter on the output from diff) to exclude the unimportant differences.
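
    Such a filter can be a one-liner. As a purely hypothetical example, if everything under /tmp were among the unimportant differences (the pattern is made up, not from this thread), something like this would drop those lines from diff's output:

    $ diff today.s yesterday.s | perl -ne 'print unless m{^[<>] /tmp/}'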

      I did try creating one file per directory, based on the excellent example you provided in my other thread. After completing half of my directory scan, it had created nearly 500,000 files and took nearly 30 minutes just to do a listing, so I started back here again.

      It looks like I have several ideas to implement now, and my options are clear. I'm going to attempt sorting the two large files and then use comm -3 to filter the diffs, as that seems the most straightforward way to at least get this working.

      Thanks All.
Re: Compare large files
by JavaFan (Canon) on Jul 09, 2009 at 19:58 UTC
    I'm not quite sure how you're comparing things, but won't something like the following do:
    $ perl -ple 's/\s+/ /g' today | sort > today.s
    $ perl -ple 's/\s+/ /g' yesterday | sort > yesterday.s
    $ comm -3 today.s yesterday.s
    Or as a one liner (bash syntax):
    $ comm -3 <(perl -ple 's/\s+/ /g' today | sort) <(perl -ple 's/\s+/ /g' yesterday | sort)
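
    For reference, comm merges two sorted files into three columns: lines only in the first file, lines only in the second (indented by one tab), and lines common to both; -3 suppresses the common column. Assuming a byte-order sort, the sample data posted below would come out as:

    $ comm -3 today.s yesterday.s
    /home/users/ DIR 5555
    	/home/users/ DIR 5888
    /home/users/file FILE 324
    	/home/users/file FILE 555

    The tab-indented lines are the ones unique to the second file.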
      I need to try it out, but it does look like it could work. The only catch is that I would need to create two new files, using more disk space, and I was hoping to use pure Perl. I also need to see how to work with the output of comm.

      My data looks as follows. The key I was referring to is the path of a file or directory; file/directory sizes can change, and a file or directory may or may not exist in file1 compared to file2.
      File1:
      /home/users/ DIR 5555
      /home/users/file FILE 324
      /home/users/file2 FILE 435
      ....
      ....

      File2:
      /home/users/file FILE 555
      /home/users/ DIR 5888
      /home/users/file2 FILE 435
      ....
      ....

      If you can't use any command line tools (such as comm, as suggested), sort both files (using the sort utility) and read lines from both files, comparing them on the fly. This will enable you to compare arbitrarily large files with minimal overhead.
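
      A minimal sketch of that read-in-parallel approach, assuming both files are already sorted on the path (first) field and using placeholder file names:

      use strict;
      use warnings;

      open my $t, '<', 'today.sorted'     or die "today.sorted: $!";
      open my $y, '<', 'yesterday.sorted' or die "yesterday.sorted: $!";

      my $lt = <$t>;
      my $ly = <$y>;

      # Walk both sorted files in step; only two lines are in memory at once.
      while (defined $lt && defined $ly) {
          my ($pt) = split /\s+/, $lt;    # path is the first field
          my ($py) = split /\s+/, $ly;
          if    ($pt lt $py) { print "only in today:     $lt"; $lt = <$t>; }
          elsif ($pt gt $py) { print "only in yesterday: $ly"; $ly = <$y>; }
          else {
              # Same path in both; report only if some field changed.
              if ($lt ne $ly) {
                  print "changed, was: $ly";
                  print "changed, now: $lt";
              }
              $lt = <$t>;
              $ly = <$y>;
          }
      }

      # One file may run out first; drain whatever is left in the other.
      while (defined $lt) { print "only in today:     $lt"; $lt = <$t>; }
      while (defined $ly) { print "only in yesterday: $ly"; $ly = <$y>; }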

        Be warned: on Linux you should generally set the environment variable LC_ALL to C before using sort. Otherwise its idea of sorted order does potentially inconvenient things like:
        1,10
        11,1
        1,123
        (What? You were expecting all of the things with ID 1 to be grouped together? Silly programmer, read the documentation!)
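
        In practice that just means prefixing the sort calls:

        $ LC_ALL=C sort today > today.s
        $ LC_ALL=C sort yesterday > yesterday.s

        LC_ALL=C forces a plain byte-by-byte ordering (the same ordering Perl's lt and gt use); run comm under the same locale so the two agree.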
        If you can't use any command line tools, you can't use sort either....
      Would diff -b be a better choice here for ignoring whitespace differences?

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of
