note
erroneousBollock
<blockquote>The problem is the size of the file is large (about 800MB to 2GB)</blockquote>
Repeating what was discussed in the CB: [tye] maintains [mod://Algorithm::Diff], so he'd be your best bet for issues related to the module.<p>
With files that large, though, you may need to change your approach. Assuming Algorithm::Diff is bogging down because of the file size, the goal is to avoid loading so much of the data at once.<p>
<h4>Method 1</h4>
Algorithm::Diff works from two arrays, right? If so, perhaps the easiest thing to do is simply to pass it (references to) two [mod://Tie::File]-tied arrays, so that lines are fetched from disk on demand rather than held in memory.<p>
<small><b>Update: </b> well, it's not that simple. Algorithm::Diff builds a hash to keep track of indexes, which grows (<i>O(?)</i>) with the length of the arrays passed.</small><p>
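For completeness, a minimal sketch of the tied-array approach. The demo files here are tiny stand-ins for the real inputs, and (per the update) Algorithm::Diff's internal index hash will still grow with the input size:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Tie::File;
use Algorithm::Diff qw(diff);

# Two tiny demo files standing in for the real (800MB+) inputs.
my ( $wfh1, $file1 ) = tempfile();
print $wfh1 "alpha\nbeta\ngamma\n";
close $wfh1;
my ( $wfh2, $file2 ) = tempfile();
print $wfh2 "alpha\nBETA\ngamma\n";
close $wfh2;

# Tie each file to an array; records are fetched lazily from disk.
tie my @a, 'Tie::File', $file1 or die "$file1: $!";
tie my @b, 'Tie::File', $file2 or die "$file2: $!";

# Algorithm::Diff only wants array refs, so the tied arrays drop in.
my @hunks = diff( \@a, \@b );

my @out;
for my $hunk (@hunks) {
    for my $change (@$hunk) {
        my ( $op, $index, $text ) = @$change;  # '+'/'-', 0-based line, text
        push @out, "$op $index $text";
    }
}
print "$_\n" for @out;    # "- 1 beta" then "+ 1 BETA"
```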
<h4>Method 2</h4>
If that doesn't work (or is too slow), I'd do something like this (pseudocode):<p>
<small><code>
declare @differences
open file1 and file2
read chunk1 from file1 (64 MB)
read chunk2 from file2 (64 MB)
while (chunk1 contains data)
    compute h1: md5_hex of chunk1
    compute h2: md5_hex of chunk2
    if h1 != h2
        compute @diff: pass chunk1 and chunk2 to Algorithm::Diff
        append @diff to @differences
    end-of-if
    read chunk1 from file1 (64 MB)
    read chunk2 from file2 (64 MB)
end-of-while
close file1 and file2
# maybe combine adjacent deltas in @differences (eg: by line number)
report @differences
</code></small>
That way Algorithm::Diff only ever deals with 2 &times; 64 MB of data at a time. If two adjacent chunks both contain differences between the files, you might investigate whether there's some way to combine those differences.<p>
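The pseudocode might translate to Perl along these lines. One assumption here: chunks are measured in lines rather than bytes, so they always break on line boundaries (64 MB of typical text is on the order of a million lines; the demo uses a tiny chunk size and throwaway files):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Algorithm::Diff qw(diff);
use File::Temp qw(tempfile);

my $CHUNK = 4;    # lines per chunk; tiny for the demo

# Read up to $CHUNK lines from a handle, returning an array ref.
sub read_chunk {
    my ($fh) = @_;
    my @lines;
    while ( @lines < $CHUNK and defined( my $line = <$fh> ) ) {
        push @lines, $line;
    }
    return \@lines;
}

# Walk both files a chunk at a time; run the expensive diff only
# on chunk pairs whose MD5 digests disagree, and record each delta
# together with the line offset of its chunk.
sub chunked_diff {
    my ( $path1, $path2 ) = @_;
    open my $fh1, '<', $path1 or die "$path1: $!";
    open my $fh2, '<', $path2 or die "$path2: $!";
    my @differences;
    my $base = 0;    # line offset of the current chunk
    while (1) {
        my $c1 = read_chunk($fh1);
        my $c2 = read_chunk($fh2);
        last unless @$c1 or @$c2;
        if ( md5_hex(@$c1) ne md5_hex(@$c2) ) {
            push @differences, [ $base, diff( $c1, $c2 ) ];
        }
        $base += $CHUNK;
    }
    close $fh1;
    close $fh2;
    return \@differences;
}

# Demo: two small files differing only at line 6.
my ( $wfh1, $f1 ) = tempfile();
print $wfh1 map {"line $_\n"} 1 .. 8;
close $wfh1;
my ( $wfh2, $f2 ) = tempfile();
print $wfh2 map { $_ == 6 ? "changed\n" : "line $_\n" } 1 .. 8;
close $wfh2;

my $deltas = chunked_diff( $f1, $f2 );
printf "%d chunk(s) differ, first at line offset %d\n",
    scalar @$deltas, $deltas->[0][0];
```

One caveat worth noting: a single inserted or deleted line shifts every later chunk boundary, so this scheme works best when the differences are in-place edits rather than insertions or deletions.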
-David
652114