
Re^4: Slow find/replace hex in Perl win32

by rickyboone (Novice)
on Sep 30, 2010 at 00:13 UTC


in reply to Re^3: Slow find/replace hex in Perl win32
in thread Slow find/replace hex in Perl win32

Sorry about the delay... meetings.

The file processed quickly, but didn't seem to get through the whole file. I'm assuming the change only let the process work through the first 65KB of the file?

I'm trying to have the file processed as a constant, binary stream. I don't need Perl or Windows to perform any EOL conversions or to work on a line-by-line basis, for example. The intent is for the script to just find the hex string, replace it with another, and leave the rest of the file intact.
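
For reference, here is a minimal sketch of that kind of binary find/replace (file names are hypothetical). It opens both files in raw mode so no EOL translation happens, and it slurps the whole file, so it assumes the machine has enough free RAM to hold it:

    # Minimal sketch: whole-file binary find/replace (hypothetical file names).
    use strict;
    use warnings;

    my ( $in, $out ) = ( 'infile.bin', 'outfile.bin' );

    open my $ifh, '<:raw', $in  or die "open $in: $!";
    open my $ofh, '>:raw', $out or die "open $out: $!";

    my $data = do { local $/; <$ifh> };    # slurp: read the whole file at once

    # Replace every occurrence of 00 42 00 11 with 00 42 00 F0,
    # leaving the rest of the file untouched.
    $data =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g;

    print {$ofh} $data;
    close $ofh or die "close $out: $!";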


Replies are listed 'Best First'.
Re^5: Slow find/replace hex in Perl win32
by BrowserUk (Patriarch) on Sep 30, 2010 at 02:39 UTC
    I'm assuming the change only let the process work through the first 65KB of the file?

    No. It did process the whole file, but in 64k chunks.

    The reason it ran more quickly is that, without that change, -p will load a file that contains no newlines as one huge single line.
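
    To make the mechanism concrete, here is a small illustration (hypothetical file name) of what the BEGIN{ $/ = \65536 } part of the one-liner does: setting $/ to a reference to an integer puts the readline operator into record mode, so each read returns a fixed-size chunk instead of a line, and -p simply loops over those chunks:

    use strict;
    use warnings;

    open my $fh, '<:raw', 'infile.bin' or die "open: $!";

    local $/ = \65536;    # record mode: each read returns a 64k chunk, newlines ignored
    while ( my $chunk = <$fh> ) {
        # Note: a match that straddles a chunk boundary is missed -- see below.
        $chunk =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g;
        print $chunk;     # to STDOUT, as -p would do
    }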

    As pointed out above, the problem with processing the file in chunks is that if the search term straddles a 64k chunk boundary--say, two bytes at the end of one chunk and two at the beginning of the next--then the search term won't match and the substitution won't be made.

    The really simple solution to that is to process the file twice, with different buffer sizes chosen to be relatively prime. You might use 1MB for the first pass and 1MB - 3 bytes for the second. This ensures that any match missed at a chunk boundary in the first pass will not fall on a boundary in the second pass--up to about 1024GB, anyway.

    So,

    perl -e"BEGIN{$/=\(1024**2) }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" infile >outfile1 perl -e"BEGIN{$/=\(1024**2-3)}" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" outfile1 >outfile2

    Two passes is obviously slower than one, but much faster than loading the whole damn file into ram on a constrained machine.

    This last point is what I assume to be the cause of the performance differential between your Linux and Windows set-ups. If the former has sufficient free ram to allow the whole file to be loaded in one pass, and the latter does not and moves into swapping, the difference is explained.

    Another alternative would be to use a sliding buffer, but that is too complicated for a one-liner, and often doesn't yield enough of a performance gain to beat the two-pass approach.
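
    For completeness, here is a rough sketch of that sliding-buffer approach (hypothetical file names, and it assumes a fixed-length search pattern): keep the last (pattern length - 1) bytes of each chunk and prepend them to the next, so a match that straddles a chunk boundary is still found in a single pass:

    use strict;
    use warnings;

    my $pat     = "\x00\x42\x00\x11";
    my $rep     = "\x00\x42\x00\xf0";
    my $overlap = length($pat) - 1;    # bytes carried over between chunks
    my $size    = 1024**2;             # read 1MB at a time

    open my $ifh, '<:raw', 'infile.bin'  or die "open: $!";
    open my $ofh, '>:raw', 'outfile.bin' or die "open: $!";

    my $buf = '';
    while ( read( $ifh, my $data, $size ) ) {
        $buf .= $data;
        $buf =~ s/\Q$pat\E/$rep/g;

        # Write out everything except the last $overlap bytes; those might be
        # the start of a match completed by the next chunk, so hold them back.
        print {$ofh} substr( $buf, 0, length($buf) - $overlap, '' )
            if length($buf) > $overlap;
    }
    print {$ofh} $buf;    # flush whatever is left at end of file
    close $ofh or die "close: $!";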


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
