PerlMonks  

Re: Slow find/replace hex in Perl win32

by TomDLux (Vicar)
on Sep 29, 2010 at 20:54 UTC


in reply to Slow find/replace hex in Perl win32

To figure out what is happening, I would start by adding some print statements. Start with one just before and one just after opening the file, one just after reading a line, ...

That should help narrow down where your processing is hanging up.

Once you've got it running through the loop, and it seems to be working, comment out the prints and time the program processing a 1 line file to completion, then 10, 100, 1000, 10000, 100000 line data files. What's the trend? What's the expected processing time for 42 million lines?
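A minimal sketch of that timing experiment, assuming synthetic input built in memory rather than the real data file. The `time_lines` helper and the generated lines are hypothetical, and the extrapolation assumes the cost grows roughly linearly with the number of lines, which is exactly what the measurements are meant to check:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: build $n synthetic lines in memory, run the
# substitution over them line by line, and return (elapsed seconds,
# number of substitutions made).
sub time_lines {
    my ($n) = @_;
    my $data = join '', map { "line $_ \x00\x42\x00\x11\n" } 1 .. $n;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    my $start = time;
    my $count = 0;
    while (my $line = <$fh>) {
        $count += ($line =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g);
    }
    close $fh;
    return (time - $start, $count);
}

for my $n (1, 10, 100, 1_000, 10_000, 100_000) {
    my ($secs, $count) = time_lines($n);
    # Naive linear extrapolation to 42 million lines.
    printf "%7d lines: %.4f s (linear estimate for 42M lines: %.1f s)\n",
        $n, $secs, $secs / $n * 42_000_000;
}
```

If the per-line estimate keeps climbing as the input grows, the cost is not linear and the loop itself deserves a closer look.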

As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re^2: Slow find/replace hex in Perl win32
by rickyboone (Novice) on Sep 30, 2010 at 01:59 UTC

    Just to clarify, there are no line-endings in this file (at least not in ASCII).

    I do think I found the problem, though. I didn't realize that perl was trying to find the end of line. Searching for that, I found "slurp mode", -0777 (undefined record separator). And using a few other recommendations, I also reduced the s///sgx options to just s///g, since my example didn't seem to need s and x. It seems to allow the file to be processed in a matter of seconds, and compares properly to other files processed "manually" with hex editors.

    perl -0777 -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g" input > output
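For reference, a script-form sketch of what the slurp-mode one-liner does, under a few assumptions: `replace_slurped` is a hypothetical name, and the `:raw` layers are an addition not present in the one-liner. On Windows, Perl's default text-mode handles may translate line endings, which could alter binary data, so opening both files as raw bytes seems the safer choice here:

```perl
use strict;
use warnings;

# Hypothetical helper: read the whole file as raw bytes, replace every
# occurrence of the hex string, write the result, and return the
# number of replacements made.
sub replace_slurped {
    my ($in_path, $out_path) = @_;

    open my $in, '<:raw', $in_path or die "open $in_path: $!";
    my $data = do { local $/; <$in> };    # undefined $/ means slurp mode
    close $in;

    my $replaced = ($data =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g) || 0;

    open my $out, '>:raw', $out_path or die "open $out_path: $!";
    print {$out} $data;
    close $out or die "close $out_path: $!";

    return $replaced;
}

# Usage: perl script.pl input output
if (@ARGV == 2) {
    my $count = replace_slurped(@ARGV);
    print "replaced $count occurrence(s)\n";
}
```

Like the one-liner, this holds the entire file in memory at once.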

    I'm waiting on the availability of another file to test another hex string against, but it won't be available until Oct 1. I think the issue is resolved, but I'd like to wait until then to be sure, unless anyone else has any recommendations or considerations I should be aware of.

      Okay, well, I think the code is doing what I want it to do; however, I've run into a new problem: "Out of memory!" errors. The file is greater than 2GB, which is more than the available memory space for applications in 32-bit Windows. I'm going to try booting the server with /3GB or /PAE to work around the issue.
        "I'm going to try booting the server with /3GB or /PAE to work around the issue."

        If that works, it'll only be a matter of time before the file grows bigger than memory again.

        Did you try the two-pass solution? It's a tad slower, but it'll never run out of memory. As posted, using a 1MB buffer, it can handle files up to 1024GB.

        And if 1 Terabyte becomes limiting, increasing the buffer size to 2MB means it can handle 4 TB. A 4MB buffer takes you to 16TB; and so on.

        You can even avoid the need to make two (disk) passes. Simply pipe the output of the first pass to the input of the second:

        perl -e"BEGIN{$/=\(1024**2) }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" infile | perl -e"BEGIN{$/=\(1024**2-3)}" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" >outfile2

        It still makes two passes of the data, but only reads and writes the disk once for each block.
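An alternative to the two piped passes, sketched here under stated assumptions and not taken from the thread: read fixed-size records in a single pass and carry the last `length(pattern) - 1` bytes of each record over to the next, so a match that straddles a record boundary is still seen whole. `replace_stream` is a hypothetical name. The sketch assumes that replaced text cannot combine with the bytes that follow it to form a new match, which holds for the hex strings above:

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, replacing $find with $repl,
# reading $bufsize bytes at a time and never holding much more than
# one record in memory.
sub replace_stream {
    my ($in, $out, $find, $repl, $bufsize) = @_;
    binmode $in;
    binmode $out;

    my $keep  = length($find) - 1;   # bytes that might begin a straddling match
    my $carry = '';
    local $/ = \$bufsize;            # fixed-size record reads

    while (defined(my $chunk = <$in>)) {
        my $data = $carry . $chunk;
        $data =~ s/\Q$find\E/$repl/g;
        if (length($data) > $keep) {
            # Hold back the tail; it may be the start of a match.
            $carry = $keep ? substr($data, -$keep) : '';
            print {$out} substr($data, 0, length($data) - $keep);
        }
        else {
            $carry = $data;
        }
    }
    print {$out} $carry;             # flush whatever is left at EOF
    return;
}

# Demo on an in-memory copy of a 130-character test string, using a
# 10-byte record size so that every occurrence straddles a boundary.
my $input = '1234567890' x 13;
open my $in, '<', \$input or die "cannot open in-memory input: $!";
my $result = '';
open my $out, '>', \$result or die "cannot open in-memory output: $!";
replace_stream($in, $out, '8901', 'abcd', 10);
close $out;
print "$result\n";
```

For real files, the two handles would be opened on the input and output paths instead of on in-memory strings.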

        To demonstrate that it works, given the input file fred:

        c:\test>type fred
        1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

        Using one pass, with a search term that straddles the buffer boundaries, no changes are made:

        c:\test>perl -e"BEGIN{$/=\10}" -pe" s[8901][abcd]" fred > joe
        c:\test>type joe
        1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

        But after two piped passes:

        c:\test>perl -e"BEGIN{$/=\10}" -pe" s[8901][abcd]g" fred | perl -e"BEGIN{$/=\7}" -pe"s[8901][abcd]g" >joe

        The changes are made:

        c:\test>type joe
        1234567abcd234567abcd2345678901234567abcd2345678901234567abcd2345678901234567abcd234567abcd2345678901234567abcd2345678901234567890
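Some occurrences of 8901 are still present in joe. A sketch of why, assuming a same-length replacement so that offsets do not shift between passes: an occurrence survives only if it straddles a record boundary in both passes, and with record sizes as small as 10 and 7 the two sets of boundaries fall close together fairly often. `straddles` and `missed_offsets` are hypothetical helper names:

```perl
use strict;
use warnings;

# True if a match of length $len starting at offset $start crosses a
# record boundary when the data is read in records of $size bytes.
sub straddles {
    my ($start, $len, $size) = @_;
    return int($start / $size) != int(($start + $len - 1) / $size);
}

# Offsets of occurrences of $find that straddle a boundary in both
# passes, and so are replaced by neither.
sub missed_offsets {
    my ($text, $find, $size1, $size2) = @_;
    my $len = length $find;
    my @missed;
    my $pos = 0;
    while (($pos = index($text, $find, $pos)) >= 0) {
        push @missed, $pos
            if straddles($pos, $len, $size1) && straddles($pos, $len, $size2);
        $pos += $len;
    }
    return @missed;
}

my $fred   = '1234567890' x 13;    # same content as the file fred above
my @missed = missed_offsets($fred, '8901', 10, 7);
print "missed at offsets: @missed\n";
# prints: missed at offsets: 27 47 67 97 117
```

Those five offsets are where 8901 still appears in the output above. The same check can be run with larger record sizes to see how far apart such coincidences are.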

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
