PerlMonks

Windows 7 Remove Tabs Out of Memory

by tallums (Initiate)
on Jul 31, 2012 at 19:03 UTC ( [id://984648] )

tallums has asked for the wisdom of the Perl Monks concerning the following question:

Hello - I'm using a Windows 7 laptop with 4GB of RAM, and I have a 500MB fixed-width text file from which I need to remove all tabs. I found this script online; it does what I want on smaller (1KB) files, but it gives me an "Out of Memory" error on the larger (500MB) file.

During testing, I noticed that if I comment out the write_file line, the script finishes without the "Out of Memory" error.

I don't know anything about Perl and I'm sure this is an easy fix for you guys. Your assistance/direction is much appreciated.

Below is the script I'm using.

use strict;
use warnings;
use File::Slurp;

my $s = read_file('large_file.txt');
$s =~ s/\t/ /g;
write_file('test.txt', $s);

__END__

Thanks, Tim

Replies are listed 'Best First'.
Re: Windows 7 Remove Tabs Out of Memory
by BrowserUk (Patriarch) on Jul 31, 2012 at 21:35 UTC
    1. The first problem

      You are using File::Slurp wrongly. (For a file of this size!)

      When you call my $s = read_file( $filename );, it first reads the entire 500MB into an internal scalar, and then it returns that scalar to you, where you assign it to a scalar in your own code.

      You now have 2 copies of the data in memory: 1GB! And you haven't done anything with it yet.

      You then run your regex on it, which takes around half a second on my machine and causes no memory growth.

      Then you pass your copy of the data into write_file(), which means it gets copied onto the stack.

      You now have 3 copies of the data in memory: 1.5GB!

      And internally to write_file(), it gets copied again. You now have 4 copies of the data in memory: 2GB!

      And if you are on a 32-bit Perl, you've blown your heap and get the eponymous "Out of memory!".
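
      If you are not sure which kind of perl you have, the pointer size recorded by the standard Config module will tell you -- a minimal check:

      use Config;
      # 4-byte pointers mean a 32-bit perl, and so only a few GB of address space at most.
      print "This perl uses $Config{ptrsize}-byte pointers, i.e. it is ",
            $Config{ptrsize} == 4 ? '32-bit' : '64-bit', "\n";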

      And if you are on a 64-bit perl with enough memory, it then spends an inordinate amount of time(*) futzing with the copied data, "fixing up" that which isn't broken. Dog knows why it does this. It doesn't need to. Just typical O'Woe over-engineering!

      2 hours+(*) (before I ^C'd it) to write 500MB of data to disk is ridiculous!

      (*For a job that can be completed in 8 seconds simply, without trickery, 2 hours is as close to 'Never completes' as makes no difference.)

      #! perl -slw
      use strict;
      use File::Slurp;
      use Time::HiRes qw[ time ];

      print STDERR time;
      my $s = read_file( $ARGV[0] );
      print STDERR time;
      $s =~ s/\t/ /g;
      print STDERR time;
      write_file( $ARGV[1], $s );
      print STDERR time;
      __END__

      [21:35:14.40] C:\test>984648-1 500MB.csv junk.txt
      1343767102.78642
      1343767106.4356
      1343767106.89558
      Terminating on signal SIGINT(2)
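
      The copying described above is not specific to File::Slurp; it is what happens any time a large string is returned or unpacked by value. A minimal sketch of the difference between taking the string itself and taking a reference (the sub names by_value and by_reference are made up for illustration):

      use strict;
      use warnings;

      sub by_value {
          my( $data ) = @_;     # a full copy of the string is made here
          return length $data;
      }

      sub by_reference {
          my( $ref ) = @_;      # only the reference (a few bytes) is copied
          return length $$ref;
      }

      my $big = 'x' x ( 10 * 1024 * 1024 );    # 10MB stand-in for the 500MB file

      print by_value( $big ),      "\n";       # peak memory ~ 2 x the string
      print by_reference( \$big ), "\n";       # peak memory ~ 1 x the string

      Which is why asking the module for a reference, as below, helps.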

      How to use File::Slurp correctly. (For a file of this size!).

      File::Slurp goes to (extraordinary) lengths in an attempt to "be efficient". (It fails miserably, but I'll get back to that!).

      When reading the file, you can avoid one of those copies by requesting that the module return a reference to the data, thus avoiding the copy made by the return.

      And when writing the file, you can pass that reference back. The module will (for no good reason) still copy the data internally before writing it out, but you do save another copy:

      #! perl -slw
      use strict;
      use File::Slurp;
      use Time::HiRes qw[ time ];

      print STDERR time;
      my $s = read_file( $ARGV[0], scalar_ref => 1 );
      print STDERR time;
      $$s =~ s/\t/ /g;
      print STDERR time;
      write_file( $ARGV[1], $s );
      print STDERR time;
      __END__

      [22:14:07.81] C:\test>984648-2 500MB.csv junk.txt
      1343769390.96321
      1343769394.24913
      1343769394.70982
      Terminating on signal SIGINT(2)

      This way, you only have one redundant copy of the data in memory, for a saving of 1GB. Your process won't run out of memory.

      However, it will still take 2 hours+ (I didn't wait any longer) to actually write 500MB to disk!

    2. Your second mistake was using File::Slurp!

      How about we try the same thing without the assistance of any overhyped, over-engineered, overblown modules.

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      print STDERR time;
      my $s;
      do {
          local( @ARGV, $/ ) = $ARGV[0];
          $s = <>;
      };
      print STDERR time;
      $s =~ tr[\t][ ];
      print STDERR time;
      open O, '>', $ARGV[1] or die $!;
      {
          local $\;
          print( O $s );
      }
      close O;
      print STDERR time;
      __END__

      [ 0:57:20.47] C:\test>984648-3 500MB.csv junk.txt
      1343779056.03211
      1343779058.22142
      1343779058.70098
      1343779061.99852
      [ 0:57:42.05] C:\test>

      2 seconds to read it; 1/2 second to process it; 4 seconds to write it; and only 510MB memory used in the process!

      That's efficient!

    Bottom line: When you consider using a module for something -- LOOK INSIDE! If it looks too complicated for what it does, it probably is.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      2 seconds to read it; 1/2 second to process it; 4 seconds to write it; and only 510MB memory used in the process!

      That's efficient!

      Not really, from a memory standpoint. You could do much better with a standard loop that reads into a small buffer and writes each chunk to the output file.

      (Not to mention that you seem to have really fast disks (SSDs?). Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

      open my $in,  '<', 'input.txt'  or die;
      open my $out, '>', 'output.txt' or die;

      my $buf;
      while ( read $in, $buf, 4096 ) {
          $buf =~ tr/\t/ /;
          print $out $buf;
      }

      close $_ for ( $in, $out );

      But this has a large potential to slow the loop down to around 10 MB/s, because of the seek behaviour of rotating media and OS read-ahead and flushing algorithms that never quite give that good performance [1]. Still a helluva lot better than the OS swapping you out because it can't fit the 500 MB into memory.

      [1] I have never seen an OS successfully avoid doing reading and writing in parallel (= sub-optimal) for cat largefile > otherfile
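
      A middle ground between slurping the whole file and 4KB chunks is a much larger block size; a minimal sketch along those lines (the 64MB block size is an arbitrary choice, and the file names are taken from the OP's script):

      use strict;
      use warnings;

      # Large fixed-size blocks keep memory bounded at roughly the block
      # size while cutting the number of read/write pairs (and the seeks
      # between them) to a handful for a 500MB file.
      my $blocksize = 64 * 1024 * 1024;

      open my $in,  '<:raw', 'large_file.txt' or die $!;
      open my $out, '>:raw', 'test.txt'       or die $!;

      my $buf;
      while ( read $in, $buf, $blocksize ) {
          $buf =~ tr/\t/ /;     # tab -> space is byte-for-byte, so block
          print {$out} $buf;    # boundaries cannot split a replacement
      }

      close $in;
      close $out;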

        Not really, from a memory standpoint.

        See below...

        You could do much better with a standard loop that reads into a small buffer and writes each chunk to the output file.

        Actually no. That forces the OS to keep the disk head moving back and forth between source(*) and destination.

        One read & one write will always beat 128,000 iddy biddy reads and 128,000 iddy biddy writes, with a seek across the disk between each, hands down. (Not to mention 128,000 invocations of s/// or tr/// instead of one.)

        (Not to mention that you seem to have really fast disks (SSDs?).

        Not yet :) I'm waiting for a PCIe flash card that presents itself as additional (slow) RAM at a reasonable price.

        Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

        As moritz points out: file system caching.

        The timings posted were not from first runs, but the same caching benefited all three versions.

        Not really, from a memory standpoint... Still a helluva lot better than the OS swapping you out because it can't fit the 500 MB into memory.

        Buy more!

        My last memory purchase:

        1 "Komputerbay 8GB (2 X 4GB) DDR3 DIMM (240 pin) 1600Mhz PC3 12800 8 G +B KIT (9-9-9-25) with Heatspreader for extra Cooling" £30.00 In stock Sold by: KOMPBAY

        (*Even if the input is cached from a previous read of the file, writing to disk before the entire input has been read is quite likely to cause some or all of the input file to be discarded before it has been read, to accommodate the output.)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

        (Not to mention that you seem to have really fast disks (SSDs?). Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

        Or maybe the OS simply caches the reads and delays the writes, leading to faster-than-disk performance.

Re: Windows 7 Remove Tabs Out of Memory
by toolic (Bishop) on Jul 31, 2012 at 19:30 UTC
    use warnings;
    use strict;
    use autodie;

    open my $fhi, '<', 'large_file.txt';
    open my $fho, '>', 'test.txt';

    while (<$fhi>) {
        s/\t/ /g;
        print $fho $_;
    }
Re: Windows 7 Remove Tabs Out of Memory
by Rudolf (Pilgrim) on Jul 31, 2012 at 19:33 UTC

    Tim, I suggest looping through the file yourself in any case; it's fairly simple, and using tr/// instead of s/// is an improvement.

    open( OLD, '<', 'large_file.txt' ) or die "$!";   # open old file
    my @old_file_contents = <OLD>;
    close( OLD );

    open( NEW, '>', 'test.txt' ) or die "$!";         # open new file
    foreach my $line ( @old_file_contents ) {
        $line =~ tr/\t//d;    # delete tabs
        print NEW $line;      # print line to new file
    }
    close( NEW );

    UPDATE: I see you're replacing them with a space, so maybe just try

    $s =~ tr/\t/ /;

      Thanks Rudolf! Thanks everybody!

      Rudolf's suggestion works for me :)

      Tim

Re: Windows 7 Remove Tabs Out of Memory
by bulk88 (Priest) on Jul 31, 2012 at 19:14 UTC
    You put a 500 MB file into RAM. The regexp's memory use is going to be some multiple of the file size. Maybe there is some way to stop the regexp from "backtracking" to lower the peak RAM required by the regexp engine. This is the more complicated fix.

    The other choice is to process the file in fixed-size blocks (512 KB or so), or with line-buffered IO ("<FILE>") in a loop.

    Choice 3 is to try the tr operator, which is NOT a regexp.
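
    A minimal sketch combining choices 2 and 3 -- line-buffered IO plus the tr operator (the file names are taken from the OP's script; this is essentially toolic's loop above with s/// swapped for tr///):

    use strict;
    use warnings;

    # Read line by line, so only one line is ever in memory, and use tr,
    # which is a plain character transliteration rather than a regexp.
    open my $in,  '<', 'large_file.txt' or die $!;
    open my $out, '>', 'test.txt'       or die $!;

    while ( my $line = <$in> ) {
        $line =~ tr/\t/ /;    # replace each tab with a single space
        print {$out} $line;
    }

    close $in;
    close $out;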
