Re: Windows 7 Remove Tabs Out of Memory

by BrowserUk (Pope)
on Jul 31, 2012 at 21:35 UTC


in reply to Windows 7 Remove Tabs Out of Memory

  1. The first problem

    You are using File::Slurp wrongly. (For a file of this size!)
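    For concreteness, the pattern being critiqued presumably looks something like this (a reconstruction -- the OP's exact code isn't quoted here, and $filename/$outfile are stand-ins):

        use File::Slurp;

        my $s = read_file( $filename );   # slurps all 500MB internally, then returns a copy
        $s =~ tr[\t][ ];                  # the tab removal itself is cheap and in-place
        write_file( $outfile, $s );       # $s gets copied again on the way in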

    When you call my $s = read_file( $filename );, the module first reads the entire 500MB into an internal scalar, and then returns it to you.

    You then assign that returned copy to a scalar in your own context.

    You now have 2 copies of the data in memory: 1GB! And you haven't done anything with it yet.

    You then run your regex on it, which takes around half a second on my machine and causes no memory growth.

    Then you pass your copy of the data into write_file(), which means it gets copied onto the stack.

    You now have 3 copies of the data in memory: 1.5GB!

    And internally to write_file(), it gets copied again. You now have 4 copies of the data in memory: 2GB!

    And if you are on a 32-bit Perl, you've blown your heap and get the eponymous "Out of memory!".

    And if you are on a 64-bit perl with enough memory, it then spends an inordinate amount of time(*) futzing with the copied data, "fixing up" that which isn't broken. Dog knows why it does this. It doesn't need to. Just typical O'Woe over-engineering!

    25 minutes+ (make that 2 hours+!)(**) (before I ^C'd it), to write 500MB of data to disk, is ridiculous!

    (**For a job that can be completed in 8 seconds simply, without trickery, 2 hours is as close to 'Never completes' as makes no difference.)

    How to use File::Slurp correctly. (For a file of this size!).

    File::Slurp goes to (extraordinary) lengths in an attempt to "be efficient". (It fails miserably, but I'll get back to that!)

    When reading the file, you can avoid one copy by asking the module to return a reference to the data, which skips the copy made by the return.

    And when writing the file, you can pass that reference back. The module will (for no good reason) still copy the data internally before writing it out, but you do save another copy:
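    A sketch of that usage via File::Slurp's documented scalar_ref option ($filename and $outfile are stand-ins, as above):

        use File::Slurp;

        my $sref = read_file( $filename, scalar_ref => 1 );  # get a reference, not a copy
        $$sref =~ tr[\t][ ];                                 # modify the data through the reference
        write_file( $outfile, $sref );                       # hand the reference straight back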

    This way, you only have one redundant copy of the data in memory, for a saving of 1GB. Your process won't run out of memory.

    However, it will still take 25 minutes+ (make that 2 hours+! I didn't wait any longer) to actually write 500MB to disk!

  2. Your second mistake was using File::Slurp!

    How about we try the same thing without the assistance of any overhyped, over-engineered, overblown modules:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    print STDERR time;

    ## Slurp the whole file with one read: localising $/ undefs the
    ## input record separator, so <> returns the entire file.
    my $s;
    do {
        local( @ARGV, $/ ) = $ARGV[0];
        $s = <>;
    };

    print STDERR time;

    $s =~ tr[\t][ ];    ## convert every tab to a space, in place

    print STDERR time;

    ## One write; localising $\ undoes the "\n" appended by -l.
    open O, '>', $ARGV[1] or die $!;
    {
        local $\;
        print( O $s );
    }
    close O;

    print STDERR time;
    __END__

    [ 0:57:20.47] C:\test>984648-3 500MB.csv junk.txt
    1343779056.03211
    1343779058.22142
    1343779058.70098
    1343779061.99852

    [ 0:57:42.05] C:\test>

    2 seconds to read it; 1/2 second to process it; 4 seconds to write it; and only 510MB memory used in the process!

    That's efficient!

Bottom line: when you consider using a module for something -- LOOK INSIDE! If it looks too complicated for what it does, it probably is.
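One quick, if crude, way to look inside an installed module (assuming a standard Perl install with perldoc available):

    perldoc -m File::Slurp    # page through the module's source code
    perldoc -l File::Slurp    # just print the path to the installed .pm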


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


Re^2: Windows 7 Remove Tabs Out of Memory
by Anonymous Monk on Aug 01, 2012 at 08:45 UTC

    2 seconds to read it; 1/2 second to process it; 4 seconds to write it; and only 510MB memory used in the process!

    That's efficient!

    Not really, from a memory standpoint. You could do much better with a standard loop that reads into a small buffer and writes each chunk to the output file.

    (Not to mention that you seem to have really fast disks (SSDs?). Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

    open my $in,  '<', 'input.txt'  or die;
    open my $out, '>', 'output.txt' or die;

    my $buf;
    while (read $in, $buf, 4096) {
        $buf =~ tr/\t/ /;    ## 1:1 translation, so chunk boundaries are safe
        print $out $buf;
    }

    close $_ for ($in, $out);

    But this could easily slow the loop to around 10 MB/s, because of the seek behaviour of rotating media, and because OS read-ahead and flushing heuristics never quite deliver that good a performance [1]. Still a helluva lot better than the OS swapping you out because it can't fit the 500 MB into memory.

    [1] I have never seen an OS successfully avoid doing reading and writing in parallel (= sub-optimal) for cat largefile > otherfile

      (Not to mention that you seem to have really fast disks (SSDs?). Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

      Or maybe the OS simply caches the reads and delays the writes, leading to faster-than-disk performance.

      Not really, from a memory standpoint.

      See below...

      You could do much better with a standard loop that reads to a small buffer and writes to the output file in a loop.

      Actually no. That forces the OS to keep the disk head moving back and forth between source(*) and destination.

      One read & one write will always beat 128,000 iddy biddy reads and 128,000 iddy biddy writes (500MB in 4096-byte chunks), with a seek across the disk between each, hands down. (Not to mention 128,000 invocations of s/// or tr/// instead of one.)

      (Not to mention that you seem to have really fast disks (SSDs?).

      Not yet :) I'm waiting for a PCIe flash card that presents itself as additional (slow) RAM at a reasonable price.

      Haven't met a HDD yet that could read faster than 150 MB/s or write faster than 100 MB/s.)

      As moritz points out: file system caching.

      The timings posted were not the first runs, but the same caching benefited all three versions.

      Not really, from a memory standpoint... Still a helluva lot better than the OS swapping you out because it can't fit the 500 MB into memory.

      Buy more!

      My last memory purchase:

      1 "Komputerbay 8GB (2 X 4GB) DDR3 DIMM (240 pin) 1600Mhz PC3 12800 8 G +B KIT (9-9-9-25) with Heatspreader for extra Cooling" 30.00 In stock Sold by: KOMPBAY

      (*Even if the input is cached from a previous read of the file, writing to disk before the entire input has been read is quite likely to cause some or all of the input file to be discarded before it has been read, to accommodate the output.)



        Actually no. That forces the OS to keep the disk head moving back and forth between source(*) and destination.

        It doesn't "force", actually. But as I stated in my post, I have yet to see an OS that could handle such a simple single-threaded situation intelligently enough. (That is, delay writing until file is closed or otherwise absolutely necessary.)

        BTW, there is something odd with your numbers. The Perl timestamps only give a delta of slightly less than six seconds (~83 MB/s average => reading 228 MB/s, writing 152 MB/s), but your console says almost 22 seconds (which would put the performance at around 23 MB/s).
