The first problem

You are using File::Slurp wrongly. (For a file of this size!)
When you call my $s = read_file( $filename );, it first reads the entire 500MB into an internal scalar and then returns it to you, whereupon you assign it to a scalar in your own context.
You now have 2 copies of the data in memory: 1GB! And you haven't done anything with it yet.
You then run your regex on it, which takes around half a second on my machine and causes no memory growth.
Then you pass your copy of the data into write_file(), which means it gets copied onto the stack.
You now have 3 copies of the data in memory: 1.5GB!
And internally to write_file(), it gets copied again. You now have 4 copies of the data in memory: 2GB!
And if you are on a 32-bit perl, you've blown your heap and get the eponymous "Out of memory!".
And if you are on a 64-bit perl with enough memory, it then spends an inordinate amount of time(*) futzing with the copied data, "fixing up" that which isn't broken. Dog knows why it does this. It doesn't need to. Just typical O'Woe over-engineering!
(*) 2 hours+ (before I ^C'd it) to write 500MB of data to disk is ridiculous!(**)
(**) For a job that can be completed in 8 seconds simply, without trickery, 2 hours is as close to 'Never completes' as makes no difference.

How to use File::Slurp correctly. (For a file of this size!)

File::Slurp goes to (extraordinary) lengths in an attempt to "be efficient". (It fails miserably, but I'll get back to that!).
When reading the file, you can ask the module to return a reference to the data, which avoids the copy made by the return.
And when writing the file, you can pass that reference back. The module will (for no good reason) still copy the data internally before writing it out, but you do save another copy.
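As a rough sketch of the reference-passing approach just described (this is not the code from the original spoiler, which is not available here): read_file's scalar_ref option and write_file's acceptance of a scalar reference are documented File::Slurp features, while the sub name detab_with_slurp and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw( read_file write_file );

# Hypothetical helper: replace tabs with spaces while avoiding
# the caller-side copies of the file contents.
sub detab_with_slurp {
    my ( $in, $out ) = @_;

    # scalar_ref => 1 asks read_file to return a reference to the data
    # rather than returning the data itself by value.
    my $ref = read_file( $in, scalar_ref => 1 );

    # Edit the data in place through the reference.
    $$ref =~ tr/\t/ /;

    # write_file accepts a scalar reference as its data argument.
    write_file( $out, $ref );
    return;
}

detab_with_slurp( @ARGV[ 0, 1 ] ) if @ARGV == 2;
```

This only addresses the memory use; it does nothing about the time the module spends writing the data out.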

This way, you have only one redundant copy of the data in memory, for a saving of 1GB. Your process won't run out of memory.
However, it will still take 2 hours+ (I didn't wait any longer) to actually write 500MB to disk!

Your second mistake was using File::Slurp!

How about we try the same thing without the assistance of any overhyped, over-engineered, overblown modules?

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

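print STDERR time;    # start
my $s;
# Slurp the whole file: with @ARGV and $/ localised, <> reads $ARGV[0] in one go.
do{ local( @ARGV, $/ ) = $ARGV[0]; $s = <>; };
print STDERR time;    # file read

$s =~ tr[\t][ ];      # replace every tab with a space, in place

print STDERR time;    # tabs replaced

open O, '>', $ARGV[1] or die $!;
# Localise $\ so that -l does not append a newline to the data.
{ local $\; print( O $s ); }
close O;

print STDERR time;    # file written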

__END__
[ 0:57:20.47] C:\test>984648-3 500MB.csv junk.txt
1343779056.03211
1343779058.22142
1343779058.70098
1343779061.99852

[ 0:57:42.05] C:\test>

About 2 seconds to read it; 1/2 second to process it; a little over 3 seconds to write it; and only 510MB memory used in the process!

That's efficient!
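If even a single copy of the file is too much memory (on a 32-bit perl, say), the same job can be done in fixed-size blocks, because tr works on single characters and never needs to see across a block boundary. This is an alternative sketch, not the script timed above; the sub name detab_in_blocks and the 1MB default block size are arbitrary assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, turning tabs into spaces,
# holding at most $blk_size bytes of the file in memory at a time.
sub detab_in_blocks {
    my ( $in, $out, $blk_size ) = @_;
    $blk_size ||= 1024 * 1024;    # 1MB blocks; an arbitrary choice

    # :raw passes bytes through unchanged (no newline translation).
    open my $ifh, '<:raw', $in  or die "Cannot read '$in': $!";
    open my $ofh, '>:raw', $out or die "Cannot write '$out': $!";

    my $buf;
    while (1) {
        my $n = read( $ifh, $buf, $blk_size );
        die "Read from '$in' failed: $!" unless defined $n;
        last if $n == 0;          # end of file

        $buf =~ tr/\t/ /;         # same in-place replacement, per block
        print {$ofh} $buf or die "Write to '$out' failed: $!";
    }

    close $ofh or die "Cannot close '$out': $!";
    close $ifh;
    return;
}

detab_in_blocks( @ARGV[ 0, 1 ] ) if @ARGV == 2;
```

Run it as: perl detab.pl 500MB.csv junk.txt (the script name is made up). No timings are claimed for this variant.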

Bottom line: When you consider using a module for something -- LOOK INSIDE! If it looks too complicated for what it does, it probably is.

In reply to Re: Windows 7 Remove Tabs Out of Memory by BrowserUk
in thread Windows 7 Remove Tabs Out of Memory by tallums

Just another Perl shrine
	PerlMonks
