Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

mem usage

by halfcountplus (Hermit)
on May 26, 2010 at 17:19 UTC ( #841760=perlquestion: print w/ replies, xml ) Need Help??
halfcountplus has asked for the wisdom of the Perl Monks concerning the following question:

I use the following short perl script to shuffle the lines in a file (to produce lists of random strings for testing):
#!/usr/bin/perl -w use strict; open(IN, "<".$ARGV[0]); my @lines = <IN>; close(IN); shuffle(\@lines); print $_ foreach (@lines); sub shuffle { my $array = shift; my $i = $#{$array}+1; while ($i != -1) { $i--; my $j = int rand ($i+1); next if $i==$j; @$array[$i,$j]=@$array[$j,$i]; } }
When I run this on a 70mb file, ~11 million lines, this eats about 1.5 gigs of ram (all I have left) then hits "Out of Memory!" before it's finished reading the file.

I have swap disabled, I suppose that would allow me to run this. However, I'm curious -- why does it need so much memory to store 70mb of data? Is there anything I can do here? I tried presizing the array to 12 million, that did not make any difference (can I presize each element, eg, request a block of a given size?).

Comment on mem usage
Download Code
Re: mem usage
by ikegami (Pope) on May 26, 2010 at 17:34 UTC

    Maybe fragmentation from extending the stack is using up a lot of memory. Try to avoid putting the entire file on the stack, and pre-extending its destination:

    my @lines; $#lines = 12_000_000; @lines = (); push @lines, $_ while <IN>;

    Note that shuffle already exists in List::Util.

    Update: Fixed bug.

      Thanks. It just barely squeezed thru -- interestingly the kernel killed firefox during the run.
        Woops, what I posted is very buggy. Should be
        my @lines; $#lines = 12_000_000; @lines = (); <------ was missing push @lines, $_ while <IN>;
        $ perl -e'print "abcdef\n" for 1..11_000_000' | perl -E' $#a=11_000_000; @a=(); push @a, $_ while <>; say int(`ps --no-heading -o vsz $$`/1000) ' 480

        480MB for 77MB file with 11 million lines.

Re: mem usage
by almut (Canon) on May 26, 2010 at 17:47 UTC

    Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold.

    A quick check shows that reading 1_000_000 empty lines into an array (the way you do it) leads to a process size of 192MB on my machine (with Perl 5.10.1).  So if you have 11_000_000 non-empty ones...

    Update: using ikegami's suggestion, the memory usage for the same 1_000_000 empty lines reduces to 78MB without pre-extending, and (interestingly) 86MB with pre-extending the array ($#lines = 1_000_000).   Update2: and 93MB with just $#lines = 100_000 (??)

    Update3: with @lines = (), pre-extending the array no longer increases the memory requirements. But it doesn't help to reduce it either (presumably, it just improves speed).

      Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold.

      Will read. I presumed they used a bit more space, say a few pointers per element for an array -- but that is really really quite a lot.

      I've also read somewhere that perl allocates in "power of 2" chunks. So when you add a 2nd element, you get allocated space for 4, then at 5 you get 8, at 9 you get 16 and so on. Is this true?

        Close. When it grows, the array size doubles in size plus 4. An array with 1000 scalars in it could take up as much as

        Array overhead + ( 999*2+4 ) * size of a pointer + 1000 * size of the scalars
      There was a bug causing the array to have twice the desired number of elements. I don't know if you were affected by it.
Re: mem usage
by jwkrahn (Monsignor) on May 26, 2010 at 18:07 UTC

      Your shuffle algorithm is not very good.

      It looks like a Fisher-Yates shuffle, which is what List::Util uses. Or did you spot a bug?

        Sorry, but the way he implemented it threw me off.

      Yeah, that's Fisher-Yates (aka Knuth #12 or something). It works in place and I've never had any problems with it,* dunno why you think it's "not very good". Kind of a wacky place to put i-- but that's pretty trivial, style wise.

      *(eg, just shuffled those 12 millon lines, no errors, since I am feeding that into something else I will notice if something gets duplicated et. al.)

        You can trim your shuffle a little by omitting the line:         next if $i==$j;.

        Swapping an item with itself doesn't affect the algorithm's fairness, and doing it once costs less, than testing n times and avoiding it once.

        And you save a little more by avoiding the list assignment:

        my $tmp = $array[ $i ]; $array[ $i ] = $array[ $j ]; $array[ $j ] = $tmp;

        Doesn't look as nice though.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: mem usage
by crag (Novice) on May 26, 2010 at 22:13 UTC
    If you don't mind reading the source file twice and a whole lot of random seeks the second time through, you can generate a list of offsets, shuffle those and then loop through a seek, read, write cycle:
    #!/usr/bin/perl use strict; use warnings; use List::Util qw(shuffle); my @offsets; print STDERR "Scanning..."; open(IN, $ARGV[0]); do { push @offsets, tell(IN) } while (<IN>); close(IN); pop @offsets; print STDERR "Done. ($#offsets)\nScrambling..."; @offsets = shuffle(@offsets); print STDERR "Done.\nWriting scrambled..."; open(IN, $ARGV[0]); for (@offsets) { seek(IN, $_, 0); my $line = <IN>; $line .= $/ if $line !~ qr{$/}; print $line; } print STDERR "Done.\n";
    This scrambled an 800k line/60M file I had handy in eight seconds with minimal memory usage in process space. I assume the kernel kept the entire file cached in memory.
Re: mem usage
by kejohm (Hermit) on May 27, 2010 at 02:58 UTC

    You could try using Tie::File with a Fisher-Yates shuffle in place:

    #!perl use strict; use warnings; use Tie::File; my $filename = shift; die 'No file' unless $filename; my $tie = tie my @file, 'Tie::File', $filename, memory => 20_000_000; die "Couldn't tie $filename: $!" unless $tie; $tie->defer(); shuffle(\@file); $tie->flush(); undef $tie; untie @file; sub shuffle { my $deck = shift; # $deck is a reference to an array my $i = @$deck; while (--$i) { my $j = int rand ($i+1); @$deck[$i,$j] = @$deck[$j,$i]; } } __END__

    This worked well enough for me using a simple text file containing 1 million lines of random numbers.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://841760]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2014-09-23 03:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (210 votes), past polls