Re: mem usage
by ikegami (Patriarch) on May 26, 2010 at 17:34 UTC
Maybe fragmentation from repeatedly extending the stack is using up a lot of memory. Try avoiding putting the entire file on the stack at once, and pre-extend its destination array:
my @lines;
$#lines = 12_000_000;
@lines = ();
push @lines, $_ while <IN>;
Note that shuffle already exists in List::Util.
Update: Fixed bug.
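For reference, when the lines do fit in memory, a minimal shuffle with the core List::Util can be a one-liner (filenames here are placeholders; this slurps the whole file, which is exactly the memory cost being discussed):

```shell
# Minimal in-memory shuffle with the core List::Util module.
# "input.txt"/"shuffled.txt" are placeholder names; the file must fit in RAM.
perl -MList::Util=shuffle -e 'print shuffle <>' input.txt > shuffled.txt
```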
Thanks. It just barely squeezed through -- interestingly, the kernel killed Firefox during the run.
Whoops, what I posted was very buggy. It should be:
my @lines;
$#lines = 12_000_000;
@lines = (); <------ was missing
push @lines, $_ while <IN>;
$ perl -e'print "abcdef\n" for 1..11_000_000' |
perl -E'
$#a=11_000_000;
@a=();
push @a, $_ while <>;
say int(`ps --no-heading -o vsz $$`/1000)
'
480
480MB for a 77MB file with 11 million lines.
Re: mem usage
by almut (Canon) on May 26, 2010 at 17:47 UTC
Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold.
A quick check shows that reading 1_000_000 empty lines into an array (the way you do it) leads to a process size of 192MB on my machine (with Perl 5.10.1). So if you have 11_000_000 non-empty ones...
Update: using ikegami's suggestion, the memory usage for the same 1_000_000 empty lines reduces to 78MB without pre-extending, and (interestingly) 86MB with pre-extending the array ($#lines = 1_000_000). Update2: and 93MB with just $#lines = 100_000 (??)
Update3: with @lines = (), pre-extending the array no longer increases the memory requirements. But it doesn't help to reduce it either (presumably, it just improves speed).
There was a bug causing the array to have twice the desired number of elements. I don't know if you were affected by it.
"Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold."
Will read. I presumed they used a bit more space -- say, a few pointers per element for an array -- but that is really quite a lot.
I've also read somewhere that perl allocates in "power of 2" chunks. So when you add a 2nd element, you get allocated space for 4, then at 5 you get 8, at 9 you get 16 and so on. Is this true?
Array overhead
+ ( 999*2+4 ) * size of a pointer
+ 1000 * size of the scalars
Re: mem usage
by crag (Novice) on May 26, 2010 at 22:13 UTC
If you don't mind reading the source file twice, and a whole lot of random seeks the second time through, you can generate a list of line offsets, shuffle those, and then loop through a seek, read, write cycle:
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);
my @offsets;
print STDERR "Scanning...";
open(IN, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
do { push @offsets, tell(IN) } while (<IN>);
close(IN);
pop @offsets;    # the final tell() is the EOF position, not a line start
print STDERR "Done. (", scalar @offsets, " lines)\nScrambling...";
@offsets = shuffle(@offsets);
print STDERR "Done.\nWriting scrambled...";
open(IN, '<', $ARGV[0]) or die "Can't reopen $ARGV[0]: $!";
for (@offsets) {
    seek(IN, $_, 0);
    my $line = <IN>;
    $line .= $/ if $line !~ qr{$/};    # ensure the last line ends with a newline
    print $line;
}
close(IN);
print STDERR "Done.\n";
This scrambled an 800k-line/60MB file I had handy in eight seconds, with minimal memory usage in process space. I assume the kernel kept the entire file cached in memory.
Re: mem usage
by jwkrahn (Abbot) on May 26, 2010 at 18:07 UTC
Yeah, that's Fisher-Yates (Knuth's shuffle). It works in place and I've never had any problems with it,* so I dunno why you think it's "not very good". Kind of a wacky place to put i--, but that's pretty trivial, style-wise.
*(e.g., I just shuffled those 12 million lines with no errors; since I am feeding that into something else, I will notice if something gets duplicated etc.)
You can trim your shuffle a little by omitting the line: next if $i==$j;.
Swapping an item with itself doesn't affect the algorithm's fairness, and doing the occasional self-swap costs less than testing n times to avoid it once.
And you save a little more by avoiding the list assignment:
my $tmp = $array[ $i ];
$array[ $i ] = $array[ $j ];
$array[ $j ] = $tmp;
Doesn't look as nice though.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: mem usage
by kejohm (Hermit) on May 27, 2010 at 02:58 UTC
#!perl
use strict;
use warnings;
use Tie::File;
my $filename = shift;
die 'No file' unless $filename;
my $tie = tie my @file, 'Tie::File', $filename, memory => 20_000_000;
die "Couldn't tie $filename: $!" unless $tie;
$tie->defer();
shuffle(\@file);
$tie->flush();
undef $tie;
untie @file;
sub shuffle {
my $deck = shift; # $deck is a reference to an array
my $i = @$deck;
while (--$i) {
my $j = int rand ($i+1);
@$deck[$i,$j] = @$deck[$j,$i];
}
}
__END__
This worked well enough for me using a simple text file containing 1 million lines of random numbers.