mem usage

halfcountplus has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: mem usage by ikegami (Patriarch) on May 26, 2010 at 17:34 UTC
Maybe fragmentation from extending the stack is using up a lot of memory. Try to avoid putting the entire file on the stack, and pre-extending its destination: `my @lines; $#lines = 12_000_000; @lines = (); push @lines, $_ while <IN>;` [download] Note that `shuffle` already exists in List::Util. Update: Fixed bug.	[reply] [d/l] [select]
Re^2: mem usage by halfcountplus (Hermit) on May 26, 2010 at 18:02 UTC
Thanks. It just barely squeezed thru -- interestingly the kernel killed firefox during the run.	[reply]
Re^3: mem usage by ikegami (Patriarch) on May 26, 2010 at 18:10 UTC
Woops, what I posted is very buggy. Should be `my @lines; $#lines = 12_000_000; @lines = (); <------ was missing push @lines, $_ while <IN>;` [download] $ perl -e'print "abcdef\n" for 1..11_000_000' \| perl -E' $#a=11_000_000; @a=(); push @a, $_ while <>; say int(`ps --no-heading -o vsz $$`/1000) ' 480 [download] 480MB for 77MB file with 11 million lines.	[reply] [d/l] [select]
Re^4: mem usage by halfcountplus (Hermit) on May 26, 2010 at 18:40 UTC
Re^5: mem usage by ikegami (Patriarch) on May 26, 2010 at 19:00 UTC
Some notes below your chosen depth have not been shown here
Re: mem usage by almut (Canon) on May 26, 2010 at 17:47 UTC
Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold. A quick check shows that reading 1_000_000 empty lines into an array (the way you do it) leads to a process size of 192MB on my machine (with Perl 5.10.1). So if you have 11_000_000 non-empty ones... Update: using ikegami's suggestion, the memory usage for the same 1_000_000 empty lines reduces to 78MB without pre-extending, and (interestingly) 86MB with pre-extending the array (`$#lines = 1_000_000`). Update2: and 93MB with just `$#lines = 100_000` (??) Update3: with @lines = (), pre-extending the array no longer increases the memory requirements. But it doesn't help to reduce it either (presumably, it just improves speed).	[reply] [d/l] [select]
Re^2: mem usage by ikegami (Patriarch) on May 26, 2010 at 18:12 UTC
There was a bug causing the array to have twice the desired number of elements. I don't know if you were affected by it.	[reply]
Re^2: mem usage by halfcountplus (Hermit) on May 26, 2010 at 18:05 UTC
Perl data structures (see PerlGuts Illustrated) need more memory than the mere user data they hold. Will read. I presumed they used a bit more space, say a few pointers per element for an array -- but that is really really quite a lot. I've also read somewhere that perl allocates in "power of 2" chunks. So when you add a 2nd element, you get allocated space for 4, then at 5 you get 8, at 9 you get 16 and so on. Is this true?	[reply]
Re^3: mem usage by ikegami (Patriarch) on May 26, 2010 at 18:37 UTC
Close. When it grows, the array size doubles in size plus 4. An array with 1000 scalars in it could take up as much as `Array overhead + ( 9992+4 ) size of a pointer + 1000 * size of the scalars` [download]	[reply] [d/l]
Re: mem usage by crag (Novice) on May 26, 2010 at 22:13 UTC
If you don't mind reading the source file twice and a whole lot of random seeks the second time through, you can generate a list of offsets, shuffle those and then loop through a seek, read, write cycle: `#!/usr/bin/perl use strict; use warnings; use List::Util qw(shuffle); my @offsets; print STDERR "Scanning..."; open(IN, $ARGV[0]); do { push @offsets, tell(IN) } while (<IN>); close(IN); pop @offsets; print STDERR "Done. ($#offsets)\nScrambling..."; @offsets = shuffle(@offsets); print STDERR "Done.\nWriting scrambled..."; open(IN, $ARGV[0]); for (@offsets) { seek(IN, $_, 0); my $line = <IN>; $line .= $/ if $line !~ qr{$/}; print $line; } print STDERR "Done.\n";` [download] This scrambled an 800k line/60M file I had handy in eight seconds with minimal memory usage in process space. I assume the kernel kept the entire file cached in memory.	[reply] [d/l]
Re: mem usage by jwkrahn (Abbot) on May 26, 2010 at 18:07 UTC
Your shuffle algorithm is not very good. Better to use the one from the FAQ: How do I shuffle an array randomly? or use the shuffle function from the List::Util module.	[reply]
Re^2: mem usage by ikegami (Patriarch) on May 26, 2010 at 18:30 UTC
Your shuffle algorithm is not very good. It looks like a Fisher-Yates shuffle, which is what List::Util uses. Or did you spot a bug?	[reply]
Re^3: mem usage by jwkrahn (Abbot) on May 26, 2010 at 19:49 UTC
Sorry, but the way he implemented it threw me off.	[reply]
Re^2: mem usage by halfcountplus (Hermit) on May 26, 2010 at 18:43 UTC
Yeah, that's Fisher-Yates (aka Knuth #12 or something). It works in place and I've never had any problems with it,* dunno why you think it's "not very good". Kind of a wacky place to put i-- but that's pretty trivial, style wise. *(eg, just shuffled those 12 millon lines, no errors, since I am feeding that into something else I will notice if something gets duplicated et. al.)	[reply]
Re^3: mem usage by BrowserUk (Patriarch) on May 26, 2010 at 19:17 UTC
You can trim your shuffle a little by omitting the line: `next if $i==$j;`. Swapping an item with itself doesn't affect the algorithm's fairness, and doing it once costs less, than testing n times and avoiding it once. And you save a little more by avoiding the list assignment: `my $tmp = $array[ $i ]; $array[ $i ] = $array[ $j ]; $array[ $j ] = $tmp;` [download] Doesn't look as nice though. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l] [select]
Re^4: mem usage by ikegami (Patriarch) on May 26, 2010 at 19:46 UTC
Re^4: mem usage by halfcountplus (Hermit) on May 26, 2010 at 19:36 UTC
Re: mem usage by kejohm (Hermit) on May 27, 2010 at 02:58 UTC
You could try using Tie::File with a Fisher-Yates shuffle in place: `#!perl use strict; use warnings; use Tie::File; my $filename = shift; die 'No file' unless $filename; my $tie = tie my @file, 'Tie::File', $filename, memory => 20_000_000; die "Couldn't tie $filename: $!" unless $tie; $tie->defer(); shuffle(\@file); $tie->flush(); undef $tie; untie @file; sub shuffle { my $deck = shift; # $deck is a reference to an array my $i = @$deck; while (--$i) { my $j = int rand ($i+1); @$deck[$i,$j] = @$deck[$j,$i]; } } __END__` [download] This worked well enough for me using a simple text file containing 1 million lines of random numbers.	[reply] [d/l]


Your skill will accomplish what the force of many cannot
	PerlMonks