Re: Very Large Arrays

by BrowserUk (Patriarch)
on Feb 14, 2012 at 05:10 UTC


in reply to Very Large Arrays

There appears to be something going on here that is not obvious from the code snippet you posted.

A 43 million element array of 20-ish character strings should come in well below 4GB. And your Fisher-Yates implementation is in-place, so it should cause no memory growth. I.e. you should be well within your machine's capacity without swapping, and your shuffle should be completing in a minute or so.

Is there a better way to do this, other than the iterative slicing he's doing?

There are a couple of small improvements you could make to your shuffle:

  1. Shuffling through a temporary rather than a slice assignment is slightly more efficient:

    my $temp = $array->[ $i ];
    $array->[ $i ] = $array->[ $j ];
    $array->[ $j ] = $temp;
  2. There is no reason to test for $i == $j.

    Swapping a value with itself does not invalidate the shuffle, and it is cheaper to do a single redundant swap than to test 43 million times to avoid it. (A sketch of the loop with both changes applied follows this list.)
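
Put together, a minimal sketch of the loop with both of those tweaks applied (the sub name is illustrative, not taken from your code):

    sub shuffle_in_place {
        my $array = shift;                           # expects an array reference
        for my $i ( 0 .. $#$array ) {
            my $j = $i + int rand( @$array - $i );   # $j may equal $i; that's fine
            my $temp       = $array->[ $i ];
            $array->[ $i ] = $array->[ $j ];
            $array->[ $j ] = $temp;
        }
    }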

That said, the savings from the above will make little dent in the current processing time, and they are completely negated by your tracing code.

You need to look beyond the F-Y function for the cause of its slowness. E.g. are you consuming large amounts of data outside of that subroutine that could be pushing you into swapping?



Re^2: Very Large Arrays
by dave_the_m (Monsignor) on Feb 14, 2012 at 09:59 UTC
    A 43 million element array of 20-ish character strings should come in well below 4GB.
    On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible. Based on the OP's comments, I would guess they aren't; and together with the overhead of whatever the rest of the code does, that swap really is an issue.

    In that case, it may be worth the OP's while to try an initial sequential read of the array to get all the elements swapped back in, before hitting random elements. Whether this will increase performance is pure speculation at this point.
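
    A do-nothing sequential pass over the array would be enough to touch every element; a trivial sketch (here @array stands in for the 43-million element array, which isn't named in the posted snippet):

        my $touched = 0;
        $touched += length for @array;    # reads each element once, in order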

    Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().
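
    For illustration, handling fixed-width records in one big string might look something like the sketch below. The 20-byte record width, the space padding and the toy data are assumptions made for the example, not taken from the OP's code:

        use strict;
        use warnings;
        use constant RECLEN => 20;    # assumed fixed record width

        my @records = ( '4.9E-8,1.25E-7', '3.1E-9,2.2E-7' );    # toy data

        # Build one big scalar of fixed-width, space-padded records.
        # (Records longer than RECLEN would need truncating or a wider RECLEN.)
        my $big = join '', map { sprintf '%-*s', RECLEN, $_ } @records;

        # Read record $i: a single substr, no per-element SV.
        my $i   = 1;
        my $rec = substr $big, $i * RECLEN, RECLEN;

        # Overwrite record $i in place using 4-argument substr.
        substr $big, $i * RECLEN, RECLEN, sprintf '%-*s', RECLEN, '9.9E-9,8.8E-8';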

    Dave.

      On a 64-bit system with a modern perl, it uses about 3.5GB, assuming that the strings and the array were created in the most efficient way possible.

      As much as that? I based my estimate upon this (running 5.10.1 64-bit):

      c:\test>p1
      [0] Perl> sub shuffle {
          my $r = shift;
          for ( 0 .. $#$r ) {
              my $a = $_ + rand( @$r - $_ );
              my $t = $r->[ $a ];
              $r->[ $a ] = $r->[ $_ ];
              $r->[ $_ ] = $t;
          };
      };;
      [0] Perl> @a = 1 .. 1e6;;
      [0] Perl> push @a, @a;;
      [0] Perl> push @a, @a;;
      [0] Perl> push @a, @a;;
      [0] Perl> push @a, @a;;
      [0] Perl> $t = time; shuffle \@a; print time - $t;;
      20.9429998397827
      [0] Perl> print total_size \@a;;
      518218064

      That's 16 million integers occupying just under 500MB being shuffled in 20 seconds.

      I reasoned that 3 times that gives 48M in 1.5GB. Then replace the ints with the pointers and add 48M x 24 bytes (20 rounded up to 24) ≈ 1.1GB for the strings themselves, giving roughly 2.6GB total; and around 60 seconds to shuffle. Did I miss something?

      Another possibility is to use a single large string, divided into blocks of 20 chars, to avoid the overhead of the array and SVs; then access individual elements using substr().

      I had similar thoughts. On the basis of the single pair of numbers the OP gave as an example, it looks like floats would be sufficient precision for his purposes. In which case, a packed string shuffled with substr seemed possible:

      c:\test>p1
      [0] Perl> sub shuffle {
          my $r = shift;
          my $n = length( $$r ) / 8;
          for ( 0 .. $n ) {
              my $a = $_ + rand( $n - $_ );
              my $t = substr $$r, $a, 8;
              substr $$r, $a, 8, substr $$r, $_, 8;
              substr $$r, $_, 8, $t;
          };
      };;
      [0] Perl> $a = pack 'f*', map -1e5 + rand(1e9), 1 .. 1e6;;
      [0] Perl> $a x= 96;;
      [0] Perl> print length $a;;
      384000000
      [0] Perl> $t = time; shuffle \$a; print time - $t;;
      59.6950001716614

      It rather surprised me that the shuffle time was so close to my estimate for the array of strings, despite needing four substrs per swap.



        I reasoned that 3 times that gives 48M in 1.5GB. Then replace the ints with the pointers and add 48M x 24 bytes (20 rounded up to 24) ≈ 1.1GB for the strings themselves, giving roughly 2.6GB total; and around 60 seconds to shuffle. Did I miss something?
        Yes: an int is stored entirely within an SV head, whereas a string needs an SV body too, as well as the string itself (which may also have some malloc overhead).

        I used the following code to get my estimate:

        my $max = 10_000_000;
        my @a;
        system "ps -flyp $$";
        $a[$max] = 0;    # pre-extend array
        $a[$_] = '4.90032E-8,1.25327E-7' for 0..$max;
        system "ps -flyp $$";
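
        For comparison, Devel::Size (the module behind the total_size call used earlier in the thread) can give a similar per-element figure without reading ps output. A rough sketch, assuming the module is installed; exact numbers vary with perl version and build:

            use strict;
            use warnings;
            use Devel::Size qw( total_size );

            my $max = 1_000_000;    # smaller run, same idea
            my @a;
            $a[$max] = 0;           # pre-extend array
            $a[$_] = '4.90032E-8,1.25327E-7' for 0 .. $max;

            my $bytes = total_size( \@a );
            printf "total: %d bytes, per element: ~%d bytes\n", $bytes, $bytes / $max;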
        Dave.
Re^2: Very Large Arrays
by Desade (Initiate) on Feb 16, 2012 at 00:36 UTC

    Yes, there is a great deal going on prior to this shuffle routine. The 43-mil array of number pairs being shuffled is generated from some calculations done earlier in the script, and those calcs take as input two files, each 2.6GB in size. Since this is not my code, it causes me no shame to say that those files are read in thus:

    open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
    open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
    my @file1 = <TEMP1>;
    my @file2 = <TEMP2>;
    close TEMP1;
    close TEMP2;

    While it's doing that, you can watch the memory usage grow like an escalator to nowhere. Those arrays of strings then get iterated, split, parsed, calculated on, and eventually pairs of "interesting" values from each of them get pushed, one by one, onto the 43-mil array that is the subject of the shuffle. So by the time the shuffle gets called, we're quite a ways into the swap.

    Thank you all for all the analysis and ideas. You pretty much confirmed what I thought I was seeing, and despite my personal desire to re-write this thing in C, I think the biggest boon for my buck is going to simply be doubling the memory on this machine before starting on the analysis runs she has planned for me. It sounds like having this whole array in physical memory prior to calling the shuffle is likely to drastically reduce the run-time... more than anything else I can squeeze out with software optimization alone.

      I think the biggest boon for my buck is going to simply be doubling the memory on this machine

      If the machine has the hardware capacity, that's definitely the easiest option.

      That said, you could probably make a few small changes to the script that would substantially reduce the memory requirement without having to rewrite it in C.

      For example, changing the small snippet you showed to the following will substantially reduce the memory requirement (at that point):

      open(TEMP1, $overlap_files[$i]) || die "\n\nCan't open $overlap_files[$i]!!\n\n";
      my $n = 0;
      my @file1;
      $file1[ $n++ ] = $_ while <TEMP1>;
      close TEMP1;

      open(TEMP2, $overlap_files[$j]) || die "\n\nCan't open $overlap_files[$j]!!\n\n";
      $n = 0;
      my @file2;
      $file2[ $n++ ] = $_ while <TEMP2>;
      close TEMP2;

      And building a big string instead of an array for the shuffle would reduce it further.

      Often, just a little inspection will show where large amounts of data can be thrown away incrementally as you finish with it, and that can add up to huge savings.

      For example, by the time you are ready to do the shuffle, have you finished with the input arrays?

      And does the algorithm for building the list of interesting pairs require that both input arrays be loaded into memory in their entirety, or could you process them one line at a time? Perhaps reading them in lockstep, as sketched below.
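
      A minimal sketch of that lockstep pattern, meant to slot into the existing script (it reuses your @overlap_files, $i and $j; the pair-building step is a placeholder, not your actual logic):

          open my $fh1, '<', $overlap_files[$i] or die "Can't open $overlap_files[$i]: $!";
          open my $fh2, '<', $overlap_files[$j] or die "Can't open $overlap_files[$j]: $!";

          my @pairs;
          while ( 1 ) {
              my $line1 = <$fh1>;
              my $line2 = <$fh2>;
              last if !defined $line1 or !defined $line2;    # stop when either file runs out
              chomp( $line1, $line2 );
              # ... split/parse/calculate here, keeping only the "interesting" values ...
              push @pairs, "$line1,$line2";                  # placeholder for the real pair
          }
          close $fh1;
          close $fh2;

      That way neither input file ever needs to be held in memory in full.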

      Anyway, good luck.



        As a follow-up, I mostly left the script alone, since the wife is in the final weeks of a really big analytic paper and I can't risk slowing her down with my debugging. But I did double the memory on the machine (to 12GB), and saw the clock-time cost of the Fisher-Yates shuffle drop from the ~hour mentioned above to about 45 seconds for all 43 million iterations.

        So I think we can safely say it was a swapping issue...

        Thanks for all the kind suggestions and analysis!
