Re^4: perl ST sort performance issue for large file? (copying)

That still involves an extra copying of each record. For long records, I've certainly seen that add up to a performance problem ...

On my platform at least, copying the records through a single buffer -- thus not incurring the need to request extra memory from the OS for that particular part of the operation -- doesn't seem to incur much of a penalty.

On a fairly modest, 50MB/1e6 record file:

C:\test>dir junk.dat
30/03/2012  17:39        53,777,008 junk.dat

C:\test>wc -l junk.dat
1000000 junk.dat

C:\test>head junk.dat
Some random junk 333740 119750 some more random junk
Some random junk 495178 597808 some more random junk
Some random junk 815582 448120 some more random junk
Some random junk 604675 737548 some more random junk
Some random junk 735839 436462 some more random junk
Some random junk 89324 842712 some more random junk
Some random junk 18493 967895 some more random junk
Some random junk 835571 146942 some more random junk
Some random junk 783477 720062 some more random junk
Some random junk 197540 926361 some more random junk
[download]

Doing the GRT in 3 stages:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my $start = time;

my @a; $#a = 1e6;
my $n = 0;
$a[ $n++ ] = pack 'NNA*', m[(\d+) (\d+)], $_ while <>;

printf "%d records\n", scalar @a;

@a = sort @a;

substr( $_, 0, 8, '') for @a;

printf "Took %.6f seconds; check memory? ", time()- $start; <STDIN>;
[download]

Takes 7 seconds and just under double the amount of memory required by the original array:

C:\test>t-GRT junk.dat
1000001 records
Took 6.704649 seconds; check memory? 274.1MB
[download]

Doing it with your original formulation unmodified, but excluding the initial time to load the array from the equation:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

my @a; $#a = 1e6;
my $n = 0;
$a[ $n++ ] = $_ while <>;

printf "%d records; check memory", scalar @a; <STDIN>;

my $start = time;

my @sorted = @a[
    map { unpack 'N', substr($_,-4) }
    sort
    map { pack 'NNN', $a[ $_ ] =~ m[(\d+) (\d+)], $_ }
    0..$#a
];

printf "Took %.6f seconds; check memory? ", time()- $start; <STDIN>;
[download]

The sort takes 30 times longer and 4.5times as much memory as the original array:

C:\test>t-TYE junk.dat
1000001 records; check memory 143.8MB
Took 228.533225 seconds; check memory? 649.0 MB
[download]

Most of that time is spent going to the OS over and over to expand the stack piecemeal to accommodate the lists as they expand.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Comment on Re^4: perl ST sort performance issue for large file? (copying) Select or Download Code


P is for Practical
	PerlMonks