Re: (FASTER) Example of TMTOWTDI

Maybe it's an even dirtier hack, but this one can speed up things a lot.
Your script saves time by skipping chomp. My addition will save time by avoiding a hidden join for each array interpolation. I found this trick when I was benchmarking anagram algorithms.
I used a 116_246 words dictionary, and I got a significative speed improvement, as you can see from the relative times.

$ wc words.english
 116246  116246  989075 words.english

$ time perl dict.pl words.english # first hack
1.38    user
0.05    system
0:01.43 elapsed
$ time perl dict2.pl words.english  # second hack
0.86     user
0.02     system
0:00.88 elapsed
[download]

Change

push @{$Word{length$_}},$_ while (<>);
# ...
print OUT @{$Word{$_}};
[download]

into

$Word{length$_} .= $_ while (<>);
# ...
print OUT $Word{$_};
[download]

BTW: ChOas, congratulations for your sanctification!

update
There is always room for improvement, though. ...
Combining the first array solution with the other tricks, we have a further 10% improvement.

#!/usr/bin/perl -w
use strict;
my @words;
my $file;
$words[length$_] .= $_ while (<>);
for (0..$#words)
{
 next unless $words[$_];
 $file=$_-1 . ".mine";
 open OUT,">$file" or die "Cannot open $file.mine: $!\n";
 print OUT $words[$_];
 close OUT;
};
[download]

update (2)
I am afraid that I must have spoiled demerphq's comparison with my previous update. ;)
He very chivalrously included such update and compared the methods shown so far, offering an improvement, that was unfortunately slower than this array + concatenation hack.
Nice Job. Keep up the good work, bro, and many thanks for your analysis.

 _  _ _  _  
(_|| | |(_|><
 _|

Comment on Re: (FASTER) Example of TMTOWTDI Select or Download Code

Replies are listed 'Best First'.

Re: Re: (FASTER) Example of TMTOWTDI
by demerphq (Chancellor) on Mar 22, 2002 at 11:53 UTC

I figured the overhead of loading the thing into memory and then spitting it out was unecessary so I wrote a variant that opens the files first and then writes to the apropriate one for each line. Trouble was it wasnt faster. :( So after a bit of tweaking and some less than pretty (imo) code I managed to soup it up to be the same speed or just slightly faster than your concatenation version. (Unfortunately in the interest of fairness I applied those optimizations that i could to your variant and once done its faster than my version. :-)

Incidentally I used a dictionary of 320k words, ranging from 2 to 29 chars. Im pretty sure that under some memory sizes the approach I took would be more appropriate than the other two (and vice versa that on some OS's the approach would be impractical due to restrictions on the number of filehandles allowed), but I was suprised that it wasnt faster to start off with. It seems the overhead of switching between filehandles is higher than I thought. _<super>

Benchmark: timing 10 iterations of ch0as, demerphq, gmax, gmax_dem...
     ch0as: 45 wallclock secs (31.86 usr +  1.86 sys = 33.72 CPU) @  0.30/s (n=10)
  demerphq: 23 wallclock secs (19.05 usr +  2.26 sys = 21.31 CPU) @  0.47/s (n=10)
      gmax: 27 wallclock secs (21.03 usr +  1.94 sys = 22.97 CPU) @  0.44/s (n=10)
  gmax_dem: 25 wallclock secs (18.16 usr +  1.61 sys = 19.77 CPU) @  0.51/s (n=10)
         s/iter    ch0as     gmax demerphq gmax_dem
ch0as      3.37       --     -32%     -37%     -41%
gmax       2.30      47%       --      -7%     -14%
demerphq   2.13      58%       8%       --      -7%
gmax_dem   1.98      71%      16%       8%       --

</super>

ch0as

BTW: I just completed a crossword/scrabble solving engine in perl. One of the avenues that I pursued was storing words based on their length. This strategy turned out to be inappropriate. I ended up using a custom hashing algorithm where the hashed value had semantic meaning about the words it matched. This enabled me to rule out 10s-1000s of words with one binary and. (I ended up getting a turn of a scrabble game down to around 2-3 secs.) If anybodys interested in it let me know.

Heres the code I benchmarked:

#!/usr/bin/perl -w
use strict;
use Benchmark 'cmpthese';

sub demerphq {
    open my $IN,"words.txt" or die "words.txt:$!";
    my $outf;
    my @fh;
    my $nl_len=length("\n");
    foreach (1..35) {
        open $fh[$_+$nl_len],sprintf(">./len/%02d.demerphq",$_) or die
+ "$_.demerphq:$!";
    }
    print {$fh[length $_]} $_ while <$IN>;
    $_ && close $_ foreach @fh;
    foreach (1..35) {
        unlink sprintf("./len/%02d.demerphq",$_) unless -s sprintf("./
+len/%02d.demerphq",$_);
    }
    close $IN;
}

sub ch0as {
    my %Word;
    my $File;
    open my $IN,"words.txt" or die "words.txt:$!";
    push @{$Word{length$_}},$_ while (<$IN>);
    for (keys %Word) {
         $File=sprintf("./len/%02d.ch0as",$_-1);
         open my $OUT,">$File" or die "Cannot open $File: $!\n";
         print $OUT @{$Word{$_}};
         close $OUT;
    };
}

sub gmax {
    my %Word;
    my $File;
    open my $IN,"words.txt" or die "words.txt:$!";
    $Word{length$_} .= $_ while (<$IN>);
    for (keys %Word) {
         $File=sprintf("./len/%02d.gmax",$_-1);
         open my $OUT,">$File" or die "Cannot open $File: $!\n";
         print $OUT $Word{$_};
         close $OUT;
    };
}

sub gmax_dem {
    my @Word;
    my $File;
    open my $IN,"words.txt" or die "words.txt:$!";
    $Word[length$_] .= $_ while (<$IN>);
    for (0..$#Word)    {
        next unless $Word[$_];
         $File=sprintf("./len/%02d.gmax.dem",$_-1);
         open my $OUT,">$File" or die "Cannot open $File: $!\n";
         print $OUT $Word[$_];
         close $OUT;
    };
}


cmpthese(10,{
               demerphq => \&demerphq,
                gmax     => \&gmax,
                gmax_dem => \&gmax_dem,
                ch0as    => \&ch0as
            });
[download]

Yves / DeMerphq
---
Writing a good benchmark isnt as easy as it might look.

[reply]
[d/l]