Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Extreme Example of TMTOWTDI

by Cody Pendant (Prior)
on Mar 22, 2002 at 08:40 UTC ( #153514=perlmeditation: print w/replies, xml ) Need Help??

Fellow monks,
I had a task to do in Perl yesterday, and it was a simple one.

I had a 25,000-word cracking dictionary file I downloaded, and I wanted to split it up by word length (My motivation is to do with cryptic crosswords).

I just wanted to put all the 10-letter words into a file called 10.words and so on.

So I wrote it the following very dumb way. I knew it was dumb, but I knew it would work.

while(<>){ chomp($_); $len = length($_); open(OUTPUT,">>${len}.words"); print OUTPUT "$_\n"; close(OUTPUT); }

And you've got to admit, it was quick to code.

Then, tortured by how dumb it was, I came back to the problem and figured something else out -- once I'd checked that I could have an array called "@1" and an array called "@2" and so on -- for a while I was convinced that was illegal, though I can't say why.

This is the better way:

while(<>){ chomp($_); $len = length($_); if($longest < $len){ $longest = $len; } push(@{$len},$_); } $"="\n"; for($x=1;$x<=$longest;$x++){ open(WORDS,">$x.txt") || die "$!"; print WORDS "@${x}"; close(WORDS); }

Who wants to guess how much smarter the second was than the first?

The first way took 212 seconds (on a computer with a 333MHz chip) and the second took two seconds.

I just thought you might be interested, or have comments.

($_='jjjuuusssttt annootthhrer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: Extreme Example of TMTOWTDI
by ChOas (Curate) on Mar 22, 2002 at 09:13 UTC

    Care to test this one aswell ?

    I had the following results (25143 words):
    ChOas@xxx$ time ./yours words real 0m1.21s user 0m1.01s sys 0m0.07s ChOas@xxx$ time ./mine words real 0m0.68s user 0m0.54s sys 0m0.05s ChOas@xxx$

    This is the code:
    #!/usr/bin/perl -w use strict; my %Word; my $File; # Fill the hash/list here, ignoring the chomp push @{$Word{length$_}},$_ while (<>); # Itterate over the different lenghts of words for (keys %Word) { # This is what is called a dirty hack, subtract # one from the length of the string to compensate # for the '\n' (but hey, it saves a chomp for every word :) ) $File=$_-1 . ".mine"; # Write the file open OUT,">$File" or die "Cannot open $File.mine: $!\n"; print OUT @{$Word{$_}}; close OUT; };

    p.s. You REALLY want to run/develop code with use strict; and -w...


    print "profeth still\n" if /bird|devil/;
      Maybe it's an even dirtier hack, but this one can speed up things a lot.
      Your script saves time by skipping chomp. My addition will save time by avoiding a hidden join for each array interpolation. I found this trick when I was benchmarking anagram algorithms.
      I used a 116_246 words dictionary, and I got a significative speed improvement, as you can see from the relative times.
      $ wc words.english 116246 116246 989075 words.english $ time perl words.english # first hack 1.38 user 0.05 system 0:01.43 elapsed $ time perl words.english # second hack 0.86 user 0.02 system 0:00.88 elapsed
      push @{$Word{length$_}},$_ while (<>); # ... print OUT @{$Word{$_}};
      $Word{length$_} .= $_ while (<>); # ... print OUT $Word{$_};

      BTW: ChOas, congratulations for your sanctification!

      There is always room for improvement, though. ...
      Combining the first array solution with the other tricks, we have a further 10% improvement.

      #!/usr/bin/perl -w use strict; my @words; my $file; $words[length$_] .= $_ while (<>); for (0..$#words) { next unless $words[$_]; $file=$_-1 . ".mine"; open OUT,">$file" or die "Cannot open $file.mine: $!\n"; print OUT $words[$_]; close OUT; };
      update (2)
      I am afraid that I must have spoiled demerphq's comparison with my previous update. ;)
      He very chivalrously included such update and compared the methods shown so far, offering an improvement, that was unfortunately slower than this array + concatenation hack.
      Nice Job. Keep up the good work, bro, and many thanks for your analysis.
       _  _ _  _  
      (_|| | |(_|><
        Hmm. Well I have to admit this thread was a bit of a suprise.

        I figured the overhead of loading the thing into memory and then spitting it out was unecessary so I wrote a variant that opens the files first and then writes to the apropriate one for each line. Trouble was it wasnt faster. :( So after a bit of tweaking and some less than pretty (imo) code I managed to soup it up to be the same speed or just slightly faster than your concatenation version. (Unfortunately in the interest of fairness I applied those optimizations that i could to your variant and once done its faster than my version. :-)

        Incidentally I used a dictionary of 320k words, ranging from 2 to 29 chars. Im pretty sure that under some memory sizes the approach I took would be more appropriate than the other two (and vice versa that on some OS's the approach would be impractical due to restrictions on the number of filehandles allowed), but I was suprised that it wasnt faster to start off with. It seems the overhead of switching between filehandles is higher than I thought. <super>

        Benchmark: timing 10 iterations of ch0as, demerphq, gmax, gmax_dem...
             ch0as: 45 wallclock secs (31.86 usr +  1.86 sys = 33.72 CPU) @  0.30/s (n=10)
          demerphq: 23 wallclock secs (19.05 usr +  2.26 sys = 21.31 CPU) @  0.47/s (n=10)
              gmax: 27 wallclock secs (21.03 usr +  1.94 sys = 22.97 CPU) @  0.44/s (n=10)
          gmax_dem: 25 wallclock secs (18.16 usr +  1.61 sys = 19.77 CPU) @  0.51/s (n=10)
                 s/iter    ch0as     gmax demerphq gmax_dem
        ch0as      3.37       --     -32%     -37%     -41%
        gmax       2.30      47%       --      -7%     -14%
        demerphq   2.13      58%       8%       --      -7%
        gmax_dem   1.98      71%      16%       8%       --
        </super> Oh and I made some slight modifications to your and ch0as's snippets, such as not using global filehandles and using a better formatted filename. These are consistent across the variants so they shouldnt have compromised the benchmark in any way.

        BTW: I just completed a crossword/scrabble solving engine in perl. One of the avenues that I pursued was storing words based on their length. This strategy turned out to be inappropriate. I ended up using a custom hashing algorithm where the hashed value had semantic meaning about the words it matched. This enabled me to rule out 10s-1000s of words with one binary and. (I ended up getting a turn of a scrabble game down to around 2-3 secs.) If anybodys interested in it let me know.

        Heres the code I benchmarked:

        #!/usr/bin/perl -w use strict; use Benchmark 'cmpthese'; sub demerphq { open my $IN,"words.txt" or die "words.txt:$!"; my $outf; my @fh; my $nl_len=length("\n"); foreach (1..35) { open $fh[$_+$nl_len],sprintf(">./len/%02d.demerphq",$_) or die + "$_.demerphq:$!"; } print {$fh[length $_]} $_ while <$IN>; $_ && close $_ foreach @fh; foreach (1..35) { unlink sprintf("./len/%02d.demerphq",$_) unless -s sprintf("./ +len/%02d.demerphq",$_); } close $IN; } sub ch0as { my %Word; my $File; open my $IN,"words.txt" or die "words.txt:$!"; push @{$Word{length$_}},$_ while (<$IN>); for (keys %Word) { $File=sprintf("./len/%02d.ch0as",$_-1); open my $OUT,">$File" or die "Cannot open $File: $!\n"; print $OUT @{$Word{$_}}; close $OUT; }; } sub gmax { my %Word; my $File; open my $IN,"words.txt" or die "words.txt:$!"; $Word{length$_} .= $_ while (<$IN>); for (keys %Word) { $File=sprintf("./len/%02d.gmax",$_-1); open my $OUT,">$File" or die "Cannot open $File: $!\n"; print $OUT $Word{$_}; close $OUT; }; } sub gmax_dem { my @Word; my $File; open my $IN,"words.txt" or die "words.txt:$!"; $Word[length$_] .= $_ while (<$IN>); for (0..$#Word) { next unless $Word[$_]; $File=sprintf("./len/%02d.gmax.dem",$_-1); open my $OUT,">$File" or die "Cannot open $File: $!\n"; print $OUT $Word[$_]; close $OUT; }; } cmpthese(10,{ demerphq => \&demerphq, gmax => \&gmax, gmax_dem => \&gmax_dem, ch0as => \&ch0as });

        Yves / DeMerphq
        Writing a good benchmark isnt as easy as it might look.

      push @{$Word{length$_}},$_ while (<>);

      I have to admit I don't really get that one. I'm realising there's a lot I don't know about hashes of arrays and arrays of hashes.

      My preferred solution to the problem would have gone like this (pseudocode!):

      $len = length($_); #let's say $len is 5 $hash{$len}=[] unless defined $hash{$len}; # make a hash key called "5" with an empty array as the value push($hash{$len},$_); #use it as an array and push this five-letter word onto it.

      but, as I'm sure you all know, I couldn't do that.

      Was I even close? What's the hash way to do this?

      You REALLY want to run/develop code with use strict; and -w

      I do most of the time, really. Honestly. No, I do. Sometimes anyway. I have got "or die $!" on my file opens at least, I'm getting better.


      ($_='jjjuuusssttt annootthhrer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
        As has been pointed out before, the fundamental cause of the slowness of your orginal code is that you are doing an open/write/close operation for each input line. That's no good -- open and close are slow things to do.

        Here's something else you can consider -- anonymous open filehandles aggregated and managed within a container.


        Check it out...

        I created 2 programs. that creates some number of words (well, string of "a" of length between 1 and 10) and that puts them into files named LENGTH.words. I run them together, with: ./ 100 | ./ (The 100 is the number of "words" I want to generate.)

        #!/usr/bin/perl -w
        use strict;
        my $len;
        foreach ( 1 .. $ARGV[0] ) {
            my $len = (int(10 * rand())) + 1;
            print (('a' x $len), "\n");
        exit 0;

        #!/usr/bin/perl -w
        use strict;
        # use Data::Dumper;
        my @filehandles = ();
        while ( my $word = <> ) {
            my $len = length($word) - 1;
        #    print Dumper(\@filehandles);
            unless ( $filehandles[$len] ) {
        	open($filehandles[$len], "> $len.words") or do {
        	    warn "$len.words already exists!\n";
            print {$filehandles[$len]} $word;
        foreach ( @filehandles ) {
            close($_) if $_;
        exit 0;
         is too simple for analysis. (Right?)

        In, I create an array that is meant to store my anonymous filehandles. Of coure, there isn't anything in it at the begining of the program.

        Looping through all the words, I determine the length of the word in question. (Why bother chomp-ing? We'll use the newline later).

        Then, I want to write the word to the file. The next thing to do is to open an appropriately-named filehandle for writing... unless there already is an appropriate filehandle. In which case, just print the word to that filehandle.

        Notice that I'm using a scalar as my filehandle. Perl will autovivify an anonymous filehandle and assign it to that scalar, assuming it's an assignable scalar... such as what can be found in an array element. (The array is my aggregator -- I aggregate [collect] my filehandles within it.)

        At the end of the program, I go through my array and close any filehandles that are stored within it. I don't just want to close every element in the array, as some of them might be "undef" values.

        Note that if you uncomment the 2 commented lines, you'll get some data-dumper output that shows you the gradual population of elements within the @filehandles array with anon filehandles.

        Note that older version of Perl 5 won't support this kind of filehandle autovivification of empty assignable scalars. (In which case, you can still use this technique, but with minor modifications. But ask me about this later if you like.)

        Here's a bit of a screencapture of the whole procedure...

        rdice@tanru:~/tmp/test$ rm *.words; ./ 100 | ./
        rdice@tanru:~/tmp/test$ wc -l *.words
              8 1.words
             11 10.words
              8 2.words
             15 3.words
              7 4.words
             11 5.words
             10 6.words
              9 7.words
             10 8.words
             11 9.words
            100 total


Re: Extreme Example of TMTOWTDI
by Juerd (Abbot) on Mar 22, 2002 at 11:53 UTC

    When using external files, think about your operating system's caching. For example:

    2;0 juerd@ouranos:/usr/share/dict$ du -ch | grep total 52M total 2;0 juerd@ouranos:/usr/share/dict$ time cat * > /dev/null real 0m4.382s user 0m0.000s sys 0m0.260s 2;0 juerd@ouranos:/usr/share/dict$ time cat * > /dev/null real 0m0.255s user 0m0.010s sys 0m0.230s
    So always try your first solution AGAIN when you tried your second.


Re: Extreme Example of TMTOWTDI
by dragonchild (Archbishop) on Mar 22, 2002 at 14:29 UTC
    I just have to be the anal saint and chastise you for using symbolic references. *grins* Consider yourself chastised. :-)

    Neat thread. ++! :-)

    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: Extreme Example of TMTOWTDI
by sporte01 (Acolyte) on Mar 22, 2002 at 21:55 UTC
    Here's a simple explination onw hat you are doing. Everytime you find a word of a specific length, you pretty much open-write-close.

    It's an expensive thing to open a file, not so expensive you should never open a file, but its expensive. You don't wanna do it too often. In your first loop, you are opening/closing x times. x = lines in text file you are reading from.

    In the second one, you pretty much read all the data into memory and then open/close y times. y = number of different lengths.

    SO now its a matter of ~25000 file open/close vs max word length times (~20? ~50?)

Re: Extreme Example of TMTOWTDI
by Anonymous Monk on Mar 22, 2002 at 15:53 UTC
    You'd get the same speed up by doing this:
    open(OUTPUT,">>${len}.words"); while(<>){ chomp($_); $len = length($_); print OUTPUT "$_\n"; } close(OUTPUT);
    Might even be faster since there is no array house cleaning. But at 2 seconds, I imagine that it takes longer to type the command then run anyway.
      open(OUTPUT,">>${len}.words"); while(<>){ chomp($_); $len = length($_); print OUTPUT "$_\n"; } close(OUTPUT);
      This doesn't do what the others do. For one, $len isn't defined before the open. Secondly, you don't have an outer loop involved to be able to re-open the files as needed.

      Secondly, if you're going to use $_, then use it. Don't piddle around. If you're going to use a named variable (and I'm not saying this is bad), then name it. Otherwise, take advantage of Perl-isms.

      while (<>) { chomp; my $len = length; print OUTPUT $_, $/; }
      It wouldn't be nice unless I posted my own (unBenchmarked) version.
      use IO::File; my %Handles; while (<>) { my $handle = $Handles{length} ||= IO::File->new(">" . length - 1); print $handle $_; } # This is unnecessary unless you're anal (like I try to be) $_->close for values %Handles;

      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://153514]
Approved by root
[oiskuu]: Hm. that depends on kernel compiled with CONFIG_AUDITSYSCAL L
[tye]: apropos my several mentions of audit
[Corion]: At least my (non-SELinux) Debian has that config thing set. I don't have non-Debian machines handy (except Android)
[Corion]: My Android phone also has /proc/self/ loginuid, but that displays -1 (resp. 4GB). That might be because the phone is rooted.
[tye]: -1 means nobody logged in or the process was started before audit got booted
[davido]: ok, on my ubuntu system getlogin grabs from /proc/self/ loginuid (per strace)

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2017-06-23 19:56 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (554 votes). Check out past polls.