
Re: Re: Extreme Example of TMTOWTDI

by Cody Pendant (Prior)
on Mar 23, 2002 at 07:18 UTC ( #153745=note )

in reply to Re: Extreme Example of TMTOWTDI
in thread Extreme Example of TMTOWTDI

push @{$Word{length$_}},$_ while (<>);

I have to admit I don't really get that one. I'm realising there's a lot I don't know about hashes of arrays and arrays of hashes.

My preferred solution to the problem would have gone like this (pseudocode!):

$len = length($_);       # let's say $len is 5
$hash{$len} = [] unless defined $hash{$len};
                         # make a hash key called "5" with an
                         # empty array as the value
push($hash{$len}, $_);   # use it as an array and push this
                         # five-letter word onto it

but, as I'm sure you all know, I couldn't do that.

Was I even close? What's the hash way to do this?
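For comparison, here is one way the one-liner above can be unrolled (a sketch using sample words rather than `<>` input; autovivification does the work the pseudocode tries to do by hand):

```perl
use strict;
use warnings;

my %Word;    # hash of arrays, keyed by word length
for my $word (qw(cat goose dog heron)) {
    # Autovivification: @{ $Word{$len} } springs into existence as an
    # empty array the first time it is dereferenced, so there is no
    # need for the '$hash{$len} = [] unless defined' step.
    push @{ $Word{ length $word } }, $word;
}

print "$_: @{ $Word{$_} }\n" for sort keys %Word;
# prints:
# 3: cat dog
# 5: goose heron
```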

You REALLY want to run/develop code with use strict; and -w

I do most of the time, really. Honestly. No, I do. Sometimes anyway. I have got "or die $!" on my file opens at least, I'm getting better.


($_='jjjuuusssttt annootthhrer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: Re: Re: Extreme Example of TMTOWTDI
by Dice (Beadle) on Mar 24, 2002 at 18:12 UTC
    As has been pointed out before, the fundamental cause of the slowness of your original code is that you are doing an open/write/close operation for each input line. That's no good -- open and close are slow things to do.

    Here's something else you can consider -- anonymous open filehandles aggregated and managed within a container.


    Check it out...

    I created 2 programs: one that creates some number of "words" (well, strings of "a" of length between 1 and 10), and one that puts them into files named LENGTH.words. I run them together with: ./ 100 | ./ (The 100 is the number of "words" I want to generate.)

    #!/usr/bin/perl -w
    use strict;
    foreach ( 1 .. $ARGV[0] ) {
        my $len = (int(10 * rand())) + 1;
        print (('a' x $len), "\n");
    }
    exit 0;

    #!/usr/bin/perl -w
    use strict;
    # use Data::Dumper;
    my @filehandles = ();
    while ( my $word = <> ) {
        my $len = length($word) - 1;
    #    print Dumper(\@filehandles);
        unless ( $filehandles[$len] ) {
            open($filehandles[$len], "> $len.words") or do {
                warn "can't open $len.words: $!\n";
                next;
            };
        }
        print {$filehandles[$len]} $word;
    }
    foreach ( @filehandles ) {
        close($_) if $_;
    }
    exit 0;
    The first program is too simple for analysis. (Right?)

    In the second program, I create an array that is meant to store my anonymous filehandles. Of course, there isn't anything in it at the beginning of the program.

    Looping through all the words, I determine the length of the word in question. (Why bother chomp-ing? We'll use the newline later).

    Then, I want to write the word to the file. The next thing to do is to open an appropriately-named filehandle for writing... unless there already is an appropriate filehandle. In which case, just print the word to that filehandle.

    Notice that I'm using a scalar as my filehandle. Perl will autovivify an anonymous filehandle and assign it to that scalar, assuming it's an assignable scalar... such as what can be found in an array element. (The array is my aggregator -- I aggregate [collect] my filehandles within it.)

    At the end of the program, I go through my array and close any filehandles that are stored within it. I don't just want to close every element in the array, as some of them might be "undef" values.

    Note that if you uncomment the 2 commented lines, you'll get some data-dumper output that shows you the gradual population of elements within the @filehandles array with anon filehandles.

    Note that older versions of Perl 5 won't support this kind of filehandle autovivification of empty assignable scalars. (In which case, you can still use this technique, but with minor modifications. But ask me about this later if you like.)
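    One such modification (my sketch, not the poster's code) is to create the handle explicitly with the core IO::File module, whose objects can be stored in array elements and used anywhere a filehandle is expected, so no autovivification is needed:

```perl
use strict;
use warnings;
use IO::File;   # core module; its objects work as filehandles

my @filehandles;
my $len = 3;    # example word length

unless ( $filehandles[$len] ) {
    # Explicitly construct the handle instead of relying on
    # open() autovivifying an empty scalar.
    $filehandles[$len] = IO::File->new("> $len.words")
        or die "can't open $len.words: $!";
}
$filehandles[$len]->print("abc\n");

$_->close for grep { defined } @filehandles;
```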

    Here's a bit of a screencapture of the whole procedure...

    rdice@tanru:~/tmp/test$ rm *.words; ./ 100 | ./
    rdice@tanru:~/tmp/test$ wc -l *.words
          8 1.words
         11 10.words
          8 2.words
         15 3.words
          7 4.words
         11 5.words
         10 6.words
          9 7.words
         10 8.words
         11 9.words
        100 total


      Actually, the create-a-filehandle-on-demand approach is not as efficient as building the array (on any box with a reasonable amount of memory, that is). If you have a look at my earlier comment, this is exactly the technique that I used in my solution, and it required some counter-intuitive tweaking before it became competitive with the array-building technique.

      The reason seems to be twofold. First, the reading loop has a higher overhead because it needs to do an extra operation per line read. Second, there seems to be an overhead associated with swapping between cached filehandles. Before my solution was competitive (and it started life looking much the same as your solution), I had to replace the open-on-demand with a pre-open for each file (which obviously is only feasible in some situations).
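      The pre-open variant described above might look something like this (my sketch, assuming the word lengths are bounded in advance, and using a fixed word list in place of `<>` input for illustration):

```perl
use strict;
use warnings;

# Pre-open one handle per possible length before the loop, so the
# loop itself does nothing but an array lookup and a print.
my $max_len = 10;    # assumed upper bound on word length
my @filehandles;
for my $len ( 1 .. $max_len ) {
    open $filehandles[$len], '>', "$len.words"
        or die "can't open $len.words: $!";
}

for my $word ( "aa\n", "aaaa\n", "a\n", "aa\n" ) {
    my $len = length($word) - 1;    # don't count the newline
    print { $filehandles[$len] } $word if $filehandles[$len];
}

close $_ for grep { defined } @filehandles;
```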

      Anyway, nice explanation of the technique.

      Yves / DeMerphq
      Writing a good benchmark isn't as easy as it might look.
