in reply to Extreme Example of TMTOWTDI
Care to test this one as well?
I had the following results (25143 words):
ChOas@xxx$ time ./yours words
real    0m1.21s
user    0m1.01s
sys     0m0.07s
ChOas@xxx$ time ./mine words
real    0m0.68s
user    0m0.54s
sys     0m0.05s
ChOas@xxx$
This is the code:
#!/usr/bin/perl -w
use strict;

my %Word;
my $File;

# Fill the hash/list here, ignoring the chomp
push @{$Word{length $_}}, $_ while (<>);

# Iterate over the different lengths of words
for (keys %Word) {
    # This is what is called a dirty hack: subtract
    # one from the length of the string to compensate
    # for the '\n' (but hey, it saves a chomp for every word :) )
    $File = $_ - 1 . ".mine";

    # Write the file
    open OUT, ">$File" or die "Cannot open $File: $!\n";
    print OUT @{$Word{$_}};
    close OUT;
}
p.s. You REALLY want to run/develop code with use strict; and -w...
GreetZ!,
- ChOas
print "profeth still\n" if /bird|devil/;
Replies are listed 'Best First'.
Re: (FASTER) Example of TMTOWTDI
by gmax (Abbot) on Mar 22, 2002 at 10:01 UTC
Your script saves time by skipping chomp. My addition will save time by avoiding a hidden join for each array interpolation. I found this trick when I was benchmarking anagram algorithms. I used a 116,246-word dictionary, and I got a significant speed improvement, as you can see from the relative times. Change into
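The changed snippet itself is behind the d/l link, so the following is only a hypothetical illustration of the hidden join gmax mentions, using the variable names from ChOas's script: interpolating an array into a double-quoted string makes Perl join its elements into one temporary string first, while passing the array as a list to print avoids that intermediate copy.

```perl
# Interpolation: Perl joins the array into one temporary string, then prints it.
print OUT "@{$Word{$_}}";

# List form: the elements are printed directly, with no hidden join.
print OUT @{$Word{$_}};
```

(The two also differ in output: interpolation inserts the list separator, a space by default, between elements, which is another reason to be deliberate about which form you use.)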
BTW: ChOas, congratulations on your sanctification!

update There is always room for improvement, though. ... Combining the first array solution with the other tricks, we have a further 10% improvement.

update (2) I am afraid that I must have spoiled demerphq's comparison with my previous update. ;) He very chivalrously included that update and compared the methods shown so far, offering an improvement that was unfortunately slower than this array + concatenation hack. Nice job. Keep up the good work, bro, and many thanks for your analysis.

 _  _  _  _
(_|| | |(_|><
 _|
by demerphq (Chancellor) on Mar 22, 2002 at 11:53 UTC
I figured the overhead of loading the thing into memory and then spitting it out was unnecessary, so I wrote a variant that opens the files first and then writes to the appropriate one for each line. Trouble was, it wasn't faster. :( So after a bit of tweaking and some less than pretty (IMO) code, I managed to soup it up to be the same speed or just slightly faster than your concatenation version. (Unfortunately, in the interest of fairness, I applied those optimizations that I could to your variant, and once done it's faster than my version. :-) Incidentally, I used a dictionary of 320k words, ranging from 2 to 29 chars. I'm pretty sure that under some memory sizes the approach I took would be more appropriate than the other two (and vice versa, that on some OSes the approach would be impractical due to restrictions on the number of filehandles allowed), but I was surprised that it wasn't faster to start off with. It seems the overhead of switching between filehandles is higher than I thought.
BTW: I just completed a crossword/scrabble solving engine in Perl. One of the avenues that I pursued was storing words based on their length. This strategy turned out to be inappropriate. I ended up using a custom hashing algorithm where the hashed value had semantic meaning about the words it matched. This enabled me to rule out 10s-1000s of words with one binary AND. (I ended up getting a turn of a scrabble game down to around 2-3 secs.) If anybody's interested in it, let me know. Here's the code I benchmarked:
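The benchmarked code is behind the d/l link, so here is a reconstructed sketch (not demerphq's actual code) of the open-the-files-first variant described above, assuming word lengths of 2 to 29 as stated:

```perl
#!/usr/bin/perl -w
use strict;

# Pre-open one output handle per possible word length (2..29 per the
# dictionary described above), so the read loop never has to call open().
my %fh;
for my $len ( 2 .. 29 ) {
    open $fh{$len}, ">", "$len.words"
        or die "Cannot open $len.words: $!\n";
}

while (<>) {
    my $len = length($_) - 1;           # -1 compensates for the "\n"
    print { $fh{$len} } $_ if $fh{$len};
}

close $_ for values %fh;
```

Whether this beats the slurp-then-write version depends, as the post notes, on available memory and on the OS's limit on open filehandles.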
Yves / DeMerphq
Re: Re: Extreme Example of TMTOWTDI
by Cody Pendant (Prior) on Mar 23, 2002 at 07:18 UTC
I have to admit I don't really get that one. I'm realising there's a lot I don't know about hashes of arrays and arrays of hashes. My preferred solution to the problem would have gone like this (pseudocode!):
but, as I'm sure you all know, I couldn't do that.
Was I even close? What's the hash way to do this?
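Not from the thread, but a minimal sketch of "the hash way": build a hash of arrays keyed on word length, then write each bucket out (the file names and the chomp here are my own choices, not anyone's benchmarked code):

```perl
#!/usr/bin/perl -w
use strict;

# Hash of arrays: key = word length, value = list of words of that length.
my %by_len;
while ( my $word = <> ) {
    chomp $word;
    push @{ $by_len{ length $word } }, $word;
}

# One output file per length.
for my $len ( keys %by_len ) {
    open my $out, ">", "$len.words" or die "Cannot open $len.words: $!\n";
    print $out "$_\n" for @{ $by_len{$len} };
    close $out;
}
```

The push onto @{ $by_len{...} } autovivifies the array the first time a given length is seen, which is the part of hashes-of-arrays that usually trips people up.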
"You REALLY want to run/develop code with use strict; and -w" — I do most of the time, really. Honestly. No, I do. Sometimes anyway. I have got "or die $!" on my file opens at least; I'm getting better.
by Dice (Beadle) on Mar 24, 2002 at 18:12 UTC
Here's something else you can consider: anonymous open filehandles aggregated and managed within a container. Huh? Check it out... I created two programs: make_words.pl, which creates some number of words (well, strings of "a" of length between 1 and 10), and file_words.pl, which puts them into files named LENGTH.words. I run them together with:

./make_words.pl 100 | ./file_words.pl

(The 100 is the number of "words" I want to generate.)

make_words.pl
#!/usr/bin/perl -w
use strict;

foreach ( 1 .. $ARGV[0] ) {
    my $len = (int(10 * rand())) + 1;
    print(('a' x $len), "\n");
}
exit 0;

file_words.pl
#!/usr/bin/perl -w
use strict;
# use Data::Dumper;

my @filehandles = ();

while ( my $word = <> ) {
    my $len = length($word) - 1;
    # print Dumper(\@filehandles);
    unless ( $filehandles[$len] ) {
        open($filehandles[$len], "> $len.words") or do {
            warn "Cannot open $len.words: $!\n";
            next;
        };
    }
    print {$filehandles[$len]} $word;
}

foreach ( @filehandles ) {
    close($_) if $_;
}
exit 0;

make_words.pl is too simple for analysis. (Right?)

In file_words.pl, I create an array that is meant to store my anonymous filehandles. Of course, there isn't anything in it at the beginning of the program. Looping through all the words, I determine the length of the word in question. (Why bother chomp-ing? We'll use the newline later.) Then I want to write the word to the file. The next thing to do is to open an appropriately-named filehandle for writing... unless there already is an appropriate filehandle, in which case I just print the word to that filehandle. Notice that I'm using a scalar as my filehandle. Perl will autovivify an anonymous filehandle and assign it to that scalar, assuming it's an assignable scalar... such as what can be found in an array element. (The array is my aggregator: I aggregate [collect] my filehandles within it.) At the end of the program, I go through my array and close any filehandles that are stored within it. I don't just want to close every element in the array, as some of them might be undef values.

Note that if you uncomment the two commented lines, you'll get some Data::Dumper output that shows you the gradual population of elements within the @filehandles array with anonymous filehandles. Note also that older versions of Perl 5 won't support this kind of filehandle autovivification of empty assignable scalars. (In which case, you can still use this technique, but with minor modifications. Ask me about this later if you like.) Here's a bit of a screen capture of the whole procedure:
rdice@tanru:~/tmp/test$ rm *.words; ./make_words.pl 100 | ./file_test.pl
rdice@tanru:~/tmp/test$ wc -l *.words
   8 1.words
  11 10.words
   8 2.words
  15 3.words
   7 4.words
  11 5.words
  10 6.words
   9 7.words
  10 8.words
  11 9.words
 100 total
Cheers,
by demerphq (Chancellor) on Mar 25, 2002 at 12:03 UTC
The reason seems to be twofold. First, the reading loop has a higher overhead because it needs to do an extra operation per line read. Second, it seems that there is an overhead associated with swapping between cached filehandles. Before my solution was competitive (and it started life looking much the same as your solution), I had to replace the open-on-demand with a pre-open for each file (which obviously is only feasible in some situations). Anyway, nice explanation of the technique.
Yves / DeMerphq