in reply to Perl's pearls
I'd do the duplicate elimination in the outer loop:
while (<DATA>) {
  chomp;
  my $key = lc $_;
  next if $seen{$key}++;
  my $signature = join "", sort split //, $key;
  push @{$words{$signature}}, $_;
}
for (sort keys %words) {
  my @list = sort @{$words{$_}};
  next unless @list > 1;
  print "@list\n";
}
__END__
Abby
abbot
acne
alert
alter
baby
Baby
BABY
best
bets
cane
later
Unix
UNIX
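The heart of this version is the "signature": lowercasing a word and sorting its letters produces the same string for every anagram in a set, so the hash groups them automatically. A minimal standalone sketch of just that idea (the sample words are illustrative):

```perl
use strict;
use warnings;

# A word's signature: its letters, lowercased and sorted.
# All anagrams of a word share the same signature.
sub signature { join "", sort split //, lc shift }

print signature("Alert"), "\n";   # prints "aelrt"
print signature("later"), "\n";   # prints "aelrt"
print "anagrams\n" if signature("Alert") eq signature("later");
```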
-- Randal L. Schwartz, Perl hacker
Re: Re: Perl's pearls
by gmax (Abbot) on Dec 31, 2001 at 21:48 UTC
Your script is very elegant: shorter and functionally equivalent.
However, this simplicity comes at a price. The second hash used to eliminate duplicates is a heavy load on performance, especially after adding "use strict" and -w on top.
I compared the execution times on a dictionary of 102_759 words.
$ time perl gmax_anagram.pl < words |wc
4.26user 0.07system 0:04.32elapsed
6427 14802 107910
$ time perl merlyn_anagram.pl < words |wc
6.35user 0.10system 0:06.45elapsed
6427 14802 107910
(after adding use strict and -w)
$ time perl merlyn_anagram.pl < words |wc
7.44user 0.12system 0:07.55elapsed
6427 14802 107910
Anyway, this approach is one more thing worth knowing, for when I want to favor a shorter script over a faster one.
Thanks.
_ _ _ _
(_|| | |(_|><
_|
OK, let's get rid of that big hash, if you think that's where the trouble is:
WORD: while (<DATA>) {
  chomp;
  my $key = lc $_;
  my $signature = join "", sort split //, $key;
  my $list = $words{$signature} ||= [];
  lc $_ eq $key and next WORD for @$list;
  push @$list, $_;
}
for (sort keys %words) {
  my @list = sort @{$words{$_}};
  next unless @list > 1;
  print "@list\n";
}
__END__
Abby
abbot
acne
alert
alter
baby
Baby
BABY
best
bets
cane
later
Unix
UNIX
How's that do?
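The one subtle line here is the duplicate check: `next WORD` sits inside a statement-modifier `for`, and jumps straight to the next iteration of the labeled outer loop. A small sketch of the same control flow, using a labeled bare block in place of the while loop (the stored list is illustrative):

```perl
use strict;
use warnings;

my @stored = ("Baby", "Unix");   # words already filed under a signature
my $key    = "baby";             # lowercased candidate word

WORD: {
    # If any stored word lowercases to $key, `next WORD` leaves the
    # labeled block before the push is reached -- the same escape that
    # skips to the next input word in the script above.
    lc $_ eq $key and next WORD for @stored;
    push @stored, "BABY";
}
print scalar(@stored), "\n";   # prints "2": the duplicate was never pushed
```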
-- Randal L. Schwartz, Perl hacker
I have to confess that, for my relatively young Perl experience, this second example is not as easy to read as the first one.
Anyway, this second script is even slower, because it accesses the hash more than needed. Access during the input phase is minimal, but during the output phase the hash is used twice: once to extract the keys and once to get the values (@list = sort @{$words{$_}}). That amounts to scanning an 85_000-item hash twice.
Moreover, this script sorts the complete hash, instead of sorting only after filtering out the items that have anagrams.
As an example, if we replace the second block in the script with this line
print map  { "@$_\n" }
      sort { $a->[0] cmp $b->[0] }
      grep { @$_ > 1 } values %words;
it becomes almost 20% faster.
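For the record, here is that replacement block run against a tiny hash of arrays (the data is made up for illustration); grep discards the singletons before sort ever sees them:

```perl
use strict;
use warnings;

my %words = (
    aelrt => [ "alert", "alter", "later" ],
    acen  => [ "acne", "cane" ],
    abby  => [ "Abby" ],            # no anagrams: grep drops it
);

# Filter first, then sort only the survivors by their first word.
my $out = join "",
          map  { "@$_\n" }
          sort { $a->[0] cmp $b->[0] }
          grep { @$_ > 1 }
          values %words;
print $out;   # "acne cane\nalert alter later\n"
```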
Any further difference in speed can be attributed to the relative efficiency of strings vs arrays (more on this topic later).
My school of programming is quite pragmatic. Since I usually work with large chunks of data that I get from databases, I learned how to minimize the heavy loads in a program.
In this particular case, I know that we are dealing with a potentially huge hash. Every unnecessary access to such structure makes the program slower.
I want to find an acceptable compromise between readability and efficiency. Maximum efficiency could be achieved at the price of readability, and maximum readability could challenge efficiency.
About readability, while there are style principles that might give a common base, reading advanced programs requires some advanced knowledge. Therefore, readability is subjective, and it is a blend of language knowledge and style principles.
Back to business: I made a new version of my script, modified for speed. No warnings, no strict, and no declarations (but I tried it with everything enabled before presenting it here). I think it is easily readable, except maybe the last line, which I explained in the main node (remembering my first days with Perl).
while (<>) {
  chomp;
  $_ = lc $_;
  $signature = pack "C*", sort unpack "C*", $_;
  if (exists $words{$signature}) {
    next if $words{$signature} =~ /\b$_\b/;
    $words{$signature} .= " ";
  }
  $words{$signature} .= $_;
}
print join "\n", sort grep {tr/ //} values %words;
print "\n";
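Two idioms in this script are worth unpacking on their own. `pack "C*", sort unpack "C*", $_` builds the signature from the word's byte values (any consistent ordering of the bytes works as a hash key; a numeric sort makes it read alphabetically), and `grep {tr/ //}` keeps only hash values containing a space, i.e. more than one word. A standalone sketch under those assumptions, with an illustrative word list:

```perl
use strict;
use warnings;

# Signature via pack/unpack: the word becomes a list of byte values,
# the values are sorted, and packed back into a string.
sub signature { pack "C*", sort { $a <=> $b } unpack "C*", lc shift }

my %words;
for (qw(alert later acne cane abbot)) {
    my $sig = signature($_);
    $words{$sig} .= ( defined $words{$sig} ? " " : "" ) . $_;
}

# tr/ // counts spaces in $_: only multi-word values survive the grep.
my @sets = sort grep { tr/ // } values %words;
print "$_\n" for @sets;   # prints "acne cane" then "alert later"
```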
This script touches the hash three times directly, plus twice conditionally. The first access is made with the exists function. If that test is true, two more accesses are performed, but only for items that have anagrams or duplicates (in our case, about 15% of the items). Beyond that, the hash is accessed once to insert each word and once at the end to get the results.
It runs in under 4 seconds for the 100_000 words I collected, while merlyn's second script runs in 6.7 seconds.
I don't want to start a competition with anybody (especially not merlyn, whom I admire and respect), but I would just like to point out that my script, more than a matter of taste, is the result of some research on efficiency issues, as I have already stated in my main node.
I benchmarked the resource-consuming parts of this short script, and my choice of pack over split, and of strings over arrays, comes from timing their relative performance.
In particular, I benchmarked extensively the performance of hashes of strings vs hashes of arrays.
There are three operations that affect this data structure in our anagrams application:
1. append an item at the end;
2. count how many items are in the array or string;
3. fetch all the items at once (string interpolation).
In two of these operations (1 and 3), strings are faster. If my application only needs operations 1 to 3 (i.e. with no access to the individual items), strings are still faster, since the speed of insertion and fetching compensates for the slower counting. Arrays are faster only if I want to access items one by one.
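The three operations look like this side by side, for a value stored once as a space-separated string and once as an array (a minimal sketch; the counting convention for the string form is separators + 1):

```perl
use strict;
use warnings;

my $str = "best";          # string form: words joined by single spaces
my @arr = ("best");        # array form

# 1. append an item at the end
$str .= " bets";
push @arr, "bets";

# 2. count the items: tr counts the spaces, so add 1 for the string form
my $n_str = 1 + ($str =~ tr/ //);
my $n_arr = scalar @arr;

# 3. fetch everything at once: interpolating an array joins it with
#    spaces, so both forms print identically
print "$str\n";    # prints "best bets"
print "@arr\n";    # prints "best bets"
```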
An explanation is necessary for the slower performance of arrays in string interpolation.
my @array = qw(one two three);
print @array;
# output : 'onetwothree'
# it's the same as foreach(@array) {print $_}
print "@array";
# output : 'one two three'
# it's the same as print join " ", @array;
The above code fragment shows the effects of string interpolation. An array is merged into a string with its items separated by a space. This is standard Perl behavior. This operation is roughly the same as using join on the array explicitly and this fact should account for the slower performance.
For small hashes the difference is almost insignificant, and thus I would prefer to use an array, to have a cleaner data structure. In my anagrams script I preferred strings because I am dealing with potentially huge input.
The following is the benchmarking code I used to evaluate the relative speed of these structures.
#!/usr/bin/perl -w
use strict;
use Benchmark;

my $iterations = 200_000;
my %with_str;        # hash containing strings
my %with_arr;        # hash containing arrays
my $strcount = 0;    # counter for hash of strings
my $arcount  = 0;    # counter for hash of arrays
my ($constant1, $constant2) =
   ("abcd", "dcba");     # strings used to fill the items

timethese ($iterations,  # inserts two elements per each hash value
  {
    "insert string" => sub {
        $with_str{$strcount} .= "$constant1$strcount";
        $with_str{$strcount++} .= " $constant2$strcount"
    },
    "push array" => sub {
        push @{$with_arr{$arcount}}, "$constant1$arcount";
        push @{$with_arr{$arcount++}}, "$constant2$arcount"
    }
  });

my $count = 0;
$arcount  = 0;
$strcount = 0;
timethese ($iterations,  # counts items for each hash value
  {
    "count string items" => sub {
        $count = $with_str{$strcount++} =~ tr/ //;
    },
    "count array items" => sub {
        $count = scalar @{$with_arr{$arcount++}}
    }
  });

$arcount  = 0;
$strcount = 0;
my $output = "";
timethese ($iterations,  # string interpolation
  {
    "fetch string" => sub {
        $output = "$with_str{$strcount++}"
    },
    "fetch array" => sub {
        $output = "@{$with_arr{$arcount++}}"
    }
  });

$count    = 0;
$arcount  = 0;
$strcount = 0;
timethese ($iterations,  # access separate items
  {
    "items from string" => sub {
        foreach (split / /, $with_str{$strcount}) {
            $output = $_;
        }
        $strcount++;
    },
    "items from array" => sub {
        foreach ( @{$with_arr{$arcount}} ) {
            $output = $_;
        }
        $arcount++;
    }
  });
=pod

Benchmark:
timing 200000 iterations of insert string, push array...
      insert string:  3 wallclock secs ( 1.92 usr + 0.14 sys = 2.06 CPU)
         push array:  3 wallclock secs ( 2.39 usr + 0.15 sys = 2.54 CPU)
timing 200000 iterations of count array items, count string items...
 count string items:  2 wallclock secs ( 0.83 usr + 0.00 sys = 0.83 CPU)
  count array items:  0 wallclock secs ( 0.64 usr + 0.00 sys = 0.64 CPU)
timing 200000 iterations of fetch array, fetch string...
       fetch string:  1 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU)
        fetch array:  2 wallclock secs ( 1.13 usr + 0.00 sys = 1.13 CPU)
timing 200000 iterations of items from array, items from string...
  items from string:  2 wallclock secs ( 2.65 usr + 0.07 sys = 2.72 CPU)
   items from array:  1 wallclock secs ( 2.02 usr + 0.07 sys = 2.09 CPU)

totals (inserting, counting items in each hash value,
        and fetching all the values at once)
  string: 3.34
  array : 4.16

totals (inserting, counting items,
        and fetching items one by one from each hash value)
  string: 5.40
  array : 5.05

=cut
_ _ _ _
(_|| | |(_|><
_|
It seems like the main improvement/optimization would be not looping twice through the list of all words.  Move *all* processing into the main loop:
my (%word, %gram);
while (<>) {
  chomp;
  # $_ = lc $_;
  /[^a-z]/ and next;
  my $sig = pack "C*", sort unpack "C*", $_;
  if (exists $word{$sig}) {
    if (exists $gram{$sig}) {
      next if $gram{$sig} =~ /\b$_\b/;
      $gram{$sig} .= " $_";                # rare
    }
    else {
      next if $word{$sig} eq $_;
      $gram{$sig} = "$word{$sig} $_";      # rare
    }
  }
  else {
    $word{$sig} = $_;                      # mostly
  }
}
print join "\n", (sort values %gram), '';  # just output short list
Only the first word of an anagram set is in both lists.
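The core of the single-pass idea can be reduced to a few lines: %word remembers the first word seen per signature, and %gram is created only when a second, different word shows up, so the common case touches just one small entry. A trimmed sketch without the duplicate guard, on illustrative input:

```perl
use strict;
use warnings;

my (%word, %gram);
for (qw(alert later alter acne abbot)) {
    my $sig = join "", sort split //, $_;
    if (exists $word{$sig}) {
        # Second or later word for this signature: grow the anagram set,
        # seeding it with the first word kept in %word.
        $gram{$sig} = exists $gram{$sig}
                    ? "$gram{$sig} $_"
                    : "$word{$sig} $_";
    }
    else {
        $word{$sig} = $_;    # the common case: first sighting
    }
}
print "$_\n" for sort values %gram;   # prints "alert later alter"
```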
Here's some more finds, mostly from the short OED:
ablest bleats stable tables
adroitly dilatory idolatry
angered derange enraged grandee grenade
ascertain cartesian sectarian
asleep elapse please
aspirant partisan
attentive tentative
auctioned cautioned education
canoe ocean
comedian demoniac
compile polemic
covert vector
danger gander garden
deist diets edits idest sited tides
emits items metis mites smite times
emitter termite
lapse leaps pales peals pleas
nastily saintly
obscurantist subtractions
observe obverse verbose
opt pot top
opts post pots spot stop tops
opus soup
oy yo
petrography typographer
peripatetic precipitate
present repents serpent
presume supreme
resin rinse risen siren
salivated validates
slitting stilting tiltings titlings tlingits
views wives
vowels wolves
woodlark workload
|
Brilliant! On my computer, your script is 13% faster than mine, using my 100_000-word list. With the one you suggested (thanks, BTW), which is more than twice the size, the gain is 23%!
That means your solution is more scalable and thus better suited for this kind of task.
Like every "eureka" solution, your improvement looks quite simple,
now that I see it! :-)
Thanks.
_ _ _ _
(_|| | |(_|><
_|
Re^2: Perl's pearls
by dominick_t (Acolyte) on Oct 27, 2017 at 00:41 UTC
Might somebody on this thread be able to walk me through certain aspects of how this code works? I understand the general principle, but I am missing some details, and I would like to understand it precisely. I have been reading all the documentation I can, and am figuring it out in bits and pieces, but I need to understand the behavior of this program perfectly, as I want to write a program to accomplish a similar task.
$ perl -d pm_135391.pl
Loading DB routines from perl5db.pl version 1.49_001
Editor support available.
Enter h or 'h h' for help, or 'man perldebug' for more help.
main::(pm_135391.pl:1): while (<DATA>) {
DB<1> n
main::(pm_135391.pl:2): chomp;
DB<1> p "<$_>"
<Abby
>
So here, we started the program under the debugger, executed the first statement (loading $_ from <DATA>), then displayed the value using the p command. (I used "<$_>" to put delimiters around the value, so you can see that it still has the end-of-line characters in it.)
DB<2> n
main::(pm_135391.pl:3): my $key = lc $_;
DB<2> n
main::(pm_135391.pl:4): next if $seen{$key}++;
DB<2> n
main::(pm_135391.pl:5): my $signature = join "", sort split //, $key;
DB<2> x %seen
0 "abby \cM"
1 1
Here, we executed the next three lines to chomp the value, make the $key value, and add $key to the %seen hash. Then I used the x command to display the %seen hash, so you can see what's in it.
DB<3> n
main::(pm_135391.pl:6): push @{$words{$signature}}, $_;
DB<3> x $signature
0 "\cM abby"
DB<4> h
List/search source lines:              Control script execution:
  l [ln|sub]  List source code           T            Stack trace
  - or .      List previous/current line s [expr]     Single step [in expr]
  v [line]    View around line           n [expr]     Next, steps over subs
  f filename  View source in file        <CR/Enter>   Repeat last n or s
  /pattern/ ?patt?  Search forw/backw    r            Return from subroutine
  M           Show module versions       c [ln|sub]   Continue until position
Debugger controls:                       L            List break/watch/actions
  o [...]     Set debugger options       t [n] [expr] Toggle trace [max depth] ][trace expr]
  <[<]|{[{]|>[>] [cmd] Do pre/post-prompt b [ln|event|sub] [cnd] Set breakpoint
  ! [N|pat]   Redo a previous command    B ln|*       Delete a/all breakpoints
  H [-num]    Display last num commands  a [ln] cmd   Do cmd before line
  = [a val]   Define/list an alias       A ln|*       Delete a/all actions
  h [db_cmd]  Get help on command        w expr       Add a watch expression
  h h         Complete help page         W expr|*     Delete a/all watch exprs
  |[|]db_cmd  Send output to pager       ![!] syscmd  Run cmd in a subprocess
  q or ^D     Quit                       R            Attempt a restart
Data Examination:  expr  Execute perl code, also see: s,n,t expr
  x|m expr       Evals expr in list context, dumps the result or lists methods.
  p expr         Print expression (uses script's current package).
  S [[!]pat]     List subroutine names [not] matching pattern
  V [Pk [Vars]]  List Variables in Package.  Vars can be ~pattern or !pattern.
  X [Vars]       Same as "V current_package [Vars]".  i class inheritance tree.
  y [n [Vars]]   List lexicals in higher scope <n>.  Vars same as V.
  e     Display thread id     E  Display all thread ids.
For more help, type h cmd_letter, or run man perldebug for all docs.
Then, I decided I showed you enough ;^) and had it print the help page. Have fun with it!
...roboticus
When your only tool is a hammer, all problems look like your thumb.