Dear fellow Monks,
I would like to share with you some notes on software implementation. Some of the technical content of this node has already been discussed in the Monastery, but I would like to present the issue in a more general context.
In his book "Programming Pearls", Jon Bentley describes an algorithm to find all anagrams in a dictionary. This is an interesting problem, because, until not long ago, CS textbooks used to present anagrams as a task for a brute-force recursive attack, close to intractability. Bentley's refreshing view has shown us that several seemingly gigantic chores, when approached from a different angle, become almost routine.
A simplified description of this anagram method is a four-step task:
1. for each word in the dictionary, compute a unique "signature" by sorting the word's letters;
2. print a space-separated list of the signature plus the word;
3. sort the list;
4. read the sorted list and put together the words with the same signature (now adjacent within the list). They are the anagrams.
5. (not mentioned, but necessary) filter from the list the signatures with more than one word and sort them.
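Step 1, the signature, can be sketched in a few lines of Perl (a minimal sketch, using join/split for clarity; the full script further down uses pack/unpack instead):

```perl
#!/usr/bin/perl
# Sketch of step 1: a word's "signature" is its letters, sorted.
# Anagrams, and only anagrams, share a signature.
use strict;
use warnings;

sub signature {
    my ($word) = @_;
    return join "", sort split //, lc $word;
}

print signature("alert"), "\n";   # aelrt
print signature("later"), "\n";   # aelrt -- same signature: anagrams
print signature("abbot"), "\n";   # abbot
```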

His implementation in C uses two programs (one to create the "signatures", the other to flatten two or more adjacent entries into a single one) plus the system sort. Even though he doesn't mention it, a third program is also needed, to filter out of the final list the words without anagrams.
I wanted to test the implementation of the same algorithm in Perl. It is much simpler, resulting in a single 15-line script (some Golf implementations can be seen at this node and this one). The external sort plus the flattening program are replaced by inserting words into a hash, which handles the problem very efficiently. No external filtering is needed: the final two steps, filtering and printing, are combined into one step in Perl.
Of course it is not as fast as C, but the Perl program is less than one third as complex. The three C programs together amount to 50 lines of code, and the compiled results have to be glued together with two calls to the system sort:
./sign < dictionary | sort | ./squash | ./filter_anagrams | sort
The Perl script is simpler and more powerful than its C counterparts. Implementing in C the same input/output handling as in my Perl application would not be trivial. Having had my share of C programming, I know for a fact that doing in C what boils down to those 15 lines of Perl can teach me many new curses. :-)
Back to our anagrams in Perl. The principle is simple: a hash will create the candidates, using the sorted letters of each word as key (the "signature") so that two or more words with the same signature will end up together.
Here is the pseudocode
---------------------------
create an empty words_hash
while (input available)
    read one word
    windex = sorted lowercase word
    skip word duplicates
    add word to words_hash with key = windex
foreach windex in words_hash
    remove unique words    # = signatures with one word only
sort words_hash
foreach item in words_hash
    print anagrams
----------------------------
example: given this input file
Abby abbot acne alert alter baby Baby BABY best bets cane later Unix UNIX
%words will become (notice that the duplicates Baby, BABY and UNIX were skipped)
key        values
-----      ---------
abby   =>  abby baby
abbot  =>  abbot
acen   =>  acne cane
best   =>  best bets
inux   =>  unix
aelrt  =>  alert alter later
and then, skipping the unique signatures "abbot" and "inux", it will print (values only)
abby baby
acne cane
best bets
alert alter later
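The worked example above can be reproduced with a short self-contained sketch (signatures via join/split here; the actual script below uses pack/unpack):

```perl
#!/usr/bin/perl
# Sketch: build %words from the sample list above and print only
# the values holding two or more words (the anagrams).
use strict;
use warnings;

my @input = qw(Abby abbot acne alert alter baby Baby BABY
               best bets cane later Unix UNIX);

my %words;
for (@input) {
    my $word   = lc $_;
    my $windex = join "", sort split //, $word;   # the signature
    if (exists $words{$windex}) {
        next if $words{$windex} =~ /\b$word\b/;   # skip duplicates
        $words{$windex} .= " ";
    }
    $words{$windex} .= $word;
}
print map {"$_\n"} sort grep {tr/ //} values %words;
```

which prints the four anagram groups shown above, one per line.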
Enough introduction. Here is the actual working script
#!/usr/bin/perl -w
# usage: < wordlist
use strict;

my %words = ();                     # words index (Key: sorted
                                    # word. Value: word(s))
while (<>) {
    chomp;                          # remove newline character
    $_ = lc $_;                     # converts to lowercase
    my $windex = pack "C*", sort unpack "C*", $_;   # create index
    if (exists $words{$windex}) {
        next if $words{$windex} =~ /\b$_\b/;   # skip duplicates
        $words{$windex} .= " ";                # add separating space
    }
    $words{$windex} .= $_;                     # inserts a new word
}
print map {"$_\n"} sort grep {tr/ //} values %words;

And now, some Perl technical notes
(A) The last line of the script is roughly equivalent to the following
my @anagrams = ();
foreach (values %words) {
    push @anagrams, $_ if tr/ //;
    # tr returns the number of processed characters,
    # spaces in this case. Therefore, if spaces are > 0,
    # it means that there are at least two words, and
    # we have some anagrams. If the space count is 0,
    # the word is unique and we skip it.
    # grep returns all the items that satisfy the
    # filtering condition (= are there any spaces?)
}
# map does the same as this foreach loop
foreach (sort @anagrams) {
    print "$_\n";
}
(B) Rejecting duplicates could be only *slightly* faster if implemented with grep
if (exists $words{$windex}) {
    my $word = $_;
    next if grep {$_ eq $word}
        split / /, $words{$windex};   # skip duplicates
    $words{$windex} .= " ";           # add separating space
}
(C) Using an anonymous array as hash value, instead of a space-separated string, makes the program shorter, but (surprisingly) slower.
#!/usr/bin/perl -w
use strict;

my %words = ();
while (<>) {
    chomp;
    my $word = lc $_;
    my $windex = pack "C*", sort unpack "C*", $word;
    next if (exists $words{$windex})
        and grep {$_ eq $word} @{$words{$windex}};
    push @{$words{$windex}}, $word;
}
print map {"@$_\n"} sort grep {scalar @$_ > 1} values %words;
After benchmarking, it seems that appending to and fetching a string is less expensive than pushing onto and fetching (with interpolation) an array, while counting items is faster with arrays. I won't go into more detail here, since this node is already a long one.
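That comparison can be sketched with the standard Benchmark module. The word list below is a made-up random sample, not the real dictionary, so only the relative rates are meaningful:

```perl
#!/usr/bin/perl
# Sketch: time string concatenation vs. anonymous arrays as hash
# values. @sample is an invented list of random 5-letter words.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @sample = map { join "", map { ('a'..'e')[rand 5] } 1 .. 5 } 1 .. 1000;

cmpthese(200, {
    string => sub {
        my %h;
        $h{ join "", sort split //, $_ } .= " $_" for @sample;
        # every word gets a leading space here, so > 1 space
        # means at least two words (an anagram group)
        my @anagrams = grep { tr/ // > 1 } values %h;
    },
    array => sub {
        my %h;
        push @{ $h{ join "", sort split //, $_ } }, $_ for @sample;
        my @anagrams = grep { @$_ > 1 } values %h;
    },
});
```

cmpthese prints a table comparing the two rates; the iteration count of 200 is arbitrary.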

(D) Creating the word signature with
pack "C*", sort unpack "C*", $_
is faster than using
join "", sort split //, $_;
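This can be checked with a small sketch. One subtlety: unpack "C*" yields character codes, which Perl's default sort compares as strings, so the pack form may order the letters differently from the join form; each form is internally consistent, though, which is all a signature needs:

```perl
#!/usr/bin/perl
# Sketch: both signature forms group anagrams correctly;
# Benchmark times them against each other.
use strict;
use warnings;
use Benchmark qw(cmpthese);

sub sig_pack { pack "C*", sort unpack "C*", $_[0] }
sub sig_join { join "",   sort split  //,   $_[0] }

print sig_pack("alert") eq sig_pack("later") ? "same\n" : "differ\n";  # same
print sig_join("alert") eq sig_join("later") ? "same\n" : "differ\n";  # same

cmpthese(100_000, {
    pack_unpack => sub { sig_pack("taverns") },
    join_split  => sub { sig_join("taverns") },
});
```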

(E) To make anagrams from words with accented characters (as in French and Italian), tr should produce a better result than lc
# French and Italian anagrams
my $windex = $_;
$windex =~ tr/àâäèéêëìîïòôöùûü/aaaeeeeiiiooouuu/;
                               # gives the same signature
                               # to accented and
                               # unaccented vowels
$windex =~ tr/[A-Z]/[a-z]/;    # converts to lowercase
$windex = pack "C*", sort unpack "C*", $windex;   # create index
To obtain a significant sample, I merged the standard Unix "words" dictionary with the American and British English dictionaries from the GNU ispell package (e.g. strings british.hash | perl -ne 'print if /^\w+$/') and removed duplicated words. The total was about 100_000 items. In less than 5 seconds (Linux on a Pentium III 700) I got a list of more than 14_000 anagrams, from which I report an interesting excerpt:
abolitionism mobilisation
acres cares races sacre scare
actualises casualties
aimless melissa
climaxes exclaim
collapse escallop
colonialist oscillation
detains instead sainted stained
donator odorant rotonda tornado
endoscopic picosecond
enumeration mountaineer
friary rarify
license selenic silence
obscurantist subtractions
predicts scripted
striptease tapestries
shower whores
A design error
When I was comparing the relative efficiency of the algorithm in Perl and in C, I realized that the C program was finding about 80 fewer anagrams than the Perl one. I dumped both result lists to two files and had a look at them with diff. The Perl script finds anagrams like
angrier earring rearing
arches chaser search

while the C program only gets
angrier earring
arches chaser

I cross-checked the C source code twice, and when I was convinced that the program was bug-free, I also checked the intermediate lists produced by the signature and sorting passes. I even suspected a bug in the GNU sort program (forgive me, RMS), but I dismissed that after a thorough test, comparing the results of different sorting algorithms in C and Perl. No bugs there either.
Then I started thinking about the design, and eventually I found the real reason, which lies in one assumption of the algorithm. The Perl implementation seems to be equivalent to the C one, but it is not. Those 80 extra anagrams are the result of the better accuracy of the hash compared to the sort/squash approach.
The wrong assumption is that the sort program sorts the signatures independently of their values. However, signature and value are *in the same string*, so the separating space and the beginning of the value word take part in the sorting just like any other character.
Using a hash, instead, guarantees that the signatures are independent from the associated values.
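As a quick check of this point, a sketch that feeds only the problem words through a signature-keyed hash (signatures via join/split) keeps each group of three intact:

```perl
#!/usr/bin/perl
# Sketch: a hash keyed by signature groups all three words of each
# anagram family, with no dependence on the values.
use strict;
use warnings;

my %words;
for my $w (qw(angrier earring rearing arches chaser search)) {
    my $sig = join "", sort split //, $w;   # e.g. "aeginrr"
    push @{ $words{$sig} }, $w;
}
print "@$_\n" for values %words;   # two groups of three words each
```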
The C algorithm, then, although faster, is less accurate, and needs some re-thinking, using a supporting data structure or using a customized sorting algorithm (by the first item or by length and contents).
(Having some knowledge of the GNU sort utility, however, the problem could be solved with sort -k 1,1, which sorts by the first field only. But, as Hannibal Lecter would say, this is incidental.)

A design improvement
I have mentioned several times that the C program is faster than my Perl script. But to be completely honest, I was not comparing the same thing. One of the reasons why the script is slow is that I don't trust the input. My application does two things more than the C program: it converts the input to lowercase and checks for duplicates, both of which were taken for granted in the original design.
What I should be comparing, instead, is a script that, given 'Unix' and 'UNIX' in input, reports them as anagrams. But at the moment I'd rather wait two more seconds and enjoy accurate output.

I didn't want to demonstrate that Perl is better than C (everybody on this site already has some ideas on this issue, I suspect), but I wanted to stress some useful aspects of software construction in Perl. The richness of features can greatly simplify the task of implementing a data-manipulation algorithm. The difference in performance is compensated by the greater accuracy of the output and the ease of coding, which also makes for ease of maintenance and adaptation to similar tasks (e.g. dictionaries of other languages).
I hope this exercise may be useful to somebody.

petral found an efficiency improvement of the anagram algorithm.
A hint by TomK32 led to an interesting personal challenge, i.e. multiple-word anagrams, whose first shot is here.
 _  _ _  _  
(_|| | |(_|><

In reply to Perl's pearls by gmax
