|Problems? Is your data what you think it is?|
Dear fellow Monks,
I would like to share with you some notes on software implementation. Some of the technical contents of this node have already been discussed in the Monastery, but I would like to present the issue in a more general context.
In his book "Programming Pearls", Jon Bentley describes an algorithm to find all anagrams in a dictionary. This is an interesting problem, because, until not long ago, CS textbooks used to present anagrams as a task for a brute force recursive attack, close to intractability. Bentley's refreshing view has shown us that several seemingly gigantic chores, when faced from a different angle, become almost routine.
A simplified description of this anagram method is a four steps task.
1. for each word in the dictionary, find an unique "signature" by sorting the word's letters.
2. print a space-separated list of the signature plus the word;
3. sort the list;
4. read into a new list, and put together the words with the same signature (adjacent within the list). They are the anagrams.
5. (not mentioned, but necessary) filter from the list the signatures with more than one word and sort them.
His implementation in "C" uses two programs (one to create the "signatures" and the second one to flatten two or more adjacent entries into a single one) plus the system sort. Even if he doesn't mention it, a third program is also needed, to filter off from the final list the words without anagram.
I wanted to test the implementation of the same algorithm in Perl. It is much simpler, resulting in one script of 15 lines (some Golf implementations can be seen at this node and this one). The external sorting plus the flattening program are replaced by inserting words in a hash, which deals very efficiently with the problem. No external filtering is needed. The final two steps, filtering and printing, are combined into one step only in Perl.
Of course it is not as fast as C, but the complexity of the Perl program is less than one third. The three C programs together amount to 50 lines of code, and the compiled results have to be put together with two calls to the system sort
./sign < dictionary | sort | ./squash | ./filter_anagrams |sort
The Perl script is simpler and more powerful than its C counterparts. If I wanted to implement the same handling of input/output as I have done in my Perl application it wouldn't be trivial. Having had my share of C programming, I know for a fact that doing in C what boils in those 15 lines of Perl can make me learn many new curses. :-)
Back to our anagrams in Perl. The principle is simple: a hash will create the candidates, using the sorted letters of each word as key (the "signature") so that two or more words with the same signature will end up together.
Here is the pseudocode
example: given this input file
%words will become (notice that the duplicates in BABY, Baby and UNIX were skipped)
and then, skipping the unique items "abbot" and "inux", it will print (only values)
Enough introduction. Here is the actual working script
And now, some Perl technical notes
(A) The last line of the script is roughly equivalent to the following
(B) Rejecting duplicates could be only *slightly* faster if implemented with grep
(C) Using an anonymous array as hash value, instead of a space-separated string, makes the program shorter, but (surprisingly) slower.
After benchmarking, it seems that adding and fetching a string is less expensive than pushing and fetching (with interpolation) an array, while counting items is faster in arrays. I wouldn't go into more details here, since this node is already a long one.
(D) Creating the word signature with
pack "C*", sort unpack "C*", $_
is faster than using
join "", sort split //, $_;
(E) To make anagrams from words with accented characters (like in French and Italian,) tr should produce a better result than lc
To obtain a significant sample, I merged the standard Unix "words" dictionary, the American and British English dictionaries from the GNU ispell package (eg: strings british.hash | perl -ne 'print if /^\w+$/') and removed duplicated words. The total was about 100_000 items. In less than 5 seconds (Linux on a PentiumIII 700) I got a list of more than 14_000 anagrams, from which I report an interesting excerpt:
A design error
When I was comparing the relative efficiency of the algorithm in Perl and in C, I realized that the C program was finding about 80 anagrams less than the Perl one. I dumped both result lists to two files and I had a look at them with diff. The Perl scripts finds anagrams like
angrier earring rearing
arches chaser search
while the C program only gets
I cross-checked the C source code twice, and when I was convinced that the program was bug-free, I also checked the intermediate lists produced after the signing and sorting pass. I was even suspecting a bug in the GNU sort program (forgive me, RMS), but I dismissed it after a thorough test, comparing the result of different sorting algorithms in C and Perl. No bugs there either.
Then, I started thinking about the design, and eventually I found the real reason, which is in one assumption of the algorithm. The Perl implementation seems to be equivalent to the C one, but it is not. These 80 anagrams more are the result of better accuracy of the hash against the sort/squash approach.
The wrong assumption is that the sort program is sorting the signatures independently from its values. However, signatures and values are *in the same string*, thus the separating space and the beginning of the value words are evaluated into the sorting as well as any other character.
Using a hash, instead, guarantees that the signatures are independent from the associated values.
The C algorithm, then, although faster, is less accurate, and needs some re-thinking, using a supporting data structure or using a customized sorting algorithm (by the first item or by length and contents).
(Having some knowledge of the GNU sort utility, however, the problem would be solved by using sort -k 1,1 , which sorts by the first word only. But, as Hannibal Lecter would say, this is incidental).
A design improvement
I have mentioned several times that the C program is faster than my Perl script. But to be completely honest, I was not comparing the same thing. One of the reasons for which the script is slow is that I don't trust the input. My application is doing two things more than the C program: converting the input to lowercase and checking for duplicates, which were taken for granted in the original design.
What I should be comparing, instead, is a script that, given in input 'Unix' and 'UNIX', reports them as anagrams. But at the moment I'd rather wait two seconds more and enjoy an accurate output.
I didn't want to demonstrate that Perl is better than C (everybody in this site has already some ideas on this issue, I suspect) but I wanted to stress some useful aspects of software construction in Perl. The richness of features can greatly simplify the task of implementing a data manipulation algorithm. The difference in performance is compensated by the greater accuracy of the ouput and the ease of coding, which also accounts for ease of maintenance and adaptation to similar tasks (eg: other languages dictionaries).
I hope this exercise could be useful to somebody.
petral found an efficiency improvement of the anagram algorythm.
A hint by TomK32 led to an interesting personal challenge, ie multiple words anagrams, whose first shot is here.
_ _ _ _ (_|| | |(_|>< _|