|Perl: the Markov chain saw|
Re: Re: Re: Re: Perl's pearlsby gmax (Abbot)
|on Jan 02, 2002 at 11:05 UTC||Need Help??|
I have to confess that, for my relatively young Perl experience, this second example is not as easy to read as the first one.
it becomes almost 20% faster.
Further difference in speed should be blamed on the efficiency of strings vs arrays (more on this topic later).
My school of programming is quite pragmatic. Since I usually work with large chunks of data that I get from databases, I learned how to minimize the heavy loads in a program.
In this particular case, I know that we are dealing with a potentially huge hash. Every unnecessary access to such structure makes the program slower.
I want to find an acceptable compromise between readability and efficiency. Maximum efficiency could be achieved at the price of readability, and maximum readability could challenge efficiency.
About readability, while there are style principles that might give a common base, reading advanced programs requires some advanced knowledge. Therefore, readability is subjective, and it is a blend of language knowledge and style principles.
Back to business, I made a new version of my script, modified for speed. No warnings, no strict and no declarations (but I tried it with eveything enabled before presenting it here). I think it is easily readable, except maybe the last line, for which I made an explanation in the main node (remembering my first days with Perl).
This script touches the hash three times directly plus two times conditionally. The first access is made with the exists function. If this test is true, two more accesses to the hash are performed, but only to those items that have anagrams or duplicates. In our case, about 15% of the items). Then we access the hash to insert the words and to get the results. Only once.
It runs under 4 seconds for those 100_000 words that I collected, while merlyn's second script runs in 6.7 seconds.
I don't want to start a competition with anybody (especially not merlyn, whom I admire and respect,) but I would just like to point out that my script, more than a matter of taste, is the result of some research on efficiency issues, as I have already stated in my main node.
I benchmarked the resource consuming parts of this short script, and my choice of pack vs split and strings vs arrays is due to the timing of the relative performance.
In particular, I went to benchmark extensively the performance of hashes of strings vs hashes of arrays.
There are three operations that affect this data structure in our anagrams application
1. append an item at the end;
2. count how many items in my array or string;
3. fetch all the items at once (string interpolation);
In two of these operations, (1 and 3) strings are faster. If my application only needs operations 1 to 3 (ie with no access to the items individually), strings are still faster, since the speed for insertion and fetching compensates for the slower counting. Arrays are faster only if I want to access items one by one.
An explanation is necessary for the slower performance of arrays in string interpolation.
The above code fragment shows the effects of string interpolation. An array is merged into a string with its items separated by a space. This is standard Perl behavior. This operation is roughly the same as using join on the array explicitly and this fact should account for the slower performance.
For small hashes the difference is almost insignificant, and thus I would prefer to use an array, to have a more clean data structure. In my anagrams script I preferred the strings because I am dealing with potentially huge input.
The following is my benchmarking code that I used to evaluate the relative speed of these structures.
_ _ _ _ (_|| | |(_|>< _|