
I couldn't leave well enough alone on this...

The regexp solution benefits from the very efficient regexp engine. But it is built on an algorithm whose running time grows polynomially (worse than linearly) with the length of the string. If we expand the problem to finding uniqueness in strings consisting of three-character-wide groups of alphabetical characters, that gives us a lot of room for dataset growth while still maintaining a string of unique groups. The hash solution grows at O(n), since each hash insert takes O(1) on average. I can't quite figure out how bad the regular expression approach gets as the string grows, but it's probably something like O(n^2) or worse.
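To make the O(n) claim concrete, here is a minimal sketch of the hash idea written as an explicit loop. The name groups_unique and its $width parameter are mine, purely for illustration; the benchmark further down uses a hash slice instead, which does the same work in one statement.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every $width-character group in $str is distinct.
sub groups_unique {
    my ( $str, $width ) = @_;
    my %seen;
    for ( my $pos = 0; $pos < length $str; $pos += $width ) {
        # One hash lookup/insert per group: O(1) on average, so O(n) overall.
        return 0 if $seen{ substr $str, $pos, $width }++;
    }
    return 1;
}

# Groups aaa, aab, aac are all different.
print groups_unique( 'aaaaabaac', 3 ) ? "unique\n" : "duplicate\n";
# Group aaa appears twice.
print groups_unique( 'aaaaabaaa', 3 ) ? "unique\n" : "duplicate\n";
```

Each group is visited exactly once, and the loop can bail out at the first repeat, which the hash-slice version cannot do.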

For short test strings the raw speed of the regexp engine wins out over the overhead of building a hash. But for longer strings, there's no comparison. Here's some test code:

use strict;
use warnings;

use Benchmark qw( cmpthese ) ;

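# Package globals so the benchmarked subs can see them.
our ( $tuplets, $template );

# 'aaa' .. 'caa' yields 1353 distinct three-letter groups.
$tuplets  = join '', ( 'aaa' .. 'caa' );
# One 'a3' unpack field per group.
$template = 'a3' x ( length( $tuplets ) / 3 );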

print "Test string contains ", 
      length( $tuplets ) / 3, 
      " groups.\n\n";

cmpthese( 
    -10, 
    {
        regexp => sub {
            return $tuplets !~ /^(?:.{3})*(.{3})(?:.{3})*\1/;
        },
        hash => sub {
            my %hash;
            @hash{ unpack $template, $tuplets } = ();
            return( 
                length( $tuplets ) / 3 == keys( %hash )
            );
        }
    }
);

And the results on my slow Pentium-II laptop:

Test string contains 1353 groups.

          s/iter regexp   hash
regexp      1.15     --   -98%
hash   1.84e-002  6123%     --

At first I thought my eyes were deceiving me. 1.84e-002 iterations per second? That's horrible. But then I realized that the regexp solution was so slow that Benchmark had switched to showing seconds per iteration. So with a test string of 1353 groups, the regexp approach takes 1.15 seconds per iteration, and the hash approach takes the blink of an eye (1.84e-002 seconds). Try testing 'aaa' .. 'faa'. You'll have to increase the testing time to about a minute to get reliable results out of Benchmark at that point, because the regexp approach becomes so sluggish.
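If you'd rather watch the growth than take a single data point, something like the following rough sketch times both approaches over a few string sizes. The sub names, the sizes, and the run count are arbitrary choices of mine, and '(a3)*' is just another way to spell the unpack template. The sizes are deliberately small so it finishes quickly; the numbers will depend entirely on your machine and perl build.

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Same framed regexp as in the benchmark above.
sub regexp_unique { return $_[0] !~ /^(?:.{3})*(.{3})(?:.{3})*\1/ }

# Same hash-slice idea; '(a3)*' unpacks as many 3-char groups as there are.
sub hash_unique {
    my %seen;
    @seen{ unpack '(a3)*', $_[0] } = ();
    return length( $_[0] ) / 3 == keys %seen;
}

my %approach = ( regexp => \&regexp_unique, hash => \&hash_unique );
my $runs     = 20;    # arbitrary; raise it for steadier numbers

for my $last (qw( aaz abz adz ahz )) {
    my $str = join '', 'aaa' .. $last;    # doubles in length each pass
    my %per_call;
    for my $name ( keys %approach ) {
        my $start = time();
        $approach{$name}->($str) for 1 .. $runs;
        $per_call{$name} = ( time() - $start ) / $runs;
    }
    printf "%4d groups: regexp %.6f s, hash %.6f s per call\n",
        length($str) / 3, @per_call{qw( regexp hash )};
}
```

Since the string doubles each pass, a linear approach should roughly double its time per call, while anything quadratic or worse will grow much faster.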

Of course this is a contrived example, but aren't they all? ;) And I did have to modify the RE a little so that it would maintain proper framing. But the discussion caught my attention and I just had to prove to myself what I already suspected.
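For anyone wondering what "proper framing" buys you, here is a small illustration. The unframed pattern is a hypothetical example of mine, not necessarily what was posted earlier in the thread; the framed one is the pattern from the benchmark.

```perl
use strict;
use warnings;

# Groups are xab, cab, cyy: all different, but the substring 'abc'
# occurs twice if you ignore the three-character boundaries.
my $str = 'xabcabcyy';

my $unframed = qr/(.{3}).*\1/;                      # hypothetical, no alignment
my $framed   = qr/^(?:.{3})*(.{3})(?:.{3})*\1/;    # only compares aligned groups

print "unframed: ", ( $str =~ $unframed ? "duplicate" : "unique" ), "\n";
print "framed:   ", ( $str =~ $framed   ? "duplicate" : "unique" ), "\n";
```

The unframed pattern reports a false duplicate because it is free to capture 'abc' starting in the middle of one group, while the framed pattern only ever captures and compares at multiples of three characters.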

Dave

In reply to Re: Determining uniqueness in a string. by davido
in thread Determining uniqueness in a string. by Yzzyx
