http://www.perlmonks.org?node_id=802764

pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I want to use a regex to extract ALL character n-tuplets from a given input string irrespective of case. To do so, I have written the following little script:
#! /usr/bin/perl -w use strict; my $a = "Input String: aaAabBAAa"; my $n = shift @ARGV; $n --; while ( $a =~ /(.)\1{$n}/ig ) { print "$&\n"; pos( $a ) = pos( $a ) - length( $& ) +1; }

For $n = 2, the script provides the correct output:
aa aA Aa bB AA Aa
I am wondering if there is a more elegant way to do this, e.g. by setting a suitable modifier on the regex. The manual adjustment of the match position after each regex match strikes me as cumbersome.

Any suggestions for improvement?

Thanks for your help in advance and best regards -

Pat

Replies are listed 'Best First'.
Re: Is there an elegant way of resetting the match position?
by BrowserUk (Patriarch) on Oct 22, 2009 at 18:36 UTC

    Try

    [0] Perl> $s = "Input String: aaAabBAAa";; [0] Perl> $n=1; print "'$1'" while $s =~ m[(?=((.)\2{$n}))]ig;; 'aa' 'aA' 'Aa' 'bB' 'AA' 'Aa' [0] Perl> $n=2; print "'$1'" while $s =~ m[(?=((.)\2{$n}))]ig;; 'aaA' 'aAa' 'AAa'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      thanx ...
      DB<1> $s = "aaAabBAAa";; DB<2> $n=1; print "'$1'",pos($s) while $s =~ m[(?=((.)\2{$n}))]ig 'aa'0'aA'1'Aa'2'bB'4'AA'6'Aa'7 DB<3> $n=1; print "'$1$2'",pos($s) while $s =~ m[(.)(?=(\1{$n}))]ig 'aa'1'aA'2'Aa'3'bB'5'AA'7'Aa'8 DB<4> $n=2; print "'$1$2'",pos($s) while $s =~ m[(.)(?=(\1{$n}))]ig 'aaA'1'aAa'2'AAa'7

      ... fascinating to see capture work within a lookahead and to investigate how pos() behaves...

      Cheers Rolf

Re: Is there an elegant way of resetting the match position?
by bv (Friar) on Oct 22, 2009 at 17:04 UTC

    If you use a lookahead regex, you can avoid resetting the position after every match. But the real issue (in my mind) is that you use $&, which incurs a significant performance hit. Here's a solution that produces the same output using lookaheads and substr:

    #! /usr/bin/perl -w use strict; use warnings; my $a = "Input String: aaAabBAAa"; my $n = shift @ARGV; $n --; my $rx = qr/(.)(?=\1{$n})/i; while ( $a =~ /$rx/g ) { print substr( $a, pos($a)-1, $n+1), "\n"; }

    print pack("A25",pack("V*",map{1919242272+$_}(34481450,-49737472,6228,0,-285028276,6979,-1380265972)))
Re: Is there an elegant way of resetting the match position?
by LanX (Saint) on Oct 22, 2009 at 17:46 UTC
    try this

    lanx@nc10-ubuntu:~$ perl -de0 ... DB<1> $a="aaAabBAAa"; DB<2> print join " ",($a =~ /((.)\G\2|(.)\3)/gi) aa a aA a Aa A bB b AA A Aa A

    there should be a way to backreference matches without listing them afterwards, but I have no time to look it up.

    otherwise:

    DB<3> print $&," " while ($a =~ /(.)\G\1|(.)\2/gi) aa aA Aa bB AA Aa

    Hope it helps!

    Cheers Rolf

Re: Is there an elegant way of resetting the match position?
by gmargo (Hermit) on Oct 22, 2009 at 18:17 UTC

    My inclination was to use a second capture, and then decrement the position. What is the performance penalty for changing the search position? More than just saving an index value?

    #! /usr/bin/perl -w use strict; use warnings; my $a = "Input String: aaAabBAAa"; my $n = shift @ARGV; $n --; while ( $a =~ /((.)\2{$n})/ig ) { print "$1\n"; pos($a) -= $n; }

    Of course, this doesn't really answer the elegance question, darn it.

Re: Is there an elegant way of resetting the match position?
by GrandFather (Saint) on Oct 22, 2009 at 20:02 UTC

    Maybe you could generate all 'pairs' (which scales nicely to N) then filter them:

    use strict; use warnings; my $str = 'aaAabBAAa'; my $N = 2; print join "\n", grep {/^(.)\1+$/i} map {substr $str, $_, $N} 0 .. length ($str) - $N;

    Prints:

    aa aA Aa bB AA Aa

    Update: Changed to match N adjacent characters.


    True laziness is hard work
Re: Is there an elegant way of resetting the match position?
by JadeNB (Chaplain) on Oct 22, 2009 at 21:45 UTC

    The following seems to me to be linear-time *, as opposed to your algorithm, which looks quadratic (but eye-balling running times is notoriously unreliable, especially for me!):

    my @runs; my $pos = 0; my $b = $a; my $c = lc substr $b, 0, 1, ''; my $next; while ( $b ) { my $start = $pos; 1 while ++$pos and ( $next = lc substr $b, 0, 1, '' ) eq $c; for my $i ( $start .. $pos - $n ) { push @runs, substr $a, $i, $n; } $c = $next; }

    UPDATE: Wow, I completely misread! My original version just printed out all runs of a repeated character. I've fixed it now.
    * At the expense of not using regexes, which you specifically requested. Oh, well. Is there a reason that you prefer to use them?
    UPDATE 2: Whoops, I was eating one character too many each pass through the loop.
    UPDATE 3: OK, now I've actually tested it. Sorry about that. :-)