Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

RegEx + vs. {1,}

by Sewi (Friar)
on Oct 10, 2012 at 11:03 UTC ( #998200=perlquestion: print w/ replies, xml ) Need Help??
Sewi has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
a really simple RegEx problem is driving me crazy trying to find the most-often recurring char combination:
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x + =~ /((\w+).*?\2.*?)+/;' abcdefgxxabcdefgzzabcdsjfhkdfab ab
Simple, but I need to match only strings with a minimum of 2 chars:
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x + =~ /((\w{2,}).*?\2.*?)+/;' abcdefgxxabcdefgzzabcdsjfhkdfab abcdefg
Shows abcdefg but ab would be right (there is another abc behind the zz and another ab at the end).
perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $2 if $x + =~ /((\w{2,}?).*?\2.*?)+/;' abcdefgxxabcdefgzzabcdsjfhkdfab cd
doesn't work either, as ab would be right.
Why doesn't {2,} match ab?
Thanks, Sewi

Comment on RegEx + vs. {1,}
Select or Download Code
Re: RegEx + vs. {1,}
by grizzley (Chaplain) on Oct 10, 2012 at 12:01 UTC
    Because you are trying to match string which occurs twice and 'abcdefg' is first correct candidate. I can think only about such approach for your problem:
    $ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.*?\1){2}/;' abcdefgxxabcdefgzzabcdsjfhkdfab abcd $ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.*?\1){3}/;' abcdefgxxabcdefgzzabcdsjfhkdfab ab $ perl -le 'print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; print $1 if +$x =~ /(\w{2,})(.*?\1){4}/;' abcdefgxxabcdefgzzabcdsjfhkdfab
    I.e. The problem is you have to say exactly how many occurences you want (there is no 'greediness' in this case == you can't say "I want as many occurences as possible", only lowest possible number of occurences will be chosen).
        So if that's acceptable for you - use while loop to determine max amount of occurences. There will be no more than length / 2 occurences, so start with this max value and decrease it while trying to match:
        $x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x)/2); while($x !~ /(\w{2,})(.*?\1){$len}/) { $len-- }; $x =~ /(\w{2,})(.*?\1){$len}/; # 'strange line' print $1
        (to self: do not know why I have to add 'strange line', without it nothing is printed, but $len is correctly set to 4)

        I tried to generate the list and include it in one regexp:

        $ perl -le '$x = "abcdefgxxabcdefgzzabcdsjfhkdfab"; $len=int(length($x +)/2); $restring = join"|", map {"(?:.*?\\1){$_}"} reverse(1..$len); p +rint $restring; print $1 if $x =~ /(\w{2,})($restring)/;' (?:.*?\1){15}|(?:.*?\1){14}|(?:.*?\1){13}|(?:.*?\1){12}|(?:.*?\1){11}| +(?:.*?\1){10}|(?:.*?\1){9}|(?:.*?\1){8}|(?:.*?\1){7}|(?:.*?\1){6}|(?: +.*?\1){5}|(?:.*?\1){4}|(?:.*?\1){3}|(?:.*?\1){2}|(?:.*?\1){1} abcdefg
        but it does not work as expected (probably some stupid mistake, maybe someone else can tell what's wrong with it).
Re: RegEx + vs. {1,}
by trizen (Friar) on Oct 10, 2012 at 12:25 UTC
    my $string = 'abcdefgxxabcdefgzzabcdsjfhkdfab'; my %seen; 1 while $string =~ /(\w{2,}?)(?{$seen{$1}++})(?!)/g; print [sort { $seen{$b} <=> $seen{$a} || length($b) <=> length($a) } keys %seen]->[0];
    Update: Thanks Athanasius for your appreciation && corrections.

      Thank-you for this!

      I would have forced backtracking by setting pos:

      my %seen; for my $len (2 .. length $string) { ++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{ +$len})/g; }

      but this is clumsy and verbose beside your elegant one-liner!

      Took me a while to figure out what (?!) was doing. YAPE::Regex::Explain told me:

      ---------------------------------------------------------------------- (?{$seen{$1}++}) run this block of Perl code ---------------------------------------------------------------------- (?!) fail ----------------------------------------------------------------------

      and in Extended Patterns I found:

      (*FAIL) (*F) This pattern matches nothing and always fails. It can be used to force + the engine to backtrack. It is equivalent to (?!), but easier to rea +d. In fact, (?!) gets optimised into (*FAIL) internally.

      So I’ve learned something that’s really useful.

      Now a couple of minor quibbles:

      1. The non-greedy quantifier in (\w{2,}?) is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not.
      2. If, as per your sort, you are looking for only the shortest substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): (\w{2}). (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.)

      ++trizen for demonstrating this elegant technique to force regex backtracking!

      Update: Because (?!) forces backtracking via repeatedly-failing matches, neither the while loop nor the /g modifier are necessary:

      $string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;

      does the identical job! (as is easily confirmed by dumping the contents of %seen).

      Athanasius <°(((><contra mundum

Re: RegEx + vs. {1,}
by Anique (Acolyte) on Oct 10, 2012 at 12:26 UTC

    The problem is not with the (\w{2,}?), because it will happily match to ab in

    $x =~ /((\w{2,}?).*\g2.*?)+/;

    and

    $x =~ /((\w{2,}?).*?\g2.*?)+?/;

    You are asking the regex to give you a sequence which is repeated, and you can either get the longest match (greedy) or the shortest one (ungreedy)

    The only solution for your problem that I can think of, is to capture all matches that this regex finds (list context), then counting how often each captured sequence occurs, and selecting the one that occurs most often.

Re: RegEx + vs. {1,}
by MidLifeXis (Prior) on Oct 10, 2012 at 12:32 UTC

    Regexp::Debugger might be helpful to visualize what is happening here.

    --MidLifeXis

Re: RegEx + vs. {1,}
by ELISHEVA (Prior) on Oct 10, 2012 at 14:16 UTC

    If you want a list of all two letter patterns that appear at least twice somewhere in your string, you need to make three changes to your regex.

    1. you need to make (\w{2,}) non-greedy by adding a "?" to the end, e.g. (\w{2,}?).
    2. you need to wrap what comes after (\w{2,}?) with a zero width lookahead group. Otherwise you will miss all the matches between the first and second occurrence of "ab"
    3. you need to handle repetitions of your regex slightly differently. Instead of /( mumblefoo )+/ you need /mumblefoo/g. Using a + the way you did will only get you the last match found because each time the + causes the regex to repeat, it replaces the previous match.

    Taken together these changes will make your regex will look like this: /(\w{2,}?)(?=.*?\1)/g:

    print $x = "abcdefgxxabcdefgzzabcdsjfhkdfab", "\n"; print "<" . join('|',$x =~ /(\w{2,}?)(?=.*?\1)/g) , ">\n"; #outputs: <ab|cd|ef|ab|cd|ab>

    You can more info on zerolength lookaheads via the Extended Patterns section of the perlre manpage on perldoc

      Or, just remove the comma from the quantifier: \w{2}
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Indeed. There can never be a three character sequence in your string which occurs more frequently than a two character sequence. (Because the three character sequence contains two two character sequences.)

        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://998200]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (8)
As of 2014-08-01 09:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (258 votes), past polls