Beefy Boxes and Bandwidth Generously Provided by pair Networks Cowboy Neal with Hat
go ahead... be a heretic
 
PerlMonks  

Re^2: RegEx + vs. {1,}

by Athanasius (Prior)
on Oct 10, 2012 at 16:03 UTC ( #998250=note: print w/ replies, xml ) Need Help??


in reply to Re: RegEx + vs. {1,}
in thread RegEx + vs. {1,}

Thank-you for this!

I would have forced backtracking by setting pos:

my %seen; for my $len (2 .. length $string) { ++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{ +$len})/g; }

but this is clumsy and verbose beside your elegant one-liner!

Took me a while to figure out what (?!) was doing. YAPE::Regex::Explain told me:

---------------------------------------------------------------------- (?{$seen{$1}++}) run this block of Perl code ---------------------------------------------------------------------- (?!) fail ----------------------------------------------------------------------

and in Extended Patterns I found:

(*FAIL) (*F) This pattern matches nothing and always fails. It can be used to force + the engine to backtrack. It is equivalent to (?!), but easier to rea +d. In fact, (?!) gets optimised into (*FAIL) internally.

So I’ve learned something that’s really useful.

Now a couple of minor quibbles:

  1. The non-greedy quantifier in (\w{2,}?) is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not.
  2. If, as per your sort, you are looking for only the shortest substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): (\w{2}). (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.)

++trizen for demonstrating this elegant technique to force regex backtracking!

Update: Because (?!) forces backtracking via repeatedly-failing matches, neither the while loop nor the /g modifier are necessary:

$string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;

does the identical job! (as is easily confirmed by dumping the contents of %seen).

Athanasius <°(((><contra mundum


Comment on Re^2: RegEx + vs. {1,}
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://998250]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (8)
As of 2014-04-20 19:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (486 votes), past polls