my %seen;
for my $len (2 .. length $string)
{
++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{
+$len})/g;
}
but this is clumsy and verbose beside your elegant one-liner!
Took me a while to figure out what (?!) was doing. YAPE::Regex::Explain told me:
----------------------------------------------------------------------
(?{$seen{$1}++}) run this block of Perl code
----------------------------------------------------------------------
(?!) fail
----------------------------------------------------------------------
and in Extended Patterns I found:
(*FAIL) (*F)
This pattern matches nothing and always fails. It can be used to force
+ the engine to backtrack. It is equivalent to (?!), but easier to rea
+d. In fact, (?!) gets optimised into (*FAIL) internally.
So I’ve learned something that’s really useful.
Now a couple of minor quibbles:
- The non-greedy quantifier in (\w{2,}?) is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not.
- If, as per your sort, you are looking for only the shortest substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): (\w{2}). (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.)
++trizen for demonstrating this elegant technique to force regex backtracking!
Update: Because (?!) forces backtracking via repeatedly-failing matches, neither the while loop nor the /g modifier are necessary:
$string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;
does the identical job! (as is easily confirmed by dumping the contents of %seen).
Athanasius <°(((>< contra mundum
|