http://www.perlmonks.org?node_id=998215


in reply to RegEx + vs. {1,}

my $string = 'abcdefgxxabcdefgzzabcdsjfhkdfab'; my %seen; 1 while $string =~ /(\w{2,}?)(?{$seen{$1}++})(?!)/g; print [sort { $seen{$b} <=> $seen{$a} || length($b) <=> length($a) } keys %seen]->[0];
Update: Thanks Athanasius for your appreciation && corrections.

Replies are listed 'Best First'.
Re^2: RegEx + vs. {1,}
by Athanasius (Archbishop) on Oct 10, 2012 at 16:03 UTC

    Thank-you for this!

    I would have forced backtracking by setting pos:

    my %seen; for my $len (2 .. length $string) { ++$seen{$1}, pos($string) -= length($1) - 1 while $string =~ /(\w{ +$len})/g; }

    but this is clumsy and verbose beside your elegant one-liner!

    Took me a while to figure out what (?!) was doing. YAPE::Regex::Explain told me:

    ---------------------------------------------------------------------- (?{$seen{$1}++}) run this block of Perl code ---------------------------------------------------------------------- (?!) fail ----------------------------------------------------------------------

    and in Extended Patterns I found:

    (*FAIL) (*F) This pattern matches nothing and always fails. It can be used to force + the engine to backtrack. It is equivalent to (?!), but easier to rea +d. In fact, (?!) gets optimised into (*FAIL) internally.

    So I’ve learned something that’s really useful.

    Now a couple of minor quibbles:

    1. The non-greedy quantifier in (\w{2,}?) is redundant, since the forced backtracking ensures that all substrings will be found whether the search is greedy or not.
    2. If, as per your sort, you are looking for only the shortest substring with the maximum frequency, then you need only search for substrings exactly 2 characters long (the minimum length): (\w{2}). (Proof: Say XYZ occurs N times. XY and YZ occur once in each occurrence of XYZ, so each is guaranteed to occur at least N times in the string. Similar reasoning applies to longer substrings.)

    ++trizen for demonstrating this elegant technique to force regex backtracking!

    Update: Because (?!) forces backtracking via repeatedly-failing matches, neither the while loop nor the /g modifier are necessary:

    $string =~ /(\w{2,})(?{$seen{$1}++})(?!)/;

    does the identical job! (as is easily confirmed by dumping the contents of %seen).

    Athanasius <°(((><contra mundum