Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

match longest sequence in an "or" RE

by silentius (Scribe)
on Sep 24, 2011 at 16:40 UTC ( #927659=perlquestion: print w/replies, xml ) Need Help??
silentius has asked for the wisdom of the Perl Monks concerning the following question:

Much Revered Monks, I come humbly to your presence to ask for light in a regular expression issue,

I need to check whether a certain line matches any of a set of patterns.

I have: if ($line =~ $rex) { ... } where $rex = qr/^.+(s m e f|s f pl|s f).+$/;

How do I make it match the longest "or" possible? If "s f pl" is in the line I'd like that to be the match, and NOT "s f". I wrote the "or" patterns in decreasing order of length from left to right, but it is matching just "s f" even when it could match "s f pl".

Revered Monks, if you can shed any light on this, I'd be most grateful.

humbly, silentius

Replies are listed 'Best First'.
Re: match longest sequence in an "or" RE
by Corion (Pope) on Sep 24, 2011 at 16:43 UTC

    Please show more of your code. The following works for me as you seem to want:

    > perl -wle "$_='foo s f pl bar';print for /^.+(s m e f|s f pl|s f).+$ +/;" s f pl

    Most likely, $line does not contain what you think it contains, or maybe there is nothing after s f pl in your input string.

      You are right! There was nothing after the "s f pl".

      Thank you very much!

Re: match longest sequence in an "or" RE
by AnomalousMonk (Chancellor) on Sep 24, 2011 at 19:08 UTC
    ... match the longest "or" possible ...

    The Perl regex engine matches the leftmost longest substring for a given pattern; leftmost trumps longest. If you want to match the longest substring that occurs anywhere in the string (which, IIRC, is the POSIX definition of matching), you have to jump through a hoop or two:

    >perl -wMstrict -le "my $s = 'x s m e f s f pl s f x'; ;; $s =~ qr/^.+(s m e f|s f pl|s f).+$/; print qq{'$1'}; print qq{ -------}; ;; my $alt = qr{ s[ ]m[ ]e[ ]f | s[ ]f[ ]pl | s[ ]x }xms; for my $s ( 'x s m e f s f pl s f x', 'x s f s f pl s m e f x', 'x s f s m e f s f pl x', ) { my ($longest) = reverse sort { length($a) <=> length($b) } $s =~ m{ (?<= .) $alt (?= .) }xmsg ; print qq{'$longest'}; } " 's f' ------- 's m e f' 's m e f' 's m e f'

    I think there is another, possibly more succinct, solution using the 5.10+ Special Backtracking Control Verbs, but I can't remember it ATM.

    Update: Well, this doesn't use backtracking control verbs quite the way I imagined, but at least it gets rid of the list-making and sorting:

    >perl -wMstrict -le "my $alt = qr{ s[ ]m[ ]e[ ]f | s[ ]f[ ]pl | s[ ]f }xms; my $rex = qr{ (?<= .) $alt (?= .) }xms; ;; for my $s ( 'x s m e f s f pl s f x', 'x s f s f pl s m e f x', 'x s f s m e f s f pl x', 'x s f s f x', 'x s s x', 's f', 's m e f', ) { local our $longest; local our $offset; use re 'eval'; $s =~ m{ ($rex) (?{ ($longest, $offset) = ($1, $-[1]) if ! defined($longest) || length($1) > length($longest) }) (*SKIP)(*FAIL) }xmsg; print qq{'$s' -> '$longest' at $offset} if defined $longest; } " 'x s m e f s f pl s f x' -> 's m e f' at 3 'x s f s f pl s m e f x' -> 's m e f' at 16 'x s f s m e f s f pl x' -> 's m e f' at 8 'x s f s f x' -> 's f' at 3

    Note that in the string  'x  s f  s f  x' the left-most occurrence of 's f' is found. This can be changed to the right-most by changing the comparison operator in  length($1) > length($longest) to >= instead.

      I think one of the problems is the greedy .+ at the start of the regex. It would be better to leave it off or make it nongreedy. Then, just sorting by length decreasing makes the thing work again:

      >perl -wMstrict -lE "say /^.+?(s m e f|s f pl|s f).+$/ for '-s m e f s + f pl s f-';" s m e f

        It's necessary to leave off the first  .+ entirely and sort all the captured alternations by length decreasing (or some similar approach). Just making the first  .+ non-greedy is not sufficient to find the longest match anywhere in the string:

        >perl -wMstrict -lE "say q{'}, /^.+?(s m e f|s f pl|s f).+$/, q{' at offset }, $-[1] for '-s f s f pl s m e f-'; " 's f' at offset 1
      There is leftmost, and there is leftmost. Perl regular expressions do both. POSIX regular expressions do just one of them.

      Oh, which leftmosts are there? Well, there's the leftmost starting point in the target string, and there's the leftmost choice in alternation.

      POSIX regular expressions find the leftmost match in the target string, but from all possible matches from that starting point, it will return the longest.

Re: match longest sequence in an "or" RE
by Anonymous Monk on Sep 24, 2011 at 16:47 UTC
    That looks like it should work.
    $ say /^.+(s m e f|s f pl|s f).+$/ for "-s f-", "-s f pl-"; s f s f pl
      >perl -wMstrict -lE "say /^.+(s m e f|s f pl|s f).+$/ for '-s m e f s f pl s f-'; " s f
Re: match longest sequence in an "or" RE
by Khen1950fx (Canon) on Sep 25, 2011 at 00:49 UTC
    I couldn't get your regex to work for me, so I was thinking that, while regexen are usually fast and compact, using a regex here is actually slowing things down dramatically.

    I would use Algorithm::Diff::XS here. For example,

    #!/usr/bin/perl -l use strict; use warnings; use Algorithm::Diff::XS qw(LCS); my(@a) = split (//, q[a c s f 'pl' j k s], 0); my(@b)= split(//, q[m e f z x s f 'pl'], 0); print LCS( \@a,\@b );
    prints out s f 'pl', but it's a lot less work than using a regexp. It can be modified to fit into your decreasing order scheme.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://927659]
Approved by Perlbotics
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (8)
As of 2018-02-21 12:16 GMT
Find Nodes?
    Voting Booth?
    When it is dark outside I am happiest to see ...

    Results (279 votes). Check out past polls.